wiki:ai:dgx-spark-monitoring · last modified 2026/04/17 11:36 by swilson

  * **Grafana:** visualizes metrics in a live dashboard
  * **demo-load:** synthetic CPU + GPU load generator for testing the pipeline
  * Everything runs on the DGX Spark — Prometheus and Grafana run in Docker containers on the same machine

\\ \\
=====Step 1: SSH into the DGX Spark=====

From your local terminal, SSH into the Spark:
  ssh YOUR_USERNAME@YOUR_SPARK_IP

All steps below are run on the Spark unless noted otherwise.

\\ \\
=====Step 2: Install Build Dependencies=====

  sudo apt install build-essential libncurses-dev -y
  * **build-essential:** gcc, make, and standard C libraries
  * **libncurses-dev:** required for the terminal UI (ncursesw wide character support)

If already installed, apt will report "already the newest version" and exit cleanly — that is fine.

\\ \\
=====Step 3: Clone and Build nv-monitor=====

  cd ~
  git clone https://github.com/wentbackward/nv-monitor
  cd nv-monitor
  make

If the repo already exists from a previous run, git will print "destination path 'nv-monitor' already exists" and make will print "Nothing to be done for 'all'" — both are fine; the binary is already built.

Verify it works by launching the interactive TUI:
  ./nv-monitor
  
  * **GPU section:** utilization, temperature, power draw, clock speed
  * **Memory section:** used, buf/cache, swap
  * **VRAM:** shows "unified memory (shared with CPU)" on GB10 — this is expected: ''nvmlDeviceGetMemoryInfo'' returns NOT_SUPPORTED on the Grace-Blackwell unified memory architecture
  * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan)

\\ \\
=====Step 4: Run nv-monitor as a Prometheus Exporter=====

Start nv-monitor in headless mode with a Bearer token:
  cd ~/nv-monitor
  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &

Replace ''YOUR_SECRET_TOKEN'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
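If you do not have a token handy, one way to generate a random 32-character hex token is from ''/dev/urandom'' (a sketch; any sufficiently random string works):

```shell
# Generate a 32-character hex token (16 random bytes) to pass to the -t flag
token=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
echo "$token"
```

Using ''od'' to read exactly 16 bytes avoids truncating a pipe mid-stream, so the command also behaves under ''set -o pipefail''.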
  
====Flags explained====
  * **-n:** headless mode — no TUI, runs silently in the background
  * **-p 9101:** expose Prometheus metrics endpoint on port 9101
  * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request
  * **&:** run in background so the terminal stays free

On startup it prints:
  Prometheus metrics at http://0.0.0.0:9101/metrics
  Running headless (Ctrl+C to stop)

Verify it is working:
  curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10

You should see output starting with ''# HELP nv_build_info''.
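The endpoint serves the Prometheus text exposition format: ''# HELP''/''# TYPE'' comment lines followed by ''name value'' samples. A minimal sketch of pulling one value out with awk — the sample output here is inlined and hypothetical; real values come from the curl command above:

```shell
# Two hypothetical samples in Prometheus exposition format
metrics='# HELP nv_gpu_utilization_percent GPU utilization
# TYPE nv_gpu_utilization_percent gauge
nv_gpu_utilization_percent 42.0
nv_cpu_usage_percent 13.5'

# Print the value of a single metric by matching on the first field
gpu=$(printf '%s\n' "$metrics" | awk '$1 == "nv_gpu_utilization_percent" { print $2 }')
echo "GPU utilization: ${gpu}%"   # prints: GPU utilization: 42.0%
```

The same awk filter works on the live ''curl'' output if you want a one-number check from a script.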

\\ \\
=====Step 5: Create the Prometheus Configuration=====

  mkdir ~/monitoring
  cat > ~/monitoring/prometheus.yml << 'EOF'
  global:
    scrape_interval: 5s
  scrape_configs:
    - job_name: 'nv-monitor'
      authorization:
        credentials: 'YOUR_SECRET_TOKEN'
      static_configs:
        - targets: ['172.17.0.1:9101']
  EOF

Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4.

====Why 172.17.0.1 and not localhost?====
  * Docker containers have their own network namespace
  * ''localhost'' inside a container refers to the container itself, not the host machine
  * ''172.17.0.1'' is the Docker bridge gateway — the IP that containers use to reach the host
  * Verify the gateway IP on your system: ''docker network inspect bridge | grep Gateway''
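''docker network inspect'' emits JSON; the grep one-liner above just pulls the ''Gateway'' field out of it. A sketch of the same extraction against an abridged, hypothetical sample of that JSON:

```shell
# Abridged sample of: docker network inspect bridge
json='[{"Name":"bridge","IPAM":{"Config":[{"Subnet":"172.17.0.0/16","Gateway":"172.17.0.1"}]}}]'

# Extract the Gateway value (field 4 when splitting on double quotes)
gateway=$(printf '%s' "$json" | grep -o '"Gateway":"[^"]*"' | cut -d'"' -f4)
echo "$gateway"   # prints: 172.17.0.1
```

If the printed value is not ''172.17.0.1'' on your machine, use that value in the ''targets'' line instead.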
  
\\ \\
=====Step 6: Start Prometheus and Grafana in Docker=====

  docker run -d \

Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
  docker network create monitoring
  docker network connect monitoring prometheus
  docker network connect monitoring grafana

Verify both are healthy:
  docker ps
  curl -s localhost:9090/-/healthy
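If the health check is run immediately after starting the containers, Prometheus may not be listening yet. A small generic retry helper (plain shell; the curl invocation it wraps is the one above):

```shell
# Retry a command up to N times with a 1-second pause between attempts
wait_for() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" && return 0          # success: stop retrying
    i=$((i + 1))
    sleep 1
  done
  return 1                    # still failing after all attempts
}

# On the Spark (assumes the containers started above):
#   wait_for 10 curl -sf localhost:9090/-/healthy && echo "Prometheus is up"
```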

\\ \\
=====Step 7: Allow Docker Bridge to Reach nv-monitor=====

Docker containers live in the ''172.17.x.x'' subnet. The host firewall must allow them to reach port 9101.

**Note:** The DGX Spark does not have UFW installed. Use iptables directly:
  sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT

This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host).

====Note on SUDO POLICY VIOLATION broadcast messages====
The Spark has a sysadmin audit policy that broadcasts a message to all terminals when sudo is used. The command still executes — this is just a notification to the admin team. It is not an error.
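The ''/16'' in the rule means "match any source address whose first two octets are 172.17". A tiny illustration of that prefix match in plain shell (the addresses are examples):

```shell
# True when the address falls inside 172.17.0.0/16
in_bridge_subnet() {
  case "$1" in
    172.17.*) return 0 ;;
    *)        return 1 ;;
  esac
}

in_bridge_subnet 172.17.0.2  && echo "172.17.0.2: matched by the rule"
in_bridge_subnet 192.168.1.5 || echo "192.168.1.5: not matched"
```

The string match only works here because the mask is a whole number of octets; arbitrary masks need real bitwise arithmetic, which iptables does for you.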
  
\\ \\
=====Step 8: Access UIs from Your Mac via SSH Tunnel=====

SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale.

On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname):
  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP

Keep this terminal open. Then open in your Mac browser:
  * **Prometheus:** http://localhost:9090/targets
  * **Grafana:** http://localhost:3000

====Common mistake — running the tunnel from inside the Spark====
If you run the SSH tunnel command from a terminal that is already SSH'd into the Spark, it will SSH back to itself and fail with "Address already in use" — because ports 9090 and 3000 are already bound by the Docker containers on the Spark. Always run the tunnel from a Mac local terminal.

====Why SSH tunneling?====
  * Works over Tailscale without needing to open additional firewall ports
  * Traffic is encrypted by default
  * Easy to disconnect by closing the terminal
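Instead of retyping the ''-L'' flags each time, the tunnel can be stored in ''~/.ssh/config'' on the Mac (a sketch; the ''spark-monitor'' alias and the placeholders are illustrative, not from this setup):

```text
Host spark-monitor
    HostName YOUR_SPARK_IP
    User YOUR_USERNAME
    LocalForward 9090 localhost:9090
    LocalForward 3000 localhost:3000
```

Then ''ssh spark-monitor'' opens the same two forwards.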
  
\\ \\
=====Step 9: Verify Prometheus is Scraping=====

Open ''http://localhost:9090/targets'' in your browser.

You should see the **nv-monitor** job listed with:
  * State: **UP** (green)
  * Scrape duration: under 10ms (typically ~2ms)

If the state shows DOWN, see the Troubleshooting section.
  
\\ \\
=====Step 10: Configure Grafana=====

Open ''http://localhost:3000'' in your browser.

  * Login: **admin** / **admin**
  * Set a new password when prompted

====Add Prometheus as a data source====

====Why ''http://prometheus:9090'' works====
Both containers are on the same Docker network (''monitoring''). Docker provides DNS resolution between containers on the same network, so ''prometheus'' resolves to the Prometheus container's IP automatically. Using ''localhost:9090'' here would not work — it would refer to the Grafana container itself.
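As an alternative to clicking through the UI, Grafana can also pick the data source up from a provisioning file (a sketch using Grafana's standard provisioning mechanism; the file would have to be mounted into the container, which this guide does not do):

```yaml
# datasources.yml — mounted under /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

This is convenient if you later rebuild the Grafana container and do not want to re-enter the data source by hand.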
  
\\ \\
=====Step 11: Build the Dashboard=====

  - Click **Dashboards** → **New** → **New dashboard**
  - Click **+ Add visualization**
  - Add each panel below one at a time
  - For each panel: select the metric in the Builder tab, set the title in the right panel options, confirm the visualization type, then click **Back to dashboard**

====Dashboard panels====
  * **CPU Usage %** — metric: ''nv_cpu_usage_percent'' — type: Time series
  * **CPU Temperature** — metric: ''nv_cpu_temperature_celsius'' — type: Time series
  * **GPU Utilization %** — metric: ''nv_gpu_utilization_percent'' — type: Time series
  * **GPU Power (W)** — metric: ''nv_gpu_power_watts'' — type: Time series
  * **GPU Temperature** — metric: ''nv_gpu_temperature_celsius'' — type: Time series
  * **Memory Used** — metric: ''nv_memory_used_bytes'' — type: Gauge — unit: bytes (SI)

Save the dashboard. Set auto-refresh to **10s** using the dropdown next to the Refresh button.

====Important: select the correct data source when adding panels====
When adding each panel, confirm the Data source dropdown shows the Prometheus data source you configured (not the default placeholder). If a panel shows "No data", check this first.

====Panel shows No data====
  - Change the time range to **Last 5 minutes** and click **Run queries**
  - If still no data, click **Code** in the query editor and type the metric name directly, then run queries
  - The GPU utilization panel will show a flat 0% line at idle — that is correct, not missing data
  
\\ \\
=====Step 12: Load Test with demo-load=====

''demo-load'' is included in the nv-monitor repo and already built by ''make'' in Step 3.

  cd ~/nv-monitor
  ./demo-load --gpu

Expected output:
  Starting CPU load on 20 cores (sinusoidal, phased)
  Starting GPU load on 1 GPU (sinusoidal)
  Will stop in 5m 0s (Ctrl+C to stop early)
  GPU 0: calibrating... done
  GPU 0: load active

This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard — you should see all panels spike within a few seconds:
  * GPU Power: rises from ~4.5W idle to ~12W under load
  * CPU Usage %: cores hitting 80–100%
  * GPU Utilization: rises from 0%
  * CPU Temperature: climbs from ~45°C to ~70°C

Press **Ctrl+C** to stop early, or wait 5 minutes for it to finish automatically.
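"Sinusoidal" means the load level sweeps smoothly up and down rather than toggling on and off. An illustrative sketch of such a duty cycle (this is the general formula, not demo-load's actual code; the 60 s period is an assumption):

```shell
# Load fraction over one 60-second period: 0.5 + 0.5*sin(2*pi*t/60)
for t in 0 15 30 45; do
  awk -v t="$t" 'BEGIN {
    pi = 3.14159265358979
    printf "t=%2ds  load=%.2f\n", t, 0.5 + 0.5 * sin(2 * pi * t / 60)
  }'
done
```

The value ramps 0.50 → 1.00 → 0.50 → 0.00 across the period, which is why the Grafana panels show smooth waves rather than square pulses.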
  
\\ \\
nv-monitor and Docker containers do not auto-restart. To bring everything back:

**On the Spark:**

  cd ~/nv-monitor
  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
  docker start prometheus grafana

**On your Mac (new local terminal):**

  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP

Then open ''http://localhost:3000''.

====nv-monitor binary does not exist after git clone====
A file or directory named ''nv-monitor'' already existed in the home directory before cloning.

  rm -rf ~/nv-monitor
  git clone https://github.com/wentbackward/nv-monitor
  cd nv-monitor
  make

====Prometheus target shows DOWN — context deadline exceeded====
Apply both fixes:

**Fix 1** — Use the correct target IP in ''prometheus.yml''. The target must be the Docker bridge gateway, not localhost:
  targets: ['172.17.0.1:9101']

Find the correct gateway IP with: ''docker network inspect bridge | grep Gateway''

Then restart Prometheus: ''docker restart prometheus''

**Fix 2** — Allow Docker bridge through the firewall:
  sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT

====UFW command not found====
The DGX Spark does not have UFW installed. Use iptables directly (see Step 7).

====SSH tunnel fails with "Address already in use"====
You ran the tunnel command from inside an existing SSH session to the Spark. The Spark already has Docker containers binding ports 9090 and 3000. Open a new terminal on your Mac (the prompt must show your Mac hostname, not the Spark) and run the tunnel from there.

====Grafana cannot connect to Prometheus — "lookup prometheus: no such host"====
The containers are not on the same Docker network. Run:
  docker network create monitoring
  docker network connect monitoring prometheus
  docker network connect monitoring grafana

Then set the Grafana data source URL to ''http://prometheus:9090''.

====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000====
Docker's iptables rules can bypass UFW, making direct browser access unreliable. Use SSH tunneling instead (see Step 8).

====Grafana panel shows No data====
  - Check the Data source dropdown — must point to your configured Prometheus data source
  - Change time range to **Last 5 minutes** and click **Run queries**
  - Switch to **Code** mode and type the metric name directly
  - GPU utilization showing 0% at idle is correct — not an error
  
====Memory Used shows raw number like 4003753984====
No unit is set on the panel. Edit the panel → Standard options → Unit → select **bytes (SI)**.
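For reference, **bytes (SI)** scales by powers of 1000 (not 1024), so the raw value above lands at just over 4 GB. The same conversion in shell:

```shell
# Convert a raw byte count to SI gigabytes (1 GB = 1e9 bytes)
raw=4003753984
awk -v b="$raw" 'BEGIN { printf "%.2f GB\n", b / 1e9 }'   # prints: 4.00 GB
```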
  
====SUDO POLICY VIOLATION broadcast messages====
This is a sysadmin audit policy. The command still executes — the broadcast is just a notification to the admin team. It is not an error.
  
\\ \\
[[wiki:ai:home-page|AI Home]]