This shows you the differences between two versions of the page.
| wiki:ai:dgx-spark-monitoring [2026/04/17 11:00] – swilson | wiki:ai:dgx-spark-monitoring [2026/04/17 11:36] (current) – [Step 1: SSH into the DGX Spark] swilson | ||
|---|---|---|---|
| Line 7: | Line 7: | ||
| * **Grafana: | * **Grafana: | ||
| * **demo-load: | * **demo-load: | ||
| - | * Everything runs on the DGX Spark (spark02) | + | * Everything runs on the DGX Spark — Prometheus and Grafana run in Docker containers |
| \\ \\ | \\ \\ | ||
| - | =====Step 1 — SSH into the DGX Spark===== | + | =====Step 1: SSH into the DGX Spark===== |
| - | From your Mac terminal: | + | From your local terminal, SSH into the Spark: |
| - | + | ssh YOUR_USERNAME@YOUR_SPARK_IP | |
| - | ssh swilson@100.91.118.118 | + | |
| All steps below are run on the Spark unless noted otherwise. | All steps below are run on the Spark unless noted otherwise. | ||
| - | |||
| - | [SCREENSHOT: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 2 — Install Build Dependencies===== | + | =====Step 2: Install Build Dependencies===== |
| sudo apt install build-essential libncurses-dev -y | sudo apt install build-essential libncurses-dev -y | ||
| Line 28: | Line 25: | ||
| * **libncurses-dev: | * **libncurses-dev: | ||
| - | Both packages were already installed on spark02 (12.10ubuntu1 and 6.4+20240113). | + | If already installed, apt will report |
| \\ \\ | \\ \\ | ||
| - | =====Step 3 — Clone and Build nv-monitor===== | + | =====Step 3: Clone and Build nv-monitor===== |
| cd ~ | cd ~ | ||
| Line 41: | Line 38: | ||
| Verify it works by launching the interactive TUI: | Verify it works by launching the interactive TUI: | ||
| - | |||
| ./ | ./ | ||
| Line 50: | Line 46: | ||
| * **GPU section:** utilization, | * **GPU section:** utilization, | ||
| * **Memory section:** used, buf/cache, swap | * **Memory section:** used, buf/cache, swap | ||
| - | * **VRAM:** shows " | + | * **VRAM:** shows " |
| * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan) | * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan) | ||
| - | |||
| - | [SCREENSHOT: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 4 — Run nv-monitor as a Prometheus Exporter===== | + | =====Step 4: Run nv-monitor as a Prometheus Exporter===== |
| Start nv-monitor in headless mode with a Bearer token: | Start nv-monitor in headless mode with a Bearer token: | ||
| - | |||
| cd ~/ | cd ~/ | ||
| - | ./ | + | ./ |
| + | |||
| + | Replace '' | ||
| ====Flags explained==== | ====Flags explained==== | ||
| - | * **-n:** headless mode — no TUI, runs silently in background | + | * **-n:** headless mode — no TUI, runs silently in the background |
| * **-p 9101:** expose Prometheus metrics endpoint on port 9101 | * **-p 9101:** expose Prometheus metrics endpoint on port 9101 | ||
| - | * **-t my-secret-token:** require this Bearer token on every request | + | * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request |
| * **&:** run in background so the terminal stays free | * **&:** run in background so the terminal stays free | ||
| Line 74: | Line 69: | ||
| Verify it is working: | Verify it is working: | ||
| - | + | | |
| - | | + | |
| You should see output starting with ''# | You should see output starting with ''# | ||
| - | |||
| - | [SCREENSHOT: | ||
| ====Available nv-monitor metrics==== | ====Available nv-monitor metrics==== | ||
| Line 92: | Line 84: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 5 — Create the Prometheus Configuration===== | + | =====Step 5: Create the Prometheus Configuration===== |
| mkdir ~/ | mkdir ~/ | ||
| Line 99: | Line 91: | ||
| global: | global: | ||
| scrape_interval: | scrape_interval: | ||
| - | |||
| scrape_configs: | scrape_configs: | ||
| - job_name: ' | - job_name: ' | ||
| authorization: | authorization: | ||
| - | credentials: | + | credentials: |
| static_configs: | static_configs: | ||
| - targets: [' | - targets: [' | ||
| EOF | EOF | ||
| + | |||
| + | Replace '' | ||
| ====Why 172.17.0.1 and not localhost? | ====Why 172.17.0.1 and not localhost? | ||
| * Docker containers have their own network namespace | * Docker containers have their own network namespace | ||
| * '' | * '' | ||
| - | * '' | + | * '' |
| - | * Find it with: '' | + | * Verify the gateway IP on your system: '' |
| \\ \\ | \\ \\ | ||
| - | =====Step 6 — Start Prometheus and Grafana in Docker===== | + | =====Step 6: Start Prometheus and Grafana in Docker===== |
| docker run -d \ | docker run -d \ | ||
| Line 129: | Line 122: | ||
| Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: | Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: | ||
| - | |||
| docker network create monitoring | docker network create monitoring | ||
| docker network connect monitoring prometheus | docker network connect monitoring prometheus | ||
| Line 135: | Line 127: | ||
| Verify both are healthy: | Verify both are healthy: | ||
| - | |||
| docker ps | docker ps | ||
| curl -s localhost: | curl -s localhost: | ||
| Line 142: | Line 133: | ||
| Expected responses: | Expected responses: | ||
| * '' | * '' | ||
| - | * '' | + | * '' |
| \\ \\ | \\ \\ | ||
| - | =====Step 7 — Allow Docker Bridge to Reach nv-monitor===== | + | =====Step 7: Allow Docker Bridge to Reach nv-monitor===== |
| - | Docker containers live in the '' | + | Docker containers live in the '' |
| - | + | ||
| - | **Note:** spark02 does not have UFW installed. Use iptables directly: | + | |
| + | **Note:** The DGX Spark does not have UFW installed. Use iptables directly: | ||
| sudo iptables -I INPUT -s 172.17.0.0/ | sudo iptables -I INPUT -s 172.17.0.0/ | ||
| - | This is the critical rule that allows Prometheus to scrape nv-monitor. | + | This is the critical rule that allows Prometheus |
| - | ====SUDO POLICY VIOLATION broadcast messages==== | + | ====Note on SUDO POLICY VIOLATION broadcast messages==== |
| - | spark02 | + | The Spark has a sysadmin audit policy that broadcasts a message to all terminals when sudo is used. The command still executes — this is just a notification to the admin team. It is not an error. |
| \\ \\ | \\ \\ | ||
| - | =====Step 8 — Access UIs from Your Mac via SSH Tunnel===== | + | =====Step 8: Access UIs from Your Mac via SSH Tunnel===== |
| - | SSH port forwarding is simpler, more secure, and works over Tailscale | + | SSH port forwarding |
| - | On your **Mac**, open a **new local terminal** (not an SSH session — the prompt must show your Mac hostname, not spark02): | + | On your **Mac**, open a **new local terminal** (not an SSH session |
| - | + | ssh -L 9090: | |
| - | ssh -L 9090: | + | |
| Keep this terminal open. Then open in your Mac browser: | Keep this terminal open. Then open in your Mac browser: | ||
| Line 172: | Line 161: | ||
| ====Common mistake — running the tunnel from inside the Spark==== | ====Common mistake — running the tunnel from inside the Spark==== | ||
| - | If you run the SSH tunnel command from a terminal that is already SSH'd into spark02, it will SSH back to itself and fail with " | + | If you run the SSH tunnel command from a terminal that is already SSH'd into the Spark, it will SSH back to itself and fail with " |
| ====Why SSH tunneling? | ====Why SSH tunneling? | ||
| - | * Works over Tailscale without needing to open firewall ports | + | * Works over Tailscale without needing to open additional |
| - | * Encrypted | + | * Traffic is encrypted |
| - | * No additional firewall rules needed | + | |
| * Easy to disconnect by closing the terminal | * Easy to disconnect by closing the terminal | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 9 — Verify Prometheus is Scraping===== | + | =====Step 9: Verify Prometheus is Scraping===== |
| Open '' | Open '' | ||
| - | You should see the **nv-monitor** job with state **UP** | + | You should see the **nv-monitor** job listed |
| + | * State: | ||
| + | * Scrape | ||
| - | [SCREENSHOT: | + | If the state shows DOWN, see the Troubleshooting section. |
| \\ \\ | \\ \\ | ||
| - | =====Step 10 — Configure Grafana===== | + | =====Step 10: Configure Grafana===== |
| Open '' | Open '' | ||
| * Login: **admin** / **admin** | * Login: **admin** / **admin** | ||
| - | * Change | + | * Set a new password when prompted |
| ====Add Prometheus as a data source==== | ====Add Prometheus as a data source==== | ||
| Line 205: | Line 195: | ||
| - Click **Save & test** | - Click **Save & test** | ||
| - You should see: **Successfully queried the Prometheus API** | - You should see: **Successfully queried the Prometheus API** | ||
| - | |||
| - | [SCREENSHOT: | ||
| ====Why '' | ====Why '' | ||
| - | Both containers are on the same Docker network ('' | + | Both containers are on the same Docker network ('' |
| \\ \\ | \\ \\ | ||
| - | =====Step 11 — Build the Dashboard===== | + | =====Step 11: Build the Dashboard===== |
| - | - Click **Dashboards** → **New** → **New dashboard** | + | - Click **Dashboards** → **New** → **New dashboard** |
| + | - Click **+ Add visualization** | ||
| - Add each panel below one at a time | - Add each panel below one at a time | ||
| - | - For each panel: select metric in Builder, set title in right panel options, confirm | + | - For each panel: select |
| ====Dashboard panels==== | ====Dashboard panels==== | ||
| Line 226: | Line 215: | ||
| * **Memory Used** — metric: '' | * **Memory Used** — metric: '' | ||
| - | Save the dashboard | + | Save the dashboard. Set auto-refresh to **10s** |
| - | ====Important: | + | ====Important: |
| - | When adding | + | When adding |
| ====Panel shows No data==== | ====Panel shows No data==== | ||
| - | Switch | + | - Change the time range to **Last 5 minutes** and click **Run queries** |
| + | - If still no data, click **Code** in the query editor and type the metric name directly, then run queries | ||
| + | - The GPU utilization panel will show a flat 0% line at idle — that is correct, not missing data | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 12 — Load Test with demo-load===== | + | =====Step 12: Load Test with demo-load===== |
| - | demo-load is included in the nv-monitor repo and built by default with '' | + | '' |
| cd ~/ | cd ~/ | ||
| ./demo-load --gpu | ./demo-load --gpu | ||
| - | Output: | + | Expected output: |
| Starting CPU load on 20 cores (sinusoidal, | Starting CPU load on 20 cores (sinusoidal, | ||
| Starting GPU load on 1 GPU (sinusoidal) | Starting GPU load on 1 GPU (sinusoidal) | ||
| Will stop in 5m 0s (Ctrl+C to stop early) | Will stop in 5m 0s (Ctrl+C to stop early) | ||
| - | GPU 0: calibrating... done (kernel=0.01ms, | + | GPU 0: calibrating... done |
| GPU 0: load active | GPU 0: load active | ||
| - | This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard | + | This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard |
| - | + | | |
| - | [SCREENSHOT: demo-load terminal output showing GPU 0: load active] | + | * CPU Usage %: cores hitting 80–100% |
| - | + | * GPU Utilization: rises from 0% | |
| - | [SCREENSHOT: | + | * CPU Temperature: climbs |
| Press **Ctrl+C** to stop early, or wait 5 minutes for it to finish automatically. | Press **Ctrl+C** to stop early, or wait 5 minutes for it to finish automatically. | ||
| Line 262: | Line 253: | ||
| nv-monitor and Docker containers do not auto-restart. To bring everything back: | nv-monitor and Docker containers do not auto-restart. To bring everything back: | ||
| - | **On spark02:** | + | **On the Spark:** |
| cd ~/ | cd ~/ | ||
| - | ./ | + | ./ |
| docker start prometheus grafana | docker start prometheus grafana | ||
| **On your Mac (new local terminal): | **On your Mac (new local terminal): | ||
| - | ssh -L 9090: | + | ssh -L 9090: |
| Then open '' | Then open '' | ||
| Line 287: | Line 278: | ||
| ====nv-monitor binary does not exist after git clone==== | ====nv-monitor binary does not exist after git clone==== | ||
| - | A file named '' | + | A file or directory |
| - | rm nv-monitor | + | rm -rf ~/nv-monitor |
| git clone https:// | git clone https:// | ||
| cd nv-monitor | cd nv-monitor | ||
| Line 295: | Line 286: | ||
| ====Prometheus target shows DOWN — context deadline exceeded==== | ====Prometheus target shows DOWN — context deadline exceeded==== | ||
| - | Two causes, apply both fixes: | + | Apply both fixes: |
| - | + | ||
| - | **Fix 1** — Use the correct target IP in '' | + | |
| + | **Fix 1** — Use the correct target IP in '' | ||
| targets: [' | targets: [' | ||
| - | Then restart: '' | + | Find the correct gateway IP with: '' |
| - | **Fix 2** — Allow Docker bridge through the firewall (spark02 uses iptables, not UFW): | + | Then restart Prometheus: '' |
| + | **Fix 2** — Allow Docker bridge through the firewall: | ||
| sudo iptables -I INPUT -s 172.17.0.0/ | sudo iptables -I INPUT -s 172.17.0.0/ | ||
| ====UFW command not found==== | ====UFW command not found==== | ||
| - | spark02 | + | The DGX Spark does not have UFW installed. Use iptables directly (see Step 7). |
| ====SSH tunnel fails with " | ====SSH tunnel fails with " | ||
| - | You ran the tunnel command from inside an existing SSH session to spark02 instead of from a Mac local terminal. Open a new terminal on your Mac (prompt | + | You ran the tunnel command from inside an existing SSH session to the Spark. The Spark already has Docker containers binding ports 9090 and 3000. Open a new terminal on your Mac (prompt |
| - | + | ||
| - | ====Grafana cannot connect to Prometheus — lookup prometheus: no such host==== | + | |
| - | Containers are not on the same Docker network. | + | |
| + | ====Grafana cannot connect to Prometheus — " | ||
| + | The containers are not on the same Docker network. Run: | ||
| docker network create monitoring | docker network create monitoring | ||
| docker network connect monitoring prometheus | docker network connect monitoring prometheus | ||
| docker network connect monitoring grafana | docker network connect monitoring grafana | ||
| - | Then set Grafana data source URL to '' | + | Then set the Grafana data source URL to '' |
| ====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000==== | ====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000==== | ||
| - | Docker | + | Docker' |
| - | + | ||
| - | ssh -L 9090: | + | |
| ====Grafana panel shows No data==== | ====Grafana panel shows No data==== | ||
| - | 1. Check the Data source dropdown — must be **prometheus-10**, | + | 1. Check the Data source dropdown — must point to your configured Prometheus data source |
| 2. Change time range to **Last 5 minutes** and click **Run queries** | 2. Change time range to **Last 5 minutes** and click **Run queries** | ||
| 3. Switch to **Code** mode and type the metric name directly | 3. Switch to **Code** mode and type the metric name directly | ||
| - | 4. GPU utilization | + | 4. GPU utilization |
| - | ====Memory Used shows raw number like 4003753984==== | + | ====Memory Used shows a raw number like 4003753984==== |
| - | No unit set on the panel. Edit panel → Standard options → Unit → **bytes (SI)**. | + | No unit is set on the panel. Edit the panel → Standard options → Unit → select |
| ====SUDO POLICY VIOLATION broadcast messages==== | ====SUDO POLICY VIOLATION broadcast messages==== | ||
| - | This is a sysadmin audit policy | + | This is a sysadmin audit policy. |
| \\ \\ | \\ \\ | ||
| [[wiki: | [[wiki: | ||
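
As a supplement to Step 5, the scrape configuration can be rendered from a small shell function so the token and target are not hand-edited each time. This is only a sketch under stated assumptions: ''render_prometheus_config'' is a hypothetical helper (not part of nv-monitor or Prometheus), the job name ''nv-monitor'' is taken from Step 9, the default target combines the bridge gateway ''172.17.0.1'' from Step 5 with port 9101 from Step 4, and the ''10s'' scrape interval is an assumed value that should be replaced with whatever your configuration actually uses.

```shell
#!/usr/bin/env bash
# Sketch: print a Prometheus scrape config for nv-monitor to stdout.
# render_prometheus_config is a hypothetical helper written for this page.

render_prometheus_config() {
  local token="$1"                      # Bearer token given to nv-monitor with -t
  local target="${2:-172.17.0.1:9101}"  # Docker bridge gateway + nv-monitor port
  local interval="${3:-10s}"            # ASSUMPTION: adjust to your own interval

  # Refuse to write a config without a token; the scrape would be rejected.
  if [ -z "$token" ]; then
    echo "usage: render_prometheus_config TOKEN [HOST:PORT] [INTERVAL]" >&2
    return 1
  fi

  # The token is emitted inside single quotes, so it must not contain one.
  cat <<EOF
global:
  scrape_interval: ${interval}
scrape_configs:
  - job_name: 'nv-monitor'
    authorization:
      type: Bearer
      credentials: '${token}'
    static_configs:
      - targets: ['${target}']
EOF
}
```

A typical call would be ''render_prometheus_config YOUR_SECRET_TOKEN'' with the output redirected to the configuration file created in Step 5. If the bridge gateway on your system differs from ''172.17.0.1'', pass it as the second argument, for example ''render_prometheus_config YOUR_SECRET_TOKEN 172.18.0.1:9101''. Restart the Prometheus container afterwards so it reloads the file.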