This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| wiki:ai:dgx-spark-monitoring [2026/04/17 09:56] – created swilson | wiki:ai:dgx-spark-monitoring [2026/04/17 11:36] (current) – [Step 1: SSH into the DGX Spark] swilson | ||
|---|---|---|---|
| Line 7: | Line 7: | ||
| * **Grafana: | * **Grafana: | ||
| * **demo-load: | * **demo-load: | ||
| - | * Everything runs on the DGX Spark (spark02) | + | * Everything runs on the DGX Spark — Prometheus and Grafana run in Docker containers |
| \\ \\ | \\ \\ | ||
| - | =====Step 1 — SSH into the DGX Spark===== | + | =====Step 1: SSH into the DGX Spark===== |
| - | From your Mac terminal: | + | From your local terminal, SSH into the Spark: |
| - | + | ssh YOUR_USERNAME@YOUR_SPARK_IP | |
| - | ssh swilson@100.91.118.118 | + | |
| All steps below are run on the Spark unless noted otherwise. | All steps below are run on the Spark unless noted otherwise. | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 2 — Install Build Dependencies===== | + | =====Step 2: Install Build Dependencies===== |
| sudo apt install build-essential libncurses-dev -y | sudo apt install build-essential libncurses-dev -y | ||
| Line 25: | Line 24: | ||
| * **build-essential: | * **build-essential: | ||
| * **libncurses-dev: | * **libncurses-dev: | ||
| + | |||
| + | If already installed, apt will report " | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 3 — Clone and Build nv-monitor===== | + | =====Step 3: Clone and Build nv-monitor===== |
| cd ~ | cd ~ | ||
| Line 34: | Line 35: | ||
| make | make | ||
| - | Verify it works by launching | + | If the repo already exists from a previous run, git will print " |
| + | Verify it works by launching the interactive TUI: | ||
| ./ | ./ | ||
| Line 44: | Line 46: | ||
| * **GPU section:** utilization, | * **GPU section:** utilization, | ||
| * **Memory section:** used, buf/cache, swap | * **Memory section:** used, buf/cache, swap | ||
| - | * **GPU Processes:** PID, user, type (C=compute, G=graphics), CPU%, GPU memory, command | + | * **VRAM:** shows " |
| * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan) | * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan) | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 4 — Run nv-monitor as a Prometheus Exporter===== | + | =====Step 4: Run nv-monitor as a Prometheus Exporter===== |
| Start nv-monitor in headless mode with a Bearer token: | Start nv-monitor in headless mode with a Bearer token: | ||
| - | |||
| cd ~/ | cd ~/ | ||
| - | ./ | + | ./ |
| + | |||
| + | Replace '' | ||
| ====Flags explained==== | ====Flags explained==== | ||
| - | * **-n:** headless mode — no TUI, runs silently in background | + | * **-n:** headless mode — no TUI, runs silently in the background |
| * **-p 9101:** expose Prometheus metrics endpoint on port 9101 | * **-p 9101:** expose Prometheus metrics endpoint on port 9101 | ||
| - | * **-t my-secret-token:** require this Bearer token on every request | + | * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request |
| * **&:** run in background so the terminal stays free | * **&:** run in background so the terminal stays free | ||
| - | Verify | + | On startup |
| + | Prometheus metrics at http:// | ||
| + | Running headless (Ctrl+C to stop) | ||
| - | | + | Verify it is working: |
| + | | ||
| You should see output starting with ''# | You should see output starting with ''# | ||
| Line 78: | Line 84: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 5 — Create the Prometheus Configuration===== | + | =====Step 5: Create the Prometheus Configuration===== |
| mkdir ~/ | mkdir ~/ | ||
| Line 85: | Line 91: | ||
| global: | global: | ||
| scrape_interval: | scrape_interval: | ||
| - | |||
| scrape_configs: | scrape_configs: | ||
| - job_name: ' | - job_name: ' | ||
| authorization: | authorization: | ||
| - | credentials: | + | credentials: |
| static_configs: | static_configs: | ||
| - targets: [' | - targets: [' | ||
| EOF | EOF | ||
| + | |||
| + | Replace '' | ||
| ====Why 172.17.0.1 and not localhost? | ====Why 172.17.0.1 and not localhost? | ||
| * Docker containers have their own network namespace | * Docker containers have their own network namespace | ||
| * '' | * '' | ||
| - | * '' | + | * '' |
| - | * Find it with: '' | + | * Verify the gateway IP on your system: '' |
| \\ \\ | \\ \\ | ||
| - | =====Step 6 — Start Prometheus and Grafana in Docker===== | + | =====Step 6: Start Prometheus and Grafana in Docker===== |
| docker run -d \ | docker run -d \ | ||
| Line 115: | Line 122: | ||
| Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: | Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: | ||
| - | |||
| docker network create monitoring | docker network create monitoring | ||
| docker network connect monitoring prometheus | docker network connect monitoring prometheus | ||
| Line 121: | Line 127: | ||
| Verify both are healthy: | Verify both are healthy: | ||
| - | |||
| docker ps | docker ps | ||
| curl -s localhost: | curl -s localhost: | ||
| Line 131: | Line 136: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 7 — Allow Docker Bridge to Reach nv-monitor===== | + | =====Step 7: Allow Docker Bridge to Reach nv-monitor===== |
| + | |||
| + | Docker containers live in the '' | ||
| - | Docker containers live in the '' | + | **Note:** The DGX Spark does not have UFW installed. Use iptables directly: |
| + | sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport | ||
| - | sudo ufw allow from 172.17.0.0/16 to any port 9101 | + | This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host). |
| - | This is the critical rule that allows Prometheus | + | ====Note on SUDO POLICY VIOLATION broadcast messages==== |
| + | The Spark has a sysadmin audit policy | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 8 — Access UIs from Your Mac via SSH Tunnel===== | + | =====Step 8: Access UIs from Your Mac via SSH Tunnel===== |
| - | Docker' | + | SSH port forwarding |
| - | On your **Mac**, open a new terminal: | + | On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname): |
| + | ssh -L 9090: | ||
| - | ssh -L 9090: | + | Keep this terminal open. Then open in your Mac browser: |
| - | + | ||
| - | Then open in your Mac browser: | + | |
| * **Prometheus: | * **Prometheus: | ||
| * **Grafana: | * **Grafana: | ||
| + | |||
| + | ====Common mistake — running the tunnel from inside the Spark==== | ||
| + | If you run the SSH tunnel command from a terminal that is already SSH'd into the Spark, it will SSH back to itself and fail with " | ||
| ====Why SSH tunneling? | ====Why SSH tunneling? | ||
| - | * Works over Tailscale without needing to open firewall ports | + | * Works over Tailscale without needing to open additional |
| - | * Encrypted | + | * Traffic is encrypted |
| - | * No additional firewall rules needed | + | |
| * Easy to disconnect by closing the terminal | * Easy to disconnect by closing the terminal | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 9 — Verify Prometheus is Scraping===== | + | =====Step 9: Verify Prometheus is Scraping===== |
| Open '' | Open '' | ||
| - | You should see the **nv-monitor** job with state **UP** | + | You should see the **nv-monitor** job listed |
| + | * State: | ||
| + | * Scrape | ||
| + | |||
| + | If the state shows DOWN, see the Troubleshooting section. | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 10 — Configure Grafana===== | + | =====Step 10: Configure Grafana===== |
| Open '' | Open '' | ||
| * Login: **admin** / **admin** | * Login: **admin** / **admin** | ||
| - | * Change | + | * Set a new password when prompted |
| ====Add Prometheus as a data source==== | ====Add Prometheus as a data source==== | ||
| Line 183: | Line 197: | ||
| ====Why '' | ====Why '' | ||
| - | Both containers are on the same Docker network ('' | + | Both containers are on the same Docker network ('' |
| \\ \\ | \\ \\ | ||
| - | =====Step 11 — Build the Dashboard===== | + | =====Step 11: Build the Dashboard===== |
| - Click **Dashboards** → **New** → **New dashboard** | - Click **Dashboards** → **New** → **New dashboard** | ||
| - Click **+ Add visualization** | - Click **+ Add visualization** | ||
| - Add each panel below one at a time | - Add each panel below one at a time | ||
| + | - For each panel: select the metric in the Builder tab, set the title in the panel options on the right, confirm the visualization type, then click **Back to dashboard** | ||
| ====Dashboard panels==== | ====Dashboard panels==== | ||
| + | * **CPU Usage %** — metric: '' | ||
| + | * **CPU Temperature** — metric: '' | ||
| * **GPU Utilization %** — metric: '' | * **GPU Utilization %** — metric: '' | ||
| * **GPU Power (W)** — metric: '' | * **GPU Power (W)** — metric: '' | ||
| * **GPU Temperature** — metric: '' | * **GPU Temperature** — metric: '' | ||
| - | * **CPU Usage %** — metric: '' | ||
| - | * **CPU Temperature** — metric: '' | ||
| * **Memory Used** — metric: '' | * **Memory Used** — metric: '' | ||
| - | Save the dashboard | + | Save the dashboard. Set auto-refresh to **10s** |
| + | |||
| + | ====Important: | ||
| + | When adding each panel, confirm the Data source dropdown shows the Prometheus data source you configured (not the default placeholder). If a panel shows "No data", check this first. | ||
| + | |||
| + | ====Panel shows No data==== | ||
| + | - Change the time range to **Last 5 minutes** and click **Run queries** | ||
| + | - If still no data, click **Code** in the query editor and type the metric name directly, then run queries | ||
| + | - The GPU utilization panel will show a flat 0% line at idle — that is correct, not missing data | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 12 — Load Test with demo-load===== | + | =====Step 12: Load Test with demo-load===== |
| - | Build the synthetic | + | '' |
| cd ~/ | cd ~/ | ||
| - | make demo-load | ||
| ./demo-load --gpu | ./demo-load --gpu | ||
| - | This generates sinusoidal | + | Expected output: |
| + | Starting | ||
| + | Starting | ||
| + | Will stop in 5m 0s (Ctrl+C to stop early) | ||
| + | GPU 0: calibrating... done | ||
| + | GPU 0: load active | ||
| - | Verify | + | This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard — you should see all panels spike within a few seconds: |
| + | * GPU Power: rises from ~4.5W idle to ~12W under load | ||
| + | * CPU Usage %: cores hitting 80–100% | ||
| + | * GPU Utilization: | ||
| + | * CPU Temperature: climbs from ~45°C to ~70°C+ | ||
| - | nvidia-smi | + | Press **Ctrl+C** to stop early, or wait 5 minutes for it to finish automatically. |
| - | + | ||
| - | Expected output shows: | + | |
| - | * GPU-Util: ~40% | + | |
| - | * Temperature: | + | |
| - | * Power: ~17W | + | |
| - | * Process: '' | + | |
| - | + | ||
| - | Press **Ctrl+C** to stop the load. | + | |
| \\ \\ | \\ \\ | ||
| Line 230: | Line 253: | ||
| Neither nv-monitor nor the Docker containers restart automatically. To bring everything back up: | Neither nv-monitor nor the Docker containers restart automatically. To bring everything back up: | ||
| - | **On spark02:** | + | **On the Spark:** |
| cd ~/ | cd ~/ | ||
| - | ./ | + | ./ |
| docker start prometheus grafana | docker start prometheus grafana | ||
| - | **On your Mac (new terminal): | + | **On your Mac (new local terminal): |
| - | ssh -L 9090: | + | ssh -L 9090: |
| Then open '' | Then open '' | ||
| Line 255: | Line 278: | ||
| ====nv-monitor binary does not exist after git clone==== | ====nv-monitor binary does not exist after git clone==== | ||
| - | A file named '' | + | A file or directory |
| - | rm nv-monitor | + | rm -rf ~/nv-monitor |
| git clone https:// | git clone https:// | ||
| cd nv-monitor | cd nv-monitor | ||
| Line 263: | Line 286: | ||
| ====Prometheus target shows DOWN — context deadline exceeded==== | ====Prometheus target shows DOWN — context deadline exceeded==== | ||
| - | Two causes, apply both fixes: | + | Apply both fixes: |
| - | + | ||
| - | **Fix 1** — Use the correct target IP in '' | + | |
| + | **Fix 1** — Use the correct target IP in '' | ||
| targets: [' | targets: [' | ||
| - | Then restart: '' | + | Find the correct gateway IP with: '' |
| + | |||
| + | Then restart | ||
| **Fix 2** — Allow Docker bridge through the firewall: | **Fix 2** — Allow Docker bridge through the firewall: | ||
| + | sudo iptables -I INPUT -s 172.17.0.0/ | ||
| - | sudo ufw allow from 172.17.0.0/16 to any port 9101 | + | ====UFW command not found==== |
| + | The DGX Spark does not have UFW installed. Use iptables directly (see Step 7). | ||
| - | ====Grafana cannot connect to Prometheus — lookup prometheus: no such host==== | + | ====SSH tunnel fails with " |
| - | Containers are not on the same Docker | + | You ran the tunnel command from inside an existing SSH session to the Spark. The Spark already has Docker |
| + | ====Grafana cannot connect to Prometheus — " | ||
| + | The containers are not on the same Docker network. Run: | ||
| docker network create monitoring | docker network create monitoring | ||
| docker network connect monitoring prometheus | docker network connect monitoring prometheus | ||
| docker network connect monitoring grafana | docker network connect monitoring grafana | ||
| - | Then set Grafana data source URL to '' | + | Then set the Grafana data source URL to '' |
| ====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000==== | ====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000==== | ||
| - | Docker | + | Docker' |
| - | + | ||
| - | ssh -L 9090: | + | |
| - | + | ||
| - | ====CPU Usage % panel shows No data==== | + | |
| - | The label filter '' | + | |
| - | + | ||
| - | nv_cpu_usage_percent | + | |
| - | Also change visualization from Stat to Time series. | + | ====Grafana panel shows No data==== |
| + | - Check the Data source dropdown: it must point to your configured Prometheus data source | ||
| + | - Change time range to **Last 5 minutes** and click **Run queries** | ||
| + | - Switch | ||
| + | - GPU utilization showing 0% at idle is correct, not an error | ||
| - | ====Memory Used shows raw number like 4003753984==== | + | ====Memory Used shows a raw number like 4003753984==== |
| - | No unit set on the panel. Edit panel → Standard options → Unit → **bytes (SI)**. | + | No unit is set on the panel. Edit the panel → Standard options → Unit → select |
| ====SUDO POLICY VIOLATION broadcast messages==== | ====SUDO POLICY VIOLATION broadcast messages==== | ||
| - | This is a sysadmin audit policy | + | This is a sysadmin audit policy. |
| \\ \\ | \\ \\ | ||
| [[wiki: | [[wiki: | ||
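
The "Why 172.17.0.1 and not localhost?" section and Fix 1 under "Prometheus target shows DOWN" both depend on knowing the Docker bridge gateway address. Below is a minimal sketch of deriving the scrape target from that address. It assumes the bridge interface is named ''docker0'' (the Docker default, not confirmed by this page) and reuses port 9101 from Step 4. The function names ''bridge_ip_from_ip_output'' and ''scrape_target'' are hypothetical helpers introduced for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: derive the Prometheus scrape target from the docker0 bridge address.
# Assumptions (not from the original page): the bridge interface is named
# docker0, and nv-monitor listens on port 9101 as in Step 4.

# Read `ip -4 -o addr show dev docker0` output on stdin and print the bare
# IPv4 address: the field after "inet", with the /prefix length removed.
bridge_ip_from_ip_output() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "inet") { split($(i + 1), a, "/"); print a[1]; exit }
  }'
}

# Print the host:port string to place under "targets:" in the Prometheus config.
# The port defaults to 9101 when no second argument is given.
scrape_target() {
  local ip="$1" port="${2:-9101}"
  printf '%s:%s\n' "$ip" "$port"
}

# Example with a canned line, so the sketch runs without Docker installed.
sample='3: docker0    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0'
scrape_target "$(printf '%s\n' "$sample" | bridge_ip_from_ip_output)"
```

On the Spark itself, the canned line can be replaced with live output: ''ip -4 -o addr show dev docker0 | bridge_ip_from_ip_output''. If the printed address differs from the one in the Prometheus config, update the target and restart the Prometheus container as described in Fix 1.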