wiki:ai:dgx-spark-monitoring · last modified 2026/04/17 11:36 by swilson
  * **Grafana:** visualizes metrics in a live dashboard
  * **demo-load:** synthetic CPU + GPU load generator for testing the pipeline
  * Everything runs on the DGX Spark — Prometheus and Grafana run in Docker containers on the same machine

\\ \\
=====Step 1 — SSH into the DGX Spark=====

From your local terminal, SSH into the Spark:
  ssh YOUR_USERNAME@YOUR_SPARK_IP

All steps below are run on the Spark unless noted otherwise.
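To avoid retyping the user and address, you can add a host alias on your local machine. A minimal sketch — the alias name ''spark'' is hypothetical, and you must substitute your real username and IP:

```shell
# Optional convenience: a ~/.ssh/config alias so "ssh spark" connects with the
# username and IP filled in. "spark" is a hypothetical alias name; replace
# YOUR_SPARK_IP and YOUR_USERNAME with your real values.
mkdir -p "$HOME/.ssh"
cat >> "$HOME/.ssh/config" <<'EOF'
Host spark
    HostName YOUR_SPARK_IP
    User YOUR_USERNAME
EOF
chmod 600 "$HOME/.ssh/config"
```

After this, ''ssh spark'' is equivalent to the full command above.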

\\ \\
=====Step 2 — Install Build Dependencies=====

  sudo apt install build-essential libncurses-dev -y
  * **libncurses-dev:** required for the terminal UI (ncursesw wide character support)

If already installed, apt will report "already the newest version" and exit cleanly — that is fine.

\\ \\
=====Step 3 — Clone and Build nv-monitor=====

  cd ~

Verify it works by launching the interactive TUI:
  ./nv-monitor

  * **GPU section:** utilization, temperature, power draw, clock speed
  * **Memory section:** used, buf/cache, swap
  * **VRAM:** shows "unified memory (shared with CPU)" on GB10 — this is expected; nvmlDeviceGetMemoryInfo returns NOT_SUPPORTED on the Grace-Blackwell unified memory architecture
  * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan)

\\ \\
=====Step 4 — Run nv-monitor as a Prometheus Exporter=====

Start nv-monitor in headless mode with a Bearer token:
  cd ~/nv-monitor
  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &

Replace ''YOUR_SECRET_TOKEN'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
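One way to generate a strong token (a sketch; any long random string works, and ''openssl'' ships with Ubuntu):

```shell
# Generate a 48-character hex token and print it; paste the value in place of
# YOUR_SECRET_TOKEN in the nv-monitor command and the Prometheus config.
token=$(openssl rand -hex 24)
echo "$token"
```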
  
====Flags explained====
  * **-n:** headless mode — no TUI, runs silently in the background
  * **-p 9101:** expose Prometheus metrics endpoint on port 9101
  * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request
  * **&:** run in background so the terminal stays free

Verify it is working:
  curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10

You should see output starting with ''# HELP nv_build_info''.
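For reference, the endpoint serves the standard Prometheus text exposition format. The sample below is illustrative (the label set is an assumption, not captured nv-monitor output) and shows how a single value can be pulled out with ''awk'':

```shell
# Illustrative sample of Prometheus text exposition format (not live output;
# the gpu="0" label is an assumption for the example).
sample='# HELP nv_gpu_utilization_percent GPU utilization
# TYPE nv_gpu_utilization_percent gauge
nv_gpu_utilization_percent{gpu="0"} 42'
# Print just the value: skip the "#" comment lines, take field 2 of the sample line.
printf '%s\n' "$sample" | awk '$1 ~ /^nv_gpu_utilization_percent/ {print $2}'
# prints: 42
```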

====Available nv-monitor metrics====

\\ \\
=====Step 5 — Create the Prometheus Configuration=====

  mkdir ~/monitoring
  global:
    scrape_interval: 5s
  scrape_configs:
    - job_name: 'nv-monitor'
      authorization:
        credentials: 'YOUR_SECRET_TOKEN'
      static_configs:
        - targets: ['172.17.0.1:9101']
  EOF

Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4.
  
====Why 172.17.0.1 and not localhost?====
  * Docker containers have their own network namespace
  * ''localhost'' inside a container refers to the container itself, not the host machine
  * ''172.17.0.1'' is the Docker bridge gateway — the IP that containers use to reach the host
  * Verify the gateway IP on your system: ''docker network inspect bridge | grep Gateway''
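If you want the bare IP rather than the whole JSON line, the Gateway field can be extracted directly. A sketch against an abridged, illustrative shape of the inspect output (not captured from a live system):

```shell
# Abridged, illustrative shape of `docker network inspect bridge` output:
inspect='[{"Name":"bridge","IPAM":{"Config":[{"Subnet":"172.17.0.0/16","Gateway":"172.17.0.1"}]}}]'
# Pull out just the gateway address (field 4 when splitting on double quotes):
printf '%s\n' "$inspect" | grep -o '"Gateway":"[^"]*"' | cut -d'"' -f4
# prints: 172.17.0.1
```

On a live system the same pipeline works on the real command's output: ''docker network inspect bridge | grep -o '"Gateway":"[^"]*"' | cut -d'"' -f4''.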
  
\\ \\
=====Step 6 — Start Prometheus and Grafana in Docker=====

  docker run -d \

Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
  docker network create monitoring
  docker network connect monitoring prometheus

Verify both are healthy:
  docker ps
  curl -s localhost:9090/-/healthy

Expected responses:
  * ''Prometheus Server is Healthy.''
  * ''{"database":"ok",...}''
  
\\ \\
=====Step 7 — Allow Docker Bridge to Reach nv-monitor=====

Docker containers live in the ''172.17.x.x'' subnet. The host firewall must allow them to reach port 9101.

**Note:** The DGX Spark does not have UFW installed. Use iptables directly:
  sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT

This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host).

====Note on SUDO POLICY VIOLATION broadcast messages====
The Spark has a sysadmin audit policy that broadcasts a message to all terminals when sudo is used. The command still executes — this is just a notification to the admin team. It is not an error.
  
\\ \\
=====Step 8 — Access UIs from Your Mac via SSH Tunnel=====

SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale.

On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname):
  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP

Keep this terminal open. Then open in your Mac browser:

====Common mistake — running the tunnel from inside the Spark====
If you run the SSH tunnel command from a terminal that is already SSH'd into the Spark, it will SSH back to itself and fail with "Address already in use" — because ports 9090 and 3000 are already bound by the Docker containers on the Spark. Always run the tunnel from a Mac local terminal.

====Why SSH tunneling?====
  * Works over Tailscale without needing to open additional firewall ports
  * Traffic is encrypted by default
  * Easy to disconnect by closing the terminal
  
\\ \\
=====Step 9 — Verify Prometheus is Scraping=====

Open ''http://localhost:9090/targets'' in your browser.

You should see the **nv-monitor** job listed with:
  * State: **UP** (green)
  * Scrape duration: under 10ms (typically ~2ms)

If the state shows DOWN, see the Troubleshooting section.
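You can also check target health from the command line on the Spark via Prometheus's HTTP API (''curl -s localhost:9090/api/v1/targets''). A healthy response contains ''"health":"up"''. The sketch below runs against an abridged, illustrative response, not captured output:

```shell
# Abridged, illustrative /api/v1/targets response from Prometheus:
resp='{"status":"success","data":{"activeTargets":[{"labels":{"job":"nv-monitor"},"health":"up","lastError":""}]}}'
# Extract the health field for a quick up/down check:
printf '%s\n' "$resp" | grep -o '"health":"[^"]*"'
# prints: "health":"up"
```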
  
\\ \\
=====Step 10 — Configure Grafana=====

Open ''http://localhost:3000'' in your browser.

  * Login: **admin** / **admin**
  * Set a new password when prompted

====Add Prometheus as a data source====
  - Click **Save & test**
  - You should see: **Successfully queried the Prometheus API**

====Why ''http://prometheus:9090'' works====
Both containers are on the same Docker network (''monitoring''). Docker provides DNS resolution between containers on the same network, so ''prometheus'' resolves to the Prometheus container's IP automatically. Using ''localhost:9090'' here would not work — it would refer to the Grafana container itself.
  
\\ \\
=====Step 11 — Build the Dashboard=====

  - Click **Dashboards** → **New** → **New dashboard**
  - Click **+ Add visualization**
  - Add each panel below one at a time
  - For each panel: select the metric in the Builder tab, set the title in the right panel options, confirm the visualization type, then click **Back to dashboard**

====Dashboard panels====
  * **Memory Used** — metric: ''nv_memory_used_bytes'' — type: Gauge — unit: bytes (SI)
  
Save the dashboard. Set auto-refresh to **10s** using the dropdown next to the Refresh button.

====Important: select the correct data source when adding panels====
When adding each panel, confirm the Data source dropdown shows the Prometheus data source you configured (not the default placeholder). If a panel shows "No data", check this first.

====Panel shows No data====
  - Change the time range to **Last 5 minutes** and click **Run queries**
  - If still no data, click **Code** in the query editor and type the metric name directly, then run queries
  - The GPU utilization panel will show a flat 0% line at idle — that is correct, not missing data
  
\\ \\
=====Step 12 — Load Test with demo-load=====

''demo-load'' is included in the nv-monitor repo and already built by ''make'' in Step 3.

  cd ~/nv-monitor
  ./demo-load --gpu

Expected output:
  Starting CPU load on 20 cores (sinusoidal, phased)
  Starting GPU load on 1 GPU (sinusoidal)
  Will stop in 5m 0s (Ctrl+C to stop early)
  GPU 0: calibrating... done
  GPU 0: load active

This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard — you should see all panels spike within a few seconds:
  * GPU Power: rises from ~4.5W idle to ~12W under load
  * CPU Usage %: cores hitting 80-100%
  * GPU Utilization: rises from 0%
  * CPU Temperature: climbs from ~45°C to ~70°C+

Press **Ctrl+C** to stop early, or wait 5 minutes for it to finish automatically.
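"Sinusoidal" here means the duty cycle oscillates smoothly rather than sitting pinned at 100%. As an illustration only (not demo-load's actual internals; the 60-second period is an assumption), the duty at time t looks like:

```shell
# Duty cycle of a sinusoidal load oscillating between 0% and 100%, assuming a
# 60-second period, sampled at a few points. Illustration only.
for t in 0 15 30 45; do
  awk -v t="$t" 'BEGIN { pi = atan2(0, -1); printf "t=%02ds duty=%3.0f%%\n", t, 50 + 50 * sin(2 * pi * t / 60) }'
done
# duty goes 50% -> 100% -> 50% -> 0% over one period
```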

nv-monitor and Docker containers do not auto-restart. To bring everything back:

**On the Spark:**

  cd ~/nv-monitor
  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
  docker start prometheus grafana

**On your Mac (new local terminal):**

  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP

Then open ''http://localhost:3000''.
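To make recovery on the Spark a single command, the Spark-side steps can be bundled into a small helper script. A sketch; the script name is hypothetical, and ''YOUR_SECRET_TOKEN'' must be replaced with your token from Step 4:

```shell
# Write a one-shot recovery script to the home directory (hypothetical name).
cat > "$HOME/restart-monitoring.sh" <<'EOF'
#!/bin/sh
# Restart the monitoring stack after a reboot of the Spark.
cd "$HOME/nv-monitor" || exit 1
./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
docker start prometheus grafana
EOF
chmod +x "$HOME/restart-monitoring.sh"
```

Recovery is then just ''~/restart-monitoring.sh'' (the SSH tunnel on the Mac is still a separate step).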

====nv-monitor binary does not exist after git clone====
A file or directory named ''nv-monitor'' already existed in the home directory before cloning.

  rm -rf ~/nv-monitor
  git clone https://github.com/wentbackward/nv-monitor
  cd nv-monitor

====Prometheus target shows DOWN — context deadline exceeded====
Apply both fixes:

**Fix 1** — Use the correct target IP in ''prometheus.yml''. The target must be the Docker bridge gateway, not localhost:
  targets: ['172.17.0.1:9101']

Find the correct gateway IP with: ''docker network inspect bridge | grep Gateway''

Then restart Prometheus: ''docker restart prometheus''

**Fix 2** — Allow Docker bridge through the firewall:
  sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
  
====UFW command not found====
The DGX Spark does not have UFW installed. Use iptables directly (see Step 7).

====SSH tunnel fails with "Address already in use"====
You ran the tunnel command from inside an existing SSH session to the Spark. The Spark already has Docker containers binding ports 9090 and 3000. Open a new terminal on your Mac (prompt must show your Mac hostname, not the Spark) and run the tunnel from there.

====Grafana cannot connect to Prometheus — "lookup prometheus: no such host"====
The containers are not on the same Docker network. Run:
  docker network create monitoring
  docker network connect monitoring prometheus
  docker network connect monitoring grafana

Then set the Grafana data source URL to ''http://prometheus:9090''.
  
====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000====
Docker's iptables rules can bypass UFW, making direct browser access unreliable. Use SSH tunneling instead (see Step 8).

====Grafana panel shows No data====
  - Check the Data source dropdown — must point to your configured Prometheus data source
  - Change time range to **Last 5 minutes** and click **Run queries**
  - Switch to **Code** mode and type the metric name directly
  - GPU utilization showing 0% at idle is correct — not an error

====Memory Used shows raw number like 4003753984====
No unit is set on the panel. Edit the panel → Standard options → Unit → select **bytes (SI)**.
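For context on what **bytes (SI)** does with that raw value: SI units scale by powers of 1000 (GB), while IEC units scale by powers of 1024 (GiB). The number from this example works out as:

```shell
# Convert the raw byte count from the example above to SI gigabytes and IEC gibibytes.
bytes=4003753984
awk -v b="$bytes" 'BEGIN { printf "%.2f GB (SI) / %.2f GiB (IEC)\n", b / 1e9, b / (1024^3) }'
# prints: 4.00 GB (SI) / 3.73 GiB (IEC)
```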
  
====SUDO POLICY VIOLATION broadcast messages====
This is a sysadmin audit policy. The command still executes — the broadcast is just a notification to the admin team. It is not an error.

\\ \\
[[wiki:ai:home-page|AI Home]]
  