wiki:ai:dgx-spark-monitoring · last modified 2026/04/17 11:36 by swilson

  * **Grafana:** visualizes metrics in a live dashboard
  * **demo-load:** synthetic CPU + GPU load generator for testing the pipeline
  * Everything runs on the DGX Spark — Prometheus and Grafana run in Docker containers on the same machine

\\ \\
=====Step 1: SSH into the DGX Spark=====

From your local terminal, SSH into the Spark:
  ssh YOUR_USERNAME@YOUR_SPARK_IP

All steps below are run on the Spark unless noted otherwise.

\\ \\
=====Step 2: Install Build Dependencies=====

  sudo apt install build-essential libncurses-dev -y
  * **build-essential:** gcc, make, and standard C libraries
  * **libncurses-dev:** required for the terminal UI (ncursesw wide character support)

If already installed, apt will report "already the newest version" and exit cleanly — that is fine.

\\ \\
=====Step 3: Clone and Build nv-monitor=====

  cd ~
  git clone https://github.com/wentbackward/nv-monitor
  cd nv-monitor
  make

If the repo already exists from a previous run, git will print "destination path 'nv-monitor' already exists" and make will print "Nothing to be done for 'all'" — both are fine; the binary is already built.

Verify it works by launching the interactive TUI:
  ./nv-monitor
  
  * **GPU section:** utilization, temperature, power draw, clock speed
  * **Memory section:** used, buf/cache, swap
  * **VRAM:** shows "unified memory (shared with CPU)" on GB10 — this is expected: ''nvmlDeviceGetMemoryInfo'' returns NOT_SUPPORTED on the Grace-Blackwell unified memory architecture
  * **History chart:** rolling 20-sample graph of CPU (green) and GPU (cyan)

\\ \\
=====Step 4: Run nv-monitor as a Prometheus Exporter=====

Start nv-monitor in headless mode with a Bearer token:
  cd ~/nv-monitor
  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &

Replace ''YOUR_SECRET_TOKEN'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
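If you do not have a token handy, one way to generate a random 32-character hex token is from ''/dev/urandom'' (a sketch; any sufficiently random string works):

```shell
# Generate a 32-character hex token (16 random bytes) to pass to the -t flag
token=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
echo "$token"
```

Using ''od'' to read exactly 16 bytes avoids truncating a pipe mid-stream, so the command also behaves under ''set -o pipefail''.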
  
====Flags explained====
  * **-n:** headless mode — no TUI, runs silently in the background
  * **-p 9101:** expose Prometheus metrics endpoint on port 9101
  * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request
  * **&:** run in background so the terminal stays free

On startup it prints:
  Prometheus metrics at http://0.0.0.0:9101/metrics
  Running headless (Ctrl+C to stop)

Verify it is working:
  curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10

You should see output starting with ''# HELP nv_build_info''.
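The endpoint serves the Prometheus text exposition format: ''# HELP''/''# TYPE'' comment lines followed by ''name value'' samples. A minimal sketch of pulling one value out with awk — the sample output here is inlined and hypothetical; real values come from the curl command above:

```shell
# Two hypothetical samples in Prometheus exposition format
metrics='# HELP nv_gpu_utilization_percent GPU utilization
# TYPE nv_gpu_utilization_percent gauge
nv_gpu_utilization_percent 42.0
nv_cpu_usage_percent 13.5'

# Print the value of a single metric by matching on the first field
gpu=$(printf '%s\n' "$metrics" | awk '$1 == "nv_gpu_utilization_percent" { print $2 }')
echo "GPU utilization: ${gpu}%"   # prints: GPU utilization: 42.0%
```

The same awk filter works on the live ''curl'' output if you want a one-number check from a script.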

\\ \\
=====Step 5: Create the Prometheus Configuration=====

  mkdir ~/monitoring
  cat > ~/monitoring/prometheus.yml << 'EOF'
  global:
    scrape_interval: 5s
  scrape_configs:
    - job_name: 'nv-monitor'
      authorization:
        credentials: 'YOUR_SECRET_TOKEN'
      static_configs:
        - targets: ['172.17.0.1:9101']
  EOF

Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4.

====Why 172.17.0.1 and not localhost?====
  * Docker containers have their own network namespace
  * ''localhost'' inside a container refers to the container itself, not the host machine
  * ''172.17.0.1'' is the Docker bridge gateway — the IP that containers use to reach the host
  * Verify the gateway IP on your system: ''docker network inspect bridge | grep Gateway''
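''docker network inspect'' emits JSON; the grep one-liner above just pulls the ''Gateway'' field out of it. A sketch of the same extraction against an abridged, hypothetical sample of that JSON:

```shell
# Abridged sample of: docker network inspect bridge
json='[{"Name":"bridge","IPAM":{"Config":[{"Subnet":"172.17.0.0/16","Gateway":"172.17.0.1"}]}}]'

# Extract the Gateway value (field 4 when splitting on double quotes)
gateway=$(printf '%s' "$json" | grep -o '"Gateway":"[^"]*"' | cut -d'"' -f4)
echo "$gateway"   # prints: 172.17.0.1
```

If the printed value is not ''172.17.0.1'' on your machine, use that value in the ''targets'' line instead.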
  
\\ \\
=====Step 6: Start Prometheus and Grafana in Docker=====

  docker run -d \

Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
  docker network create monitoring
  docker network connect monitoring prometheus
  docker network connect monitoring grafana

Verify both are healthy:
  docker ps
  curl -s localhost:9090/-/healthy
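If the health check is run immediately after starting the containers, Prometheus may not be listening yet. A small generic retry helper (plain shell; the curl invocation it wraps is the one above):

```shell
# Retry a command up to N times with a 1-second pause between attempts
wait_for() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" && return 0          # success: stop retrying
    i=$((i + 1))
    sleep 1
  done
  return 1                    # still failing after all attempts
}

# On the Spark (assumes the containers started above):
#   wait_for 10 curl -sf localhost:9090/-/healthy && echo "Prometheus is up"
```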

\\ \\
=====Step 7: Allow Docker Bridge to Reach nv-monitor=====

Docker containers live in the ''172.17.x.x'' subnet. The host firewall must allow them to reach port 9101.

**Note:** The DGX Spark does not have UFW installed. Use iptables directly:
  sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT

This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host).

====Note on SUDO POLICY VIOLATION broadcast messages====
The Spark has a sysadmin audit policy that broadcasts a message to all terminals when sudo is used. The command still executes — this is just a notification to the admin team. It is not an error.
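The ''/16'' in the rule means "match any source address whose first two octets are 172.17". A tiny illustration of that prefix match in plain shell (the addresses are examples):

```shell
# True when the address falls inside 172.17.0.0/16
in_bridge_subnet() {
  case "$1" in
    172.17.*) return 0 ;;
    *)        return 1 ;;
  esac
}

in_bridge_subnet 172.17.0.2  && echo "172.17.0.2: matched by the rule"
in_bridge_subnet 192.168.1.5 || echo "192.168.1.5: not matched"
```

The string match only works here because the mask is a whole number of octets; arbitrary masks need real bitwise arithmetic, which iptables does for you.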
  
\\ \\
=====Step 8: Access UIs from Your Mac via SSH Tunnel=====

SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale.

On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname):
  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP

Keep this terminal open. Then open in your Mac browser:
  * **Prometheus:** http://localhost:9090/targets
  * **Grafana:** http://localhost:3000

====Common mistake — running the tunnel from inside the Spark====
If you run the SSH tunnel command from a terminal that is already SSH'd into the Spark, it will SSH back to itself and fail with "Address already in use" — because ports 9090 and 3000 are already bound by the Docker containers on the Spark. Always run the tunnel from a Mac local terminal.

====Why SSH tunneling?====
  * Works over Tailscale without needing to open additional firewall ports
  * Traffic is encrypted by default
  * Easy to disconnect by closing the terminal
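Instead of retyping the ''-L'' flags each time, the tunnel can be stored in ''~/.ssh/config'' on the Mac (a sketch; the ''spark-monitor'' alias and the placeholders are illustrative, not from this setup):

```text
Host spark-monitor
    HostName YOUR_SPARK_IP
    User YOUR_USERNAME
    LocalForward 9090 localhost:9090
    LocalForward 3000 localhost:3000
```

Then ''ssh spark-monitor'' opens the same two forwards.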
  
\\ \\
=====Step 9: Verify Prometheus is Scraping=====

Open ''http://localhost:9090/targets'' in your browser.

You should see the **nv-monitor** job listed with:
  * State: **UP** (green)
  * Scrape duration: under 10ms (typically ~2ms)

If the state shows DOWN, see the Troubleshooting section.
  
\\ \\
=====Step 10: Configure Grafana=====

Open ''http://localhost:3000'' in your browser.

  * Login: **admin** / **admin**
  * Set a new password when prompted

====Add Prometheus as a data source====

====Why ''http://prometheus:9090'' works====
Both containers are on the same Docker network (''monitoring''). Docker provides DNS resolution between containers on the same network, so ''prometheus'' resolves to the Prometheus container's IP automatically. Using ''localhost:9090'' here would not work — it would refer to the Grafana container itself.
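As an alternative to clicking through the UI, Grafana can also pick the data source up from a provisioning file (a sketch using Grafana's standard provisioning mechanism; the file would have to be mounted into the container, which this guide does not do):

```yaml
# datasources.yml — mounted under /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

This is convenient if you later rebuild the Grafana container and do not want to re-enter the data source by hand.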
  
\\ \\
=====Step 11: Build the Dashboard=====

  - Click **Dashboards** → **New** → **New dashboard**
  - Click **+ Add visualization**
  - Add each panel below one at a time
  - For each panel: select the metric in the Builder tab, set the title in the right panel options, confirm the visualization type, then click **Back to dashboard**

====Dashboard panels====
  * **CPU Usage %** — metric: ''nv_cpu_usage_percent'' — type: Time series
  * **CPU Temperature** — metric: ''nv_cpu_temperature_celsius'' — type: Time series
  * **GPU Utilization %** — metric: ''nv_gpu_utilization_percent'' — type: Time series
  * **GPU Power (W)** — metric: ''nv_gpu_power_watts'' — type: Time series
  * **GPU Temperature** — metric: ''nv_gpu_temperature_celsius'' — type: Time series
  * **Memory Used** — metric: ''nv_memory_used_bytes'' — type: Gauge — unit: bytes (SI)

Save the dashboard. Set auto-refresh to **10s** using the dropdown next to the Refresh button.

====Important: select the correct data source when adding panels====
When adding each panel, confirm the Data source dropdown shows the Prometheus data source you configured (not the default placeholder). If a panel shows "No data", check this first.

====Panel shows No data====
  - Change the time range to **Last 5 minutes** and click **Run queries**
  - If still no data, click **Code** in the query editor and type the metric name directly, then run queries
  - The GPU utilization panel will show a flat 0% line at idle — that is correct, not missing data
  
\\ \\
=====Step 12: Load Test with demo-load=====

''demo-load'' is included in the nv-monitor repo and already built by ''make'' in Step 3.

  cd ~/nv-monitor
  ./demo-load --gpu

Expected output:
  Starting CPU load on 20 cores (sinusoidal, phased)
  Starting GPU load on 1 GPU (sinusoidal)
  Will stop in 5m 0s (Ctrl+C to stop early)
  GPU 0: calibrating... done
  GPU 0: load active

This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard — you should see all panels spike within a few seconds:
  * GPU Power: rises from ~4.5W idle to ~12W under load
  * CPU Usage %: cores hitting 80–100%
  * GPU Utilization: rises from 0%
  * CPU Temperature: climbs from ~45°C to ~70°C

Press **Ctrl+C** to stop early, or wait 5 minutes for it to finish automatically.
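"Sinusoidal" means the load level sweeps smoothly up and down rather than toggling on and off. An illustrative sketch of such a duty cycle (this is the general formula, not demo-load's actual code; the 60 s period is an assumption):

```shell
# Load fraction over one 60-second period: 0.5 + 0.5*sin(2*pi*t/60)
for t in 0 15 30 45; do
  awk -v t="$t" 'BEGIN {
    pi = 3.14159265358979
    printf "t=%2ds  load=%.2f\n", t, 0.5 + 0.5 * sin(2 * pi * t / 60)
  }'
done
```

The value ramps 0.50 → 1.00 → 0.50 → 0.00 across the period, which is why the Grafana panels show smooth waves rather than square pulses.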
  
\\ \\
nv-monitor and Docker containers do not auto-restart. To bring everything back:

**On the Spark:**

  cd ~/nv-monitor
  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
  docker start prometheus grafana

**On your Mac (new local terminal):**

  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP

Then open ''http://localhost:3000''.

====nv-monitor binary does not exist after git clone====
A file or directory named ''nv-monitor'' already existed in the home directory before cloning.

  rm -rf ~/nv-monitor
  git clone https://github.com/wentbackward/nv-monitor
  cd nv-monitor
  make

====Prometheus target shows DOWN — context deadline exceeded====
Apply both fixes:

**Fix 1** — Use the correct target IP in ''prometheus.yml''. The target must be the Docker bridge gateway, not localhost:
  targets: ['172.17.0.1:9101']

Find the correct gateway IP with: ''docker network inspect bridge | grep Gateway''

Then restart Prometheus: ''docker restart prometheus''

**Fix 2** — Allow Docker bridge through the firewall:
  sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT

====UFW command not found====
The DGX Spark does not have UFW installed. Use iptables directly (see Step 7).

====SSH tunnel fails with "Address already in use"====
You ran the tunnel command from inside an existing SSH session to the Spark. The Spark already has Docker containers binding ports 9090 and 3000. Open a new terminal on your Mac (the prompt must show your Mac hostname, not the Spark) and run the tunnel from there.

====Grafana cannot connect to Prometheus — "lookup prometheus: no such host"====
The containers are not on the same Docker network. Run:
  docker network create monitoring
  docker network connect monitoring prometheus
  docker network connect monitoring grafana

Then set the Grafana data source URL to ''http://prometheus:9090''.

====Browser shows ERR_CONNECTION_RESET for port 9090 or 3000====
Docker's iptables rules can bypass UFW, making direct browser access unreliable. Use SSH tunneling instead (see Step 8).

====Grafana panel shows No data====
  - Check the Data source dropdown — must point to your configured Prometheus data source
  - Change time range to **Last 5 minutes** and click **Run queries**
  - Switch to **Code** mode and type the metric name directly
  - GPU utilization showing 0% at idle is correct — not an error
  
====Memory Used shows raw number like 4003753984====
No unit is set on the panel. Edit the panel → Standard options → Unit → select **bytes (SI)**.
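For reference, **bytes (SI)** scales by powers of 1000 (not 1024), so the raw value above lands at just over 4 GB. The same conversion in shell:

```shell
# Convert a raw byte count to SI gigabytes (1 GB = 1e9 bytes)
raw=4003753984
awk -v b="$raw" 'BEGIN { printf "%.2f GB\n", b / 1e9 }'   # prints: 4.00 GB
```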
  
====SUDO POLICY VIOLATION broadcast messages====
This is a sysadmin audit policy. The command still executes — the broadcast is just a notification to the admin team. It is not an error.
  
\\ \\
[[wiki:ai:home-page|AI Home]]