wiki:ai:dgx-spark-monitoring

Differences

This shows you the differences between two versions of the page.

wiki:ai:dgx-spark-monitoring [2026/04/17 11:19] swilson
wiki:ai:dgx-spark-monitoring [2026/04/17 11:36] (current) – [Step 1: SSH into the DGX Spark] swilson
Line 10: Line 10:
  
 \\ \\ \\ \\
-=====Step 1 — SSH into the DGX Spark=====
+=====Step 1: SSH into the DGX Spark=====
  
-From your Mac terminal, SSH into the Spark:
-
-  ssh <your-username>@<spark-ip>
+From your local terminal, SSH into the Spark:
+  ssh YOUR_USERNAME@YOUR_SPARK_IP
+
  
 All steps below are run on the Spark unless noted otherwise. All steps below are run on the Spark unless noted otherwise.
  
 \\ \\ \\ \\
-=====Step 2 — Install Build Dependencies=====
+=====Step 2: Install Build Dependencies=====
  
   sudo apt install build-essential libncurses-dev -y   sudo apt install build-essential libncurses-dev -y
Line 29: Line 28:
  
 \\ \\ \\ \\
-=====Step 3 — Clone and Build nv-monitor=====
+=====Step 3: Clone and Build nv-monitor=====
  
   cd ~   cd ~
Line 39: Line 38:
  
 Verify it works by launching the interactive TUI: Verify it works by launching the interactive TUI:
- 
   ./nv-monitor   ./nv-monitor
  
Line 52: Line 50:
  
 \\ \\ \\ \\
-=====Step 4 — Run nv-monitor as a Prometheus Exporter=====
+=====Step 4: Run nv-monitor as a Prometheus Exporter=====
  
 Start nv-monitor in headless mode with a Bearer token: Start nv-monitor in headless mode with a Bearer token:
- 
   cd ~/nv-monitor   cd ~/nv-monitor
-  ./nv-monitor -n -p 9101 -t <your-secret-token> &
+  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
  
-Replace ''<your-secret-token>'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
+Replace ''YOUR_SECRET_TOKEN'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
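
One way to produce a strong token is to hex-encode random bytes from the kernel. The sketch below is an illustration and not part of nv-monitor: it assumes the standard ''head'', ''od'' and ''tr'' utilities are available, and the variable name ''NV_TOKEN'' is hypothetical.

```shell
# Sketch: generate a random token for the -t flag.
# NV_TOKEN is a hypothetical variable name, not something nv-monitor reads.
# 32 bytes from /dev/urandom are hex-encoded by od, giving 64 characters.
NV_TOKEN="$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"

# Print only the length so the secret does not sit in terminal scrollback.
echo "generated a token of ${#NV_TOKEN} characters"
```

Run ''echo "$NV_TOKEN"'' when you are ready to copy the value, then use it wherever this guide shows ''YOUR_SECRET_TOKEN''.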
  
 ====Flags explained==== ====Flags explained====
   * **-n:** headless mode — no TUI, runs silently in the background   * **-n:** headless mode — no TUI, runs silently in the background
   * **-p 9101:** expose Prometheus metrics endpoint on port 9101   * **-p 9101:** expose Prometheus metrics endpoint on port 9101
-  * **-t <token>:** require this Bearer token on every HTTP request
+  * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request
   * **&:** run in background so the terminal stays free   * **&:** run in background so the terminal stays free
  
Line 72: Line 69:
  
 Verify it is working: Verify it is working:
-
-  curl -s -H "Authorization: Bearer <your-secret-token>" localhost:9101/metrics | head -10
+  curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10
+
  
 You should see output starting with ''# HELP nv_build_info''. You should see output starting with ''# HELP nv_build_info''.
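
If you want to script this check rather than read the output by eye, a small helper can test the first line of the response. This is a minimal sketch: the function name ''metrics_look_ok'' is hypothetical, and it only checks for the ''# HELP nv_build_info'' prefix mentioned above.

```shell
# Hypothetical helper: succeed only if the first line on stdin starts with
# "# HELP nv_build_info", the line this guide says to expect.
metrics_look_ok() {
  # read fails on input with no trailing newline, so also accept a non-empty line
  IFS= read -r first_line || [ -n "$first_line" ] || return 1
  case "$first_line" in
    "# HELP nv_build_info"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (on the Spark, with the exporter running):
#   curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics \
#     | metrics_look_ok && echo "exporter OK"
```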
Line 88: Line 84:
  
 \\ \\ \\ \\
-=====Step 5 — Create the Prometheus Configuration=====
+=====Step 5: Create the Prometheus Configuration=====
  
   mkdir ~/monitoring   mkdir ~/monitoring
Line 95: Line 91:
   global:   global:
     scrape_interval: 5s     scrape_interval: 5s
- 
   scrape_configs:   scrape_configs:
     - job_name: 'nv-monitor'     - job_name: 'nv-monitor'
       authorization:       authorization:
-        credentials: '<your-secret-token>'
+        credentials: 'YOUR_SECRET_TOKEN'
       static_configs:       static_configs:
         - targets: ['172.17.0.1:9101']         - targets: ['172.17.0.1:9101']
   EOF   EOF
  
-Replace ''<your-secret-token>'' with the same token you used in Step 4.
+Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4.
  
 ====Why 172.17.0.1 and not localhost?==== ====Why 172.17.0.1 and not localhost?====
Line 113: Line 108:
  
 \\ \\ \\ \\
-=====Step 6 — Start Prometheus and Grafana in Docker=====
+=====Step 6: Start Prometheus and Grafana in Docker=====
  
   docker run -d \   docker run -d \
Line 127: Line 122:
  
 Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
- 
   docker network create monitoring   docker network create monitoring
   docker network connect monitoring prometheus   docker network connect monitoring prometheus
Line 133: Line 127:
  
 Verify both are healthy: Verify both are healthy:
- 
   docker ps   docker ps
   curl -s localhost:9090/-/healthy   curl -s localhost:9090/-/healthy
Line 143: Line 136:
  
 \\ \\ \\ \\
-=====Step 7 — Allow Docker Bridge to Reach nv-monitor=====
+=====Step 7: Allow Docker Bridge to Reach nv-monitor=====
  
 Docker containers live in the ''172.17.x.x'' subnet. The host firewall must allow them to reach port 9101. Docker containers live in the ''172.17.x.x'' subnet. The host firewall must allow them to reach port 9101.
  
 **Note:** The DGX Spark does not have UFW installed. Use iptables directly: **Note:** The DGX Spark does not have UFW installed. Use iptables directly:
- 
   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
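
Running the ''-I'' command more than once inserts the same rule again each time. The sketch below avoids that by using iptables' ''-C'' (check) option first. The function name ''allow_docker_bridge'' and the ''IPTABLES'' variable are hypothetical names chosen for this example.

```shell
# Command used to call iptables; override IPTABLES to dry-run or test.
IPTABLES="${IPTABLES:-sudo iptables}"

# Hypothetical helper: insert the Step 7 rule only if it is not already present.
allow_docker_bridge() {
  rule="INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT"
  # -C exits non-zero when no matching rule exists.
  # $IPTABLES and $rule are left unquoted on purpose so they split into words.
  if ! $IPTABLES -C $rule 2>/dev/null; then
    $IPTABLES -I $rule
  fi
}

# Usage: allow_docker_bridge
```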
  
Line 157: Line 149:
  
 \\ \\ \\ \\
-=====Step 8 — Access UIs from Your Mac via SSH Tunnel=====
+=====Step 8: Access UIs from Your Mac via SSH Tunnel=====
  
 SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale. SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale.
  
 On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname): On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname):
-
-  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 <your-username>@<spark-ip>
+  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP
+
  
 Keep this terminal open. Then open in your Mac browser: Keep this terminal open. Then open in your Mac browser:
Line 178: Line 169:
  
 \\ \\ \\ \\
-=====Step 9 — Verify Prometheus is Scraping=====
+=====Step 9: Verify Prometheus is Scraping=====
  
 Open ''http://localhost:9090/targets'' in your browser. Open ''http://localhost:9090/targets'' in your browser.
Line 189: Line 180:
  
 \\ \\ \\ \\
-=====Step 10 — Configure Grafana=====
+=====Step 10: Configure Grafana=====
  
 Open ''http://localhost:3000'' in your browser. Open ''http://localhost:3000'' in your browser.
Line 209: Line 200:
  
 \\ \\ \\ \\
-=====Step 11 — Build the Dashboard=====
+=====Step 11: Build the Dashboard=====
  
   - Click **Dashboards** → **New** → **New dashboard**   - Click **Dashboards** → **New** → **New dashboard**
Line 235: Line 226:
  
 \\ \\ \\ \\
-=====Step 12 — Load Test with demo-load=====
+=====Step 12: Load Test with demo-load=====
  
 ''demo-load'' is included in the nv-monitor repo and already built by ''make'' in Step 3. ''demo-load'' is included in the nv-monitor repo and already built by ''make'' in Step 3.
Line 265: Line 256:
  
   cd ~/nv-monitor   cd ~/nv-monitor
-  ./nv-monitor -n -p 9101 -t <your-secret-token> &
+  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
   docker start prometheus grafana   docker start prometheus grafana
  
 **On your Mac (new local terminal):** **On your Mac (new local terminal):**
  
-  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 <your-username>@<spark-ip>
+  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP
  
 Then open ''http://localhost:3000''. Then open ''http://localhost:3000''.
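
The Spark-side restart commands above can be wrapped in a small function so the token is not typed on the command line each time. This is a sketch under assumptions: the function name ''start_exporter'' and the environment variable ''NV_MONITOR_TOKEN'' are hypothetical, and ''nohup'' is added here in the hope that the exporter keeps running after the SSH session closes.

```shell
# Hypothetical helper: start nv-monitor headless with the token taken from
# the NV_MONITOR_TOKEN environment variable (an assumed name).
start_exporter() {
  if [ -z "${NV_MONITOR_TOKEN:-}" ]; then
    echo "NV_MONITOR_TOKEN is not set" >&2
    return 1
  fi
  # The subshell keeps the caller's working directory unchanged.
  # Flags are the ones from Step 4: -n headless, -p port, -t Bearer token.
  (cd ~/nv-monitor && nohup ./nv-monitor -n -p 9101 -t "$NV_MONITOR_TOKEN" >/dev/null 2>&1 &)
}

# Usage (on the Spark):
#   export NV_MONITOR_TOKEN=YOUR_SECRET_TOKEN
#   start_exporter && docker start prometheus grafana
```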
Line 298: Line 289:
  
 **Fix 1** — Use the correct target IP in ''prometheus.yml''. The target must be the Docker bridge gateway, not localhost: **Fix 1** — Use the correct target IP in ''prometheus.yml''. The target must be the Docker bridge gateway, not localhost:
- 
   targets: ['172.17.0.1:9101']   targets: ['172.17.0.1:9101']
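
To confirm the gateway address on your own Spark rather than assume it, you can read it from Docker's default bridge interface, ''docker0''. The sketch below parses the output of ''ip -4 addr show docker0''. The function name ''bridge_ip'' is hypothetical.

```shell
# Hypothetical helper: print the IPv4 address from `ip -4 addr show` output
# read on stdin, with the /prefix length stripped.
bridge_ip() {
  awk '$1 == "inet" { split($2, parts, "/"); print parts[1]; exit }'
}

# Usage (on the Spark):
#   ip -4 addr show docker0 | bridge_ip
```

If the printed address differs from ''172.17.0.1'', use the printed value in ''prometheus.yml'' instead.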
  
Line 306: Line 296:
  
 **Fix 2** — Allow Docker bridge through the firewall: **Fix 2** — Allow Docker bridge through the firewall:
- 
   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
  
Line 317: Line 306:
 ====Grafana cannot connect to Prometheus — "lookup prometheus: no such host"==== ====Grafana cannot connect to Prometheus — "lookup prometheus: no such host"====
 The containers are not on the same Docker network. Run: The containers are not on the same Docker network. Run:
- 
   docker network create monitoring   docker network create monitoring
   docker network connect monitoring prometheus   docker network connect monitoring prometheus
wiki/ai/dgx-spark-monitoring.1776424764.txt.gz · Last modified: by swilson