Differences

This shows you the differences between two versions of the page.

wiki:ai:dgx-spark-monitoring [2026/04/17 11:26] swilson
wiki:ai:dgx-spark-monitoring [2026/04/17 11:36] (current) – [Step 1: SSH into the DGX Spark] swilson
Line 12: Line 12:
 =====Step 1: SSH into the DGX Spark=====
 
-From your Mac terminal, SSH into the Spark:
-
-  ssh <your-username>@<spark-ip>
+From your local terminal, SSH into the Spark:
+  ssh YOUR_USERNAME@YOUR_SPARK_IP
 
 All steps below are run on the Spark unless noted otherwise.
Line 39: Line 38:
 
 Verify it works by launching the interactive TUI:
-
   ./nv-monitor
 
Line 55: Line 53:
 
 Start nv-monitor in headless mode with a Bearer token:
-
   cd ~/nv-monitor
-  ./nv-monitor -n -p 9101 -t <your-secret-token> &
+  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
 
-Replace ''<your-secret-token>'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
+Replace ''YOUR_SECRET_TOKEN'' with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
 
 ====Flags explained====
   * **-n:** headless mode: no TUI, runs silently in the background
   * **-p 9101:** expose Prometheus metrics endpoint on port 9101
-  * **-t <token>:** require this Bearer token on every HTTP request
+  * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request
   * **&:** run in background so the terminal stays free
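
The step above asks for a strong token but does not say how to make one. A minimal sketch, assuming a Linux shell with ''/dev/urandom'', ''od'' and ''tr'' available; ''generate_token'' is a hypothetical helper name, not part of nv-monitor:

```shell
# Sketch: build a 64-character hex token from 32 random bytes.
# generate_token is an illustrative name, not an nv-monitor command.
generate_token() {
  # od prints the bytes as space-separated hex; tr strips spaces and newlines.
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

TOKEN="$(generate_token)"
echo "$TOKEN"
```

The printed value can then be used wherever ''YOUR_SECRET_TOKEN'' appears. A hex-only token avoids quoting problems in both the shell command and the YAML config.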
  
Line 72: Line 69:
 
 Verify it is working:
-
-  curl -s -H "Authorization: Bearer <your-secret-token>" localhost:9101/metrics | head -10
+  curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10
 
 You should see output starting with ''# HELP nv_build_info''.
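
If you would rather script this check than read the output by eye, the following sketch tests for the expected line. It assumes only what the text above states, namely that a successful response contains ''# HELP nv_build_info''; ''looks_like_nv_metrics'' is a hypothetical helper name:

```shell
# Sketch: exit 0 only if the text on stdin contains the expected HELP line.
# looks_like_nv_metrics is an illustrative name, not part of nv-monitor.
looks_like_nv_metrics() {
  grep -q '^# HELP nv_build_info'
}

# Usage on the Spark, with the same placeholder token as above:
#   curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics \
#     | looks_like_nv_metrics && echo "metrics OK"
```

Any other response, such as an error body or empty output, makes the function return non-zero, so nothing is printed.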
Line 95: Line 91:
   global:
     scrape_interval: 5s
-
   scrape_configs:
     - job_name: 'nv-monitor'
       authorization:
-        credentials: '<your-secret-token>'
+        credentials: 'YOUR_SECRET_TOKEN'
       static_configs:
         - targets: ['172.17.0.1:9101']
   EOF
 
-Replace ''<your-secret-token>'' with the same token you used in Step 4.
+Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4.
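
One way to keep the token identical in Step 4 and in the Prometheus config is to render the config from a shell variable instead of editing it by hand. The sketch below is not part of the original steps: ''render_prometheus_config'' is a hypothetical helper, and it prints to stdout so you can redirect the result to wherever your ''prometheus.yml'' lives. It assumes the token contains no single quotes.

```shell
# Sketch: print the scrape config shown above with the token filled in.
# render_prometheus_config is an illustrative name; $1 is the Bearer token.
render_prometheus_config() {
  # Unquoted heredoc delimiter so that $1 is expanded by the shell.
  cat <<EOF
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: 'nv-monitor'
    authorization:
      credentials: '$1'
    static_configs:
      - targets: ['172.17.0.1:9101']
EOF
}

# Example: prints the config with the placeholder token substituted.
render_prometheus_config "YOUR_SECRET_TOKEN"
```

Passing the same variable to both ''./nv-monitor -t'' and this function removes the chance of a copy-and-paste mismatch between the two steps.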
  
 ====Why 172.17.0.1 and not localhost?====
Line 127: Line 122:
 
 Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
-
   docker network create monitoring
   docker network connect monitoring prometheus
Line 133: Line 127:
 
 Verify both are healthy:
-
   docker ps
   curl -s localhost:9090/-/healthy
Line 148: Line 141:
 
 **Note:** The DGX Spark does not have UFW installed. Use iptables directly:
-
   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
 
Line 162: Line 154:
 
 On your **Mac**, open a **new local terminal** (not an SSH session to the Spark; the prompt must show your Mac hostname):
-
-  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 <your-username>@<spark-ip>
+  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP
 
 Keep this terminal open. Then open in your Mac browser:
Line 265: Line 256:
 
   cd ~/nv-monitor
-  ./nv-monitor -n -p 9101 -t <your-secret-token> &
+  ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
   docker start prometheus grafana
 
 **On your Mac (new local terminal):**
 
-  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 <your-username>@<spark-ip>
+  ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP
 
 Then open ''http://localhost:3000''.
Line 298: Line 289:
 
 **Fix 1**: Use the correct target IP in ''prometheus.yml''. The target must be the Docker bridge gateway, not localhost:
-
   targets: ['172.17.0.1:9101']
 
Line 306: Line 296:
 
 **Fix 2**: Allow Docker bridge through the firewall:
-
   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
 
Line 317: Line 306:
 ====Grafana cannot connect to Prometheus: "lookup prometheus: no such host"====
 The containers are not on the same Docker network. Run:
-
   docker network create monitoring
   docker network connect monitoring prometheus
wiki/ai/dgx-spark-monitoring.1776425214.txt.gz · Last modified: by swilson