wiki:ai:dgx-spark-monitoring

Differences

This shows you the differences between two versions of the page.

wiki:ai:dgx-spark-monitoring [2026/04/24 15:48] – [Step 2: Install Build Dependencies] swilson
wiki:ai:dgx-spark-monitoring [2026/04/24 16:24] (current) – [Step 11: Build the Dashboard] swilson
Line 53: Line 53:
   cd nv-monitor   cd nv-monitor
   make   make
 +  
 +{{:wiki:ai:screenshot_2026-04-17_at_3.36.58 pm.png|}}
  
 If the repo already exists from a previous run, git will print "destination path 'nv-monitor' already exists" and make will print "Nothing to be done for 'all'" — both are fine, the binary is already built. If the repo already exists from a previous run, git will print "destination path 'nv-monitor' already exists" and make will print "Nothing to be done for 'all'" — both are fine, the binary is already built.
Line 59: Line 61:
   ./nv-monitor   ./nv-monitor
      
-{{:wiki:ai:screenshot_2026-04-17_at_3.36.20 pm.png|}}+{{:wiki:ai:Screenshot 2026-04-17 at 3.38.02 PM.png|}}
  
 Press **q** to quit. Press **q** to quit.
Line 76: Line 78:
   cd ~/nv-monitor   cd ~/nv-monitor
   ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &   ./nv-monitor -n -p 9101 -t YOUR_SECRET_TOKEN &
 +  
 +{{:wiki:ai:screenshot_2026-04-17_at_3.39.19 pm.png|}}
  
 Replace ''YOUR_SECRET_TOKEN'' with a cryptographically secure token. Generate one with: Replace ''YOUR_SECRET_TOKEN'' with a cryptographically secure token. Generate one with:
Line 101: Line 105:
 Verify it is working: Verify it is working:
   curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10   curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | head -10
 +  
 +{{:wiki:ai:screenshot_2026-04-17_at_3.39.47 pm.png|}}
  
 You should see output starting with ''# HELP nv_build_info''. You should see output starting with ''# HELP nv_build_info''.
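The check above can also be scripted rather than read by eye. The following is a minimal sketch, not part of nv-monitor: ''check_metrics'' is a hypothetical helper name, and it assumes only what this page states, namely that a healthy response begins with ''# HELP nv_build_info''.

```shell
# check_metrics: hypothetical helper (not shipped with nv-monitor).
# Reads the metrics text on stdin and succeeds only if the first line
# is the HELP line this guide says to expect.
check_metrics() {
  local first
  # read returns non-zero on input without a trailing newline, so tolerate that
  IFS= read -r first || true
  case "$first" in
    "# HELP nv_build_info"*) echo "ok"; return 0 ;;
    *) echo "unexpected first line: $first" >&2; return 1 ;;
  esac
}

# Usage, assuming the exporter from the previous step is still running:
#   curl -s -H "Authorization: Bearer YOUR_SECRET_TOKEN" localhost:9101/metrics | check_metrics
```

A non-zero exit status here most likely points to a wrong token or to nv-monitor not listening on port 9101.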
Line 129: Line 135:
         - targets: ['172.17.0.1:9101']         - targets: ['172.17.0.1:9101']
   EOF   EOF
 +  
 Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4. Replace ''YOUR_SECRET_TOKEN'' with the same token you used in Step 4.
 +
 +{{:wiki:ai:screenshot_2026-04-17_at_3.40.23 pm.png|}}
  
 ====Why 172.17.0.1 and not localhost?==== ====Why 172.17.0.1 and not localhost?====
Line 151: Line 159:
     -p 3000:3000 \     -p 3000:3000 \
     grafana/grafana     grafana/grafana
 +    
 + {{:wiki:ai:screenshot_2026-04-17_at_3.41.01 pm.png|}}
  
 Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
Line 165: Line 175:
   * ''Prometheus Server is Healthy.''   * ''Prometheus Server is Healthy.''
   * ''{"database":"ok",...}''   * ''{"database":"ok",...}''
 +
 + {{:wiki:ai:screenshot_2026-04-17_at_3.41.28 pm.png|}}
 +
  
 \\ \\ \\ \\
Line 173: Line 186:
 **Note:** The DGX Spark does not have UFW installed. Use iptables directly: **Note:** The DGX Spark does not have UFW installed. Use iptables directly:
   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT   sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
 +  
 + {{:wiki:ai:screenshot_2026-04-17_at_3.42.51 pm.png|}}
  
 This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host). This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host).
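Running the ''-I'' command more than once inserts a duplicate rule each time. As a hedged sketch, the helper below first asks iptables whether the rule already exists, using its ''-C'' (check) option, and only inserts it when it is missing. The name ''ensure_rule'' is hypothetical; the rule arguments are the ones shown above.

```shell
# ensure_rule: hypothetical helper that makes the INPUT rule idempotent.
# Arguments are the rule specification, passed unchanged to iptables.
ensure_rule() {
  # -C exits 0 if an identical rule is already in the INPUT chain
  if sudo iptables -C INPUT "$@" 2>/dev/null; then
    echo "rule already present"
  else
    # -I inserts the rule at the top of the INPUT chain
    sudo iptables -I INPUT "$@" && echo "rule added"
  fi
}

# Usage, with the same rule as above:
#   ensure_rule -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
```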
Line 209: Line 224:
 On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname): On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname):
   ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP   ssh -L 9090:localhost:9090 -L 3000:localhost:3000 YOUR_USERNAME@YOUR_SPARK_IP
 +  
 +{{:wiki:ai:screenshot_2026-04-17_at_3.43.41 pm.png|}}
  
 Keep this terminal open. Then open in your Mac browser: Keep this terminal open. Then open in your Mac browser:
Line 230: Line 247:
   * State: **UP** (green)   * State: **UP** (green)
   * Scrape duration: under 10ms (typically ~2ms)   * Scrape duration: under 10ms (typically ~2ms)
 +
 +{{:wiki:ai:screenshot_2026-04-17_at_3.49.16 pm.png|}}
  
 If the state shows DOWN, see the Troubleshooting section. If the state shows DOWN, see the Troubleshooting section.
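The target state can also be read from a terminal instead of the browser. This sketch assumes the SSH tunnel from the earlier step is still open, so that ''localhost:9090'' reaches Prometheus, and it uses the Prometheus HTTP API endpoint ''/api/v1/targets''. The helper name ''target_health'' is hypothetical, and the text matching is a shortcut that relies on the compact JSON Prometheus returns; it is not a full JSON parser.

```shell
# target_health: hypothetical helper.
# Reads the /api/v1/targets JSON on stdin and prints the health value
# ("up", "down" or "unknown") of each target, one per line.
target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4
}

# Usage, assuming the tunnel to port 9090 is open:
#   curl -s localhost:9090/api/v1/targets | target_health
```

A line reading ''up'' corresponds to the green **UP** state in the web UI.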
Line 241: Line 260:
   * Set a new password when prompted   * Set a new password when prompted
  
 +{{:wiki:ai:screenshot_2026-04-17_at_3.54.06 pm.png|}}
 ====Add Prometheus as a data source==== ====Add Prometheus as a data source====
   - Click **Connections** in the left sidebar   - Click **Connections** in the left sidebar
Line 249: Line 269:
   - Click **Save & test**   - Click **Save & test**
   - You should see: **Successfully queried the Prometheus API**   - You should see: **Successfully queried the Prometheus API**
 +
 +{{:wiki:ai:screenshot_2026-04-17_at_3.57.53 pm.png|}}
 +{{:wiki:ai:screenshot_2026-04-17_at_3.59.01 pm.png|}}
 +{{:wiki:ai:Screenshot 2026-04-17 at 4.01.33 pm.png|}}
 +{{:wiki:ai:Screenshot 2026-04-17 at 4.02.47 PM.png|}}
 +
 +
 +
  
 ====Why ''http://prometheus:9090'' works==== ====Why ''http://prometheus:9090'' works====
Line 262: Line 290:
   - Add each panel below one at a time   - Add each panel below one at a time
   - For each panel: select the metric in the Builder tab, set the title in the right panel options, confirm the visualization type, then click **Back to dashboard**   - For each panel: select the metric in the Builder tab, set the title in the right panel options, confirm the visualization type, then click **Back to dashboard**
 +
 +{{:wiki:ai:screenshot_2026-04-17_at_4.03.34 pm.png|}}
 +{{:wiki:ai:Screenshot 2026-04-17 at 4.03.56 PM.png|}}
 +{{:wiki:ai:Screenshot 2026-04-17 at 4.11.42 PM.png|}}
 +
 +
 +
  
 ====Dashboard panels==== ====Dashboard panels====
Line 288: Line 323:
   cd ~/nv-monitor   cd ~/nv-monitor
   ./demo-load --gpu   ./demo-load --gpu
 +  
 +{{:wiki:ai:screenshot_2026-04-17_at_4.24.57 pm.png|}}
  
 Expected output: Expected output:
Line 385: Line 422:
 \\ \\ \\ \\
 [[wiki:ai:home-page|AI Home]] [[wiki:ai:home-page|AI Home]]
- 
wiki/ai/dgx-spark-monitoring.1777045681.txt.gz · Last modified: by swilson