This shows you the differences between two versions of the page.
| wiki:ai:dgx-spark-monitoring [2026/04/17 11:19] – swilson | wiki:ai:dgx-spark-monitoring [2026/04/17 11:36] (current) – [Step 1: SSH into the DGX Spark] swilson | ||
|---|---|---|---|
| Line 10: | Line 10: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 1 — SSH into the DGX Spark===== | + | =====Step 1: SSH into the DGX Spark===== |
| - | From your Mac terminal, SSH into the Spark: | + | From your local terminal, SSH into the Spark: |
| - | + | ssh YOUR_USERNAME@YOUR_SPARK_IP | |
| - | ssh < | + | |
| All steps below are run on the Spark unless noted otherwise. | All steps below are run on the Spark unless noted otherwise. | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 2 — Install Build Dependencies===== | + | =====Step 2: Install Build Dependencies===== |
| sudo apt install build-essential libncurses-dev -y | sudo apt install build-essential libncurses-dev -y | ||
| Line 29: | Line 28: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 3 — Clone and Build nv-monitor===== | + | =====Step 3: Clone and Build nv-monitor===== |
| cd ~ | cd ~ | ||
| Line 39: | Line 38: | ||
| Verify it works by launching the interactive TUI: | Verify it works by launching the interactive TUI: | ||
| - | |||
| ./ | ./ | ||
| Line 52: | Line 50: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 4 — Run nv-monitor as a Prometheus Exporter===== | + | =====Step 4: Run nv-monitor as a Prometheus Exporter===== |
| Start nv-monitor in headless mode with a Bearer token: | Start nv-monitor in headless mode with a Bearer token: | ||
| - | |||
| cd ~/ | cd ~/ | ||
| - | ./ | + | ./ |
| - | Replace '' | + | Replace '' |
| ====Flags explained==== | ====Flags explained==== | ||
| * **-n:** headless mode — no TUI, runs silently in the background | * **-n:** headless mode — no TUI, runs silently in the background | ||
| * **-p 9101:** expose Prometheus metrics endpoint on port 9101 | * **-p 9101:** expose Prometheus metrics endpoint on port 9101 | ||
| - | * **-t < | + | * **-t YOUR_SECRET_TOKEN:** require this Bearer token on every HTTP request |
| * **&:** run in background so the terminal stays free | * **&:** run in background so the terminal stays free | ||
| Line 72: | Line 69: | ||
| Verify it is working: | Verify it is working: | ||
| - | + | | |
| - | | + | |
| You should see output starting with ''# | You should see output starting with ''# | ||
| Line 88: | Line 84: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 5 — Create the Prometheus Configuration===== | + | =====Step 5: Create the Prometheus Configuration===== |
| mkdir ~/ | mkdir ~/ | ||
| Line 95: | Line 91: | ||
| global: | global: | ||
| scrape_interval: | scrape_interval: | ||
| - | |||
| scrape_configs: | scrape_configs: | ||
| - job_name: ' | - job_name: ' | ||
| authorization: | authorization: | ||
| - | credentials: | + | credentials: |
| static_configs: | static_configs: | ||
| - targets: [' | - targets: [' | ||
| EOF | EOF | ||
| - | Replace '' | + | Replace '' |
| ====Why 172.17.0.1 and not localhost?==== | ====Why 172.17.0.1 and not localhost?==== | ||
| Line 113: | Line 108: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 6 — Start Prometheus and Grafana in Docker===== | + | =====Step 6: Start Prometheus and Grafana in Docker===== |
| docker run -d \ | docker run -d \ | ||
| Line 127: | Line 122: | ||
| Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: | Connect both containers to a shared Docker network so Grafana can reach Prometheus by name: | ||
| - | |||
| docker network create monitoring | docker network create monitoring | ||
| docker network connect monitoring prometheus | docker network connect monitoring prometheus | ||
| Line 133: | Line 127: | ||
| Verify both are healthy: | Verify both are healthy: | ||
| - | |||
| docker ps | docker ps | ||
| curl -s localhost: | curl -s localhost: | ||
| Line 143: | Line 136: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 7 — Allow Docker Bridge to Reach nv-monitor===== | + | =====Step 7: Allow Docker Bridge to Reach nv-monitor===== |
| Docker containers live in the '' | Docker containers live in the '' | ||
| **Note:** The DGX Spark does not have UFW installed. Use iptables directly: | **Note:** The DGX Spark does not have UFW installed. Use iptables directly: | ||
| - | |||
| sudo iptables -I INPUT -s 172.17.0.0/ | sudo iptables -I INPUT -s 172.17.0.0/ | ||
| Line 157: | Line 149: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 8 — Access UIs from Your Mac via SSH Tunnel===== | + | =====Step 8: Access UIs from Your Mac via SSH Tunnel===== |
| SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale. | SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale. | ||
| On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname): | On your **Mac**, open a **new local terminal** (not an SSH session to the Spark — the prompt must show your Mac hostname): | ||
| - | + | | |
| - | | + | |
| Keep this terminal open. Then open in your Mac browser: | Keep this terminal open. Then open in your Mac browser: | ||
| Line 178: | Line 169: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 9 — Verify Prometheus is Scraping===== | + | =====Step 9: Verify Prometheus is Scraping===== |
| Open '' | Open '' | ||
| Line 189: | Line 180: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 10 — Configure Grafana===== | + | =====Step 10: Configure Grafana===== |
| Open '' | Open '' | ||
| Line 209: | Line 200: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 11 — Build the Dashboard===== | + | =====Step 11: Build the Dashboard===== |
| - Click **Dashboards** → **New** → **New dashboard** | - Click **Dashboards** → **New** → **New dashboard** | ||
| Line 235: | Line 226: | ||
| \\ \\ | \\ \\ | ||
| - | =====Step 12 — Load Test with demo-load===== | + | =====Step 12: Load Test with demo-load===== |
| '' | '' | ||
| Line 265: | Line 256: | ||
| cd ~/ | cd ~/ | ||
| - | ./ | + | ./ |
| docker start prometheus grafana | docker start prometheus grafana | ||
| **On your Mac (new local terminal):** | **On your Mac (new local terminal):** | ||
| - | ssh -L 9090: | + | ssh -L 9090: |
| Then open '' | Then open '' | ||
| Line 298: | Line 289: | ||
| **Fix 1** — Use the correct target IP in '' | **Fix 1** — Use the correct target IP in '' | ||
| - | |||
| targets: [' | targets: [' | ||
| Line 306: | Line 296: | ||
| **Fix 2** — Allow Docker bridge through the firewall: | **Fix 2** — Allow Docker bridge through the firewall: | ||
| - | |||
| sudo iptables -I INPUT -s 172.17.0.0/ | sudo iptables -I INPUT -s 172.17.0.0/ | ||
| Line 317: | Line 306: | ||
| ====Grafana cannot connect to Prometheus — " | ====Grafana cannot connect to Prometheus — " | ||
| The containers are not on the same Docker network. Run: | The containers are not on the same Docker network. Run: | ||
| - | |||
| docker network create monitoring | docker network create monitoring | ||
| docker network connect monitoring prometheus | docker network connect monitoring prometheus | ||