From your Mac terminal:
ssh swilson@100.91.118.118
All steps below are run on the Spark unless noted otherwise.
[SCREENSHOT: Mac terminal showing swilson@spark02:~$ prompt after successful SSH login]
sudo apt install build-essential libncurses-dev -y
Both packages were already installed on spark02 (build-essential 12.10ubuntu1 and libncurses-dev 6.4+20240113). If they are already installed, apt reports “already the newest version” and exits cleanly.
cd ~
git clone https://github.com/wentbackward/nv-monitor
cd nv-monitor
make
If the repo already exists from a previous run, git will print “destination path 'nv-monitor' already exists” and make will print “Nothing to be done for 'all'” — both are fine, the binary is already built.
Verify it works by launching the interactive TUI:
./nv-monitor
Press q to quit.
[SCREENSHOT: nv-monitor TUI showing all 20 cores (0-9 X725 efficiency, 10-19 X925 performance), GPU 0 NVIDIA GB10 at 42C 4.7W 208MHz, MEM 5.4G used / 121.7G, unified memory label, uptime 11d 17h]
Start nv-monitor in headless mode with a Bearer token:
cd ~/nv-monitor
./nv-monitor -n -p 9101 -t my-secret-token &
On startup it prints:
Prometheus metrics at http://0.0.0.0:9101/metrics
Running headless (Ctrl+C to stop)
Verify it is working:
curl -s -H "Authorization: Bearer my-secret-token" localhost:9101/metrics | head -10
You should see output starting with # HELP nv_build_info.
[SCREENSHOT: Terminal showing nv-monitor background process (PID 52653) and curl output with # HELP nv_build_info, nv_uptime_seconds, nv_load_average metrics]
- nv_cpu_usage_percent — per-core CPU usage
- nv_cpu_temperature_celsius — CPU temperature
- nv_gpu_utilization_percent — GPU utilization
- nv_gpu_power_watts — GPU power draw in watts
- nv_gpu_temperature_celsius — GPU temperature
- nv_memory_used_bytes — RAM used in bytes
- nv_load_average — system load average (1m, 5m, 15m)
- nv_uptime_seconds — system uptime
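Individual gauges can be pulled out of a scrape with a little awk. A minimal sketch (the sample lines below are illustrative, not captured from a real scrape):

```shell
# Illustrative scrape output in the Prometheus exposition format
# (the gpu="0" label is an assumption, not confirmed from nv-monitor)
sample='nv_gpu_temperature_celsius{gpu="0"} 42.0
nv_gpu_power_watts{gpu="0"} 4.7'

# Pick out one gauge: match the metric name, print the value field
temp=$(printf '%s\n' "$sample" | awk '/^nv_gpu_temperature_celsius/ {print $2}')
echo "GPU temp: ${temp}C"
```

On a live system, replace the sample with the output of `curl -s -H "Authorization: Bearer my-secret-token" localhost:9101/metrics`.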
mkdir ~/monitoring
cat > ~/monitoring/prometheus.yml << 'EOF'
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'nv-monitor'
    authorization:
      credentials: 'my-secret-token'
    static_configs:
      - targets: ['172.17.0.1:9101']
EOF
- localhost inside a container refers to the container itself, not the host machine
- 172.17.0.1 is the Docker bridge gateway — the IP containers use to reach the host

Confirm the gateway address with:

docker network inspect bridge | grep Gateway
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v ~/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana
Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
docker network create monitoring
docker network connect monitoring prometheus
docker network connect monitoring grafana
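The two docker run commands plus the network wiring can equivalently be captured in a compose file. A sketch, under the assumption that you prefer to manage the stack with docker compose (the file location and service names are my own):

```yaml
# ~/monitoring/docker-compose.yml (sketch, not from the original setup)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ~/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
# compose places both services on one default network, so grafana can
# still reach prometheus by container name, as in the manual setup
```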
Verify both are healthy:
docker ps
curl -s localhost:9090/-/healthy
curl -s localhost:3000/api/health
Expected responses:
Prometheus Server is Healthy.
{"database":"ok","version":"12.4.2",…}
Docker containers live in the 172.17.x.x subnet. The firewall must allow them to reach port 9101 on the host.
Note: spark02 does not have UFW installed. Use iptables directly:
sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
This is the critical rule that allows Prometheus to scrape nv-monitor.
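Note that iptables -I inserts the rule again on every run, so re-running the setup leaves duplicates. A check-then-insert sketch (same rule as above; the -C guard is my addition):

```shell
# -C checks for an existing identical rule; insert only if absent
sudo iptables -C INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT 2>/dev/null \
  || sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
```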
spark02 has a sysadmin audit policy that broadcasts a message to all terminals when sudo is used. The command still executes — the broadcast is just a notification to the admin team. It is not an error.
To view the Prometheus and Grafana UIs from your Mac, use SSH port forwarding: it is simpler, more secure, and works over Tailscale without opening firewall ports.
On your Mac, open a new local terminal (not an SSH session — the prompt must show your Mac hostname, not spark02):
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118
Keep this terminal open. Then open in your Mac browser:
If you run the tunnel command from a terminal that is already SSH'd into spark02, ssh connects back to spark02 and fails to bind the local forwards with “Address already in use”, because ports 9090 and 3000 on the Spark are already bound by the Docker containers. Always run the tunnel from a Mac local terminal.
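The mistake is easy to guard against in a small wrapper. A sketch (the tunnel_guard helper and the hostname check are my own, not part of the original setup):

```shell
# Hypothetical guard: print the tunnel command only when not already on spark02
tunnel_guard() {
  if [ "$1" = "spark02" ]; then
    echo "refusing: run the tunnel from your Mac, not from spark02"
  else
    echo "ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118"
  fi
}
tunnel_guard "$(hostname)"
```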
Open http://localhost:9090/targets in your browser.
You should see the nv-monitor job with state UP and a scrape duration of ~2ms.
[SCREENSHOT: Prometheus Target health page showing nv-monitor job, endpoint http://172.17.0.1:9101/metrics, state UP (green), last scrape 11s ago, duration 2ms]
Open http://localhost:3000 in your browser.
Set the Prometheus data source URL to:
http://prometheus:9090
[SCREENSHOT: Grafana data source config page showing URL http://prometheus:9090 and green “Successfully queried the Prometheus API” confirmation banner]
Both containers are on the same Docker network (monitoring). Docker provides DNS resolution between containers on the same network, so prometheus resolves to the Prometheus container's IP automatically.
- nv_cpu_usage_percent — type: Time series
- nv_cpu_temperature_celsius — type: Time series
- nv_gpu_utilization_percent — type: Time series
- nv_gpu_power_watts — type: Time series
- nv_gpu_temperature_celsius — type: Time series
- nv_memory_used_bytes — type: Gauge — unit: bytes (SI)

Save the dashboard as DGX Spark Monitor. Set auto-refresh to 10s.
When adding panels, make sure the Data source dropdown shows prometheus-10 (the data source you configured), not the default “prometheus” placeholder. If a panel shows “No data”, check the data source selection first.
Switch to Last 5 minutes time range and click Run queries. If still no data, click Code in the query editor and type the metric name directly (e.g. nv_gpu_utilization_percent), then run queries. The GPU utilization panel will show a flat 0% line at idle — that is correct, not an error.
demo-load is included in the nv-monitor repo and built by default with make.
cd ~/nv-monitor
./demo-load --gpu
Output:
Starting CPU load on 20 cores (sinusoidal, phased)
Starting GPU load on 1 GPU (sinusoidal)
Will stop in 5m 0s (Ctrl+C to stop early)
GPU 0: calibrating... done (kernel=0.01ms, blocks=1024)
GPU 0: load active
This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard for live spikes.
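A few PromQL expressions that show the load shape well in these panels. The metric names match the list earlier in this page; the aggregations are suggestions of mine, not from the original dashboard:

```
# Average CPU usage across all 20 cores
avg(nv_cpu_usage_percent)

# GPU power, smoothed over a one-minute window
avg_over_time(nv_gpu_power_watts[1m])

# Memory used, converted to GiB
nv_memory_used_bytes / 1024 / 1024 / 1024
```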
[SCREENSHOT: demo-load terminal output showing GPU 0: load active]
[SCREENSHOT: Grafana dashboard with all 4 panels spiking — GPU Power jumping from 4.5W to 12W, CPU Usage % hitting 80-100% across all cores, GPU Utilization spiking, CPU Temperature climbing from 45C to 70C+]
Press Ctrl+C to stop early, or wait 5 minutes for it to finish automatically.
nv-monitor and Docker containers do not auto-restart. To bring everything back:
On spark02:
cd ~/nv-monitor
./nv-monitor -n -p 9101 -t my-secret-token &
docker start prometheus grafana
On your Mac (new local terminal):
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118
Then open http://localhost:3000.
docker stop prometheus grafana
docker rm prometheus grafana
docker network rm monitoring
pkill nv-monitor
rm -rf ~/monitoring
A file named nv-monitor already existed in the home directory (bad previous download).
rm nv-monitor
git clone https://github.com/wentbackward/nv-monitor
cd nv-monitor
make
Two causes, apply both fixes:
Fix 1 — Use the correct target IP in prometheus.yml:
targets: ['172.17.0.1:9101']
Then restart: docker restart prometheus
Fix 2 — Allow Docker bridge through the firewall (spark02 uses iptables, not UFW):
sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
spark02 does not have UFW installed. Use iptables directly (see Step 7).
You ran the tunnel command from inside an existing SSH session to spark02 instead of from a Mac local terminal. Open a new terminal on your Mac (prompt should show your Mac hostname) and run the tunnel command from there.
Containers are not on the same Docker network.
docker network create monitoring
docker network connect monitoring prometheus
docker network connect monitoring grafana
Then set Grafana data source URL to http://prometheus:9090.
Docker publishes container ports through its own iptables rules, which bypass firewall frontends such as UFW. Use the SSH tunnel instead:
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118
1. Check the Data source dropdown — must be **prometheus-10**, not the placeholder "prometheus"
2. Change time range to **Last 5 minutes** and click **Run queries**
3. Switch to **Code** mode and type the metric name directly
4. GPU utilization at 0% is correct at idle — it is not an error
No unit set on the panel. Edit panel → Standard options → Unit → bytes (SI).
This is a sysadmin audit policy on spark02. Commands still execute — this is just a notification to the admin team that sudo was used. It is not an error.