From your Mac terminal, SSH into the Spark:
ssh <your-username>@<spark-ip>
All steps below are run on the Spark unless noted otherwise.
sudo apt install build-essential libncurses-dev -y
If already installed, apt will report “already the newest version” and exit cleanly — that is fine.
cd ~
git clone https://github.com/wentbackward/nv-monitor
cd nv-monitor
make
If the repo already exists from a previous run, git will print “destination path 'nv-monitor' already exists” and make will print “Nothing to be done for 'all'” — both are fine, the binary is already built.
Verify it works by launching the interactive TUI:
./nv-monitor
Press q to quit.
Start nv-monitor in headless mode with a Bearer token:
cd ~/nv-monitor
./nv-monitor -n -p 9101 -t <your-secret-token> &
Replace <your-secret-token> with a strong token of your choice. You will use this same token in the Prometheus config in Step 5.
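One simple way to generate a strong token (an illustration, not a requirement — any high-entropy string works):

```shell
# Generate a random 64-character hex string to use as the Bearer token
openssl rand -hex 32
```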
On startup it prints:
Prometheus metrics at http://0.0.0.0:9101/metrics
Running headless (Ctrl+C to stop)
Verify it is working:
curl -s -H "Authorization: Bearer <your-secret-token>" localhost:9101/metrics | head -10
You should see output starting with # HELP nv_build_info.
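As a quick sanity check of the token requirement, a request without the Authorization header should be rejected (the exact status code depends on nv-monitor — expect something in the 401/403 range rather than 200):

```shell
# Print only the HTTP status code for an unauthenticated request
curl -s -o /dev/null -w '%{http_code}\n' localhost:9101/metrics
```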
nv_cpu_usage_percent — per-core CPU usage
nv_cpu_temperature_celsius — CPU temperature
nv_gpu_utilization_percent — GPU utilization
nv_gpu_power_watts — GPU power draw in watts
nv_gpu_temperature_celsius — GPU temperature
nv_memory_used_bytes — RAM used in bytes
nv_load_average — system load average (1m, 5m, 15m)
nv_uptime_seconds — system uptime
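To see exactly which metrics your build exposes, you can list the distinct metric names straight from the endpoint (the grep pattern is a sketch — it assumes all metric names start with nv_ and use lowercase letters, digits, and underscores, as the list above does):

```shell
# List distinct nv_ metric names from the endpoint (replace the token)
curl -s -H "Authorization: Bearer <your-secret-token>" localhost:9101/metrics \
  | grep -v '^#' | grep -oE '^nv_[a-z0-9_]+' | sort -u
```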
mkdir -p ~/monitoring
cat > ~/monitoring/prometheus.yml << 'EOF'
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'nv-monitor'
    authorization:
      credentials: '<your-secret-token>'
    static_configs:
      - targets: ['172.17.0.1:9101']
EOF
Replace <your-secret-token> with the same token you used in Step 4.
localhost inside a container refers to the container itself, not the host machine. 172.17.0.1 is the Docker bridge gateway — the IP that containers use to reach the host. To confirm it:
docker network inspect bridge | grep Gateway
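If you prefer to script the lookup, Docker's --format flag can print just the gateway address (a sketch using Docker's Go-template output; the grep above works just as well):

```shell
# Print only the bridge gateway IP (typically 172.17.0.1)
docker network inspect bridge --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
```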
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v ~/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana
Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
docker network create monitoring
docker network connect monitoring prometheus
docker network connect monitoring grafana
Verify both are healthy:
docker ps
curl -s localhost:9090/-/healthy
curl -s localhost:3000/api/health
Expected responses:
Prometheus Server is Healthy.
{"database":"ok",…}
Docker containers live in the 172.17.x.x subnet. The host firewall must allow them to reach port 9101.
Note: The DGX Spark does not have UFW installed. Use iptables directly:
sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
This is the critical rule that allows Prometheus (running in Docker) to scrape nv-monitor (running on the host).
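To confirm the rule took effect, list the INPUT chain and look for the port. Note that plain iptables rules do not survive a reboot: re-run the insert command after rebooting, or persist it with a tool such as iptables-persistent if that package is available on the Spark.

```shell
# Show numbered INPUT rules and confirm the 9101 ACCEPT rule is present
sudo iptables -L INPUT -n --line-numbers | grep 9101
```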
The Spark has a sysadmin audit policy that broadcasts a message to all terminals when sudo is used. The command still executes — this is just a notification to the admin team. It is not an error.
SSH port forwarding is the recommended way to access the Grafana and Prometheus UIs from your Mac. It is simpler and more secure than opening firewall ports, and works over Tailscale.
On your Mac, open a new local terminal (not an SSH session to the Spark — the prompt must show your Mac hostname):
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 <your-username>@<spark-ip>
Keep this terminal open. Then open in your Mac browser:
If you run the SSH tunnel command from a terminal that is already SSH'd into the Spark, it will SSH back to itself and fail with “Address already in use” — because ports 9090 and 3000 are already bound by the Docker containers on the Spark. Always run the tunnel from a Mac local terminal.
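Before opening the tunnel, you can confirm nothing on the Mac is already listening on the local ports (no output means the ports are free):

```shell
# Check for existing listeners on the ports the tunnel will use
lsof -nP -iTCP:9090 -sTCP:LISTEN
lsof -nP -iTCP:3000 -sTCP:LISTEN
```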
Open http://localhost:9090/targets in your browser.
You should see the nv-monitor job listed with state UP.
If the state shows DOWN, see the Troubleshooting section.
Open http://localhost:3000 in your browser.
http://prometheus:9090
Both containers are on the same Docker network (monitoring). Docker provides DNS resolution between containers on the same network, so prometheus resolves to the Prometheus container's IP automatically. Using localhost:9090 here would not work — it would refer to the Grafana container itself.
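You can verify the cross-container DNS from inside Grafana. This sketch assumes the Grafana image ships a busybox wget (most tags do); if not, any HTTP client available inside the container will serve:

```shell
# Resolve 'prometheus' from inside the grafana container and hit its health endpoint
docker exec grafana wget -qO- http://prometheus:9090/-/healthy
```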
nv_cpu_usage_percent — type: Time series
nv_cpu_temperature_celsius — type: Time series
nv_gpu_utilization_percent — type: Time series
nv_gpu_power_watts — type: Time series
nv_gpu_temperature_celsius — type: Time series
nv_memory_used_bytes — type: Gauge — unit: bytes (SI)
Save the dashboard. Set auto-refresh to 10s using the dropdown next to the Refresh button.
When adding each panel, confirm the Data source dropdown shows the Prometheus data source you configured (not the default placeholder). If a panel shows “No data”, check this first.
demo-load is included in the nv-monitor repo and already built by make in Step 3.
cd ~/nv-monitor
./demo-load --gpu
Expected output:
Starting CPU load on 20 cores (sinusoidal, phased)
Starting GPU load on 1 GPU (sinusoidal)
Will stop in 5m 0s (Ctrl+C to stop early)
GPU 0: calibrating... done
GPU 0: load active
This generates sinusoidal CPU and GPU load simultaneously for 5 minutes. Watch the Grafana dashboard — you should see all panels spike within a few seconds.
Press Ctrl+C to stop early, or wait 5 minutes for it to finish automatically.
nv-monitor and Docker containers do not auto-restart. To bring everything back:
On the Spark:
cd ~/nv-monitor
./nv-monitor -n -p 9101 -t <your-secret-token> &
docker start prometheus grafana
On your Mac (new local terminal):
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 <your-username>@<spark-ip>
Then open http://localhost:3000.
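If you would rather not restart the containers by hand, Docker can do it for you. This covers Prometheus and Grafana only — nv-monitor itself still needs to be started manually (or wrapped in a systemd unit, which is beyond this guide):

```shell
# Have Docker restart both containers automatically after a daemon restart or reboot
docker update --restart unless-stopped prometheus grafana
```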
docker stop prometheus grafana
docker rm prometheus grafana
docker network rm monitoring
pkill nv-monitor
rm -rf ~/monitoring
A file or directory named nv-monitor already existed in the home directory before cloning.
rm -rf ~/nv-monitor
git clone https://github.com/wentbackward/nv-monitor
cd nv-monitor
make
Apply both fixes:
Fix 1 — Use the correct target IP in prometheus.yml. The target must be the Docker bridge gateway, not localhost:
targets: ['172.17.0.1:9101']
Find the correct gateway IP with: docker network inspect bridge | grep Gateway
Then restart Prometheus: docker restart prometheus
Fix 2 — Allow Docker bridge through the firewall:
sudo iptables -I INPUT -s 172.17.0.0/16 -p tcp --dport 9101 -j ACCEPT
The DGX Spark does not have UFW installed. Use iptables directly (see Step 7).
You ran the tunnel command from inside an existing SSH session to the Spark. The Spark already has Docker containers binding ports 9090 and 3000. Open a new terminal on your Mac (prompt must show your Mac hostname, not the Spark) and run the tunnel from there.
The containers are not on the same Docker network. Run:
docker network create monitoring
docker network connect monitoring prometheus
docker network connect monitoring grafana
Then set the Grafana data source URL to http://prometheus:9090.
Docker's iptables rules can bypass UFW, making direct browser access unreliable. Use SSH tunneling instead (see Step 8).
1. Check the Data source dropdown — must point to your configured Prometheus data source
2. Change time range to **Last 5 minutes** and click **Run queries**
3. Switch to **Code** mode and type the metric name directly
4. GPU utilization showing 0% at idle is correct — not an error
No unit is set on the panel. Edit the panel → Standard options → Unit → select bytes (SI).
This is a sysadmin audit policy. The command still executes — the broadcast is just a notification to the admin team. It is not an error.