From your Mac terminal:
ssh swilson@100.91.118.118
All steps below are run on the Spark unless noted otherwise.
sudo apt install build-essential libncurses-dev -y
cd ~
git clone https://github.com/wentbackward/nv-monitor
cd nv-monitor
make
Verify it works by launching the interactive TUI:
./nv-monitor
Press q to quit.
Start nv-monitor in headless mode with a Bearer token:
cd ~/nv-monitor
./nv-monitor -n -p 9101 -t my-secret-token &
Verify it is working:
curl -s -H "Authorization: Bearer my-secret-token" localhost:9101/metrics | head -10
You should see output starting with # HELP nv_build_info.
nv_cpu_usage_percent: per-core CPU usage
nv_cpu_temperature_celsius: CPU temperature
nv_gpu_utilization_percent: GPU utilization
nv_gpu_power_watts: GPU power draw in watts
nv_gpu_temperature_celsius: GPU temperature
nv_memory_used_bytes: RAM used in bytes
nv_load_average: system load average (1m, 5m, 15m)
nv_uptime_seconds: system uptime
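To look at a single metric from the list above without scrolling through the whole endpoint, a small filter can help. This is a minimal sketch: metric_samples is a hypothetical helper name, and it assumes the endpoint returns standard Prometheus text exposition (one sample per line, comment lines starting with #).

```shell
# Print only the sample lines for one metric; input is read from stdin.
# $1 = metric name, for example nv_gpu_utilization_percent
metric_samples() {
  # Require a label block or a space right after the name, so a longer
  # metric name that shares the same prefix is not matched by accident.
  # "# HELP" and "# TYPE" lines start with "#" and are skipped by the anchor.
  grep -E "^${1}(\{| )"
}

# Possible usage on the Spark, with the token and port used in this guide:
#   curl -s -H "Authorization: Bearer my-secret-token" localhost:9101/metrics \
#     | metric_samples nv_gpu_utilization_percent
```

If the filter prints nothing, the metric name is likely misspelled or the exporter is not returning it.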
mkdir ~/monitoring
cat > ~/monitoring/prometheus.yml << 'EOF'
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'nv-monitor'
    authorization:
      credentials: 'my-secret-token'
    static_configs:
      - targets: ['172.17.0.1:9101']
EOF
localhost inside a container refers to the container itself, not the host machine.
172.17.0.1 is the Docker bridge gateway, the IP containers use to reach the host.
To confirm the gateway address on your system:
docker network inspect bridge | grep Gateway
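If the gateway on your machine is not 172.17.0.1, the scrape target has to change to match. The sketch below pulls the bare address out of the inspect output so it can be pasted into prometheus.yml. gateway_from_inspect is a hypothetical helper name, and the sketch assumes the JSON contains a line of the form "Gateway": "x.x.x.x".

```shell
# Read `docker network inspect` JSON on stdin and print the first gateway IP.
gateway_from_inspect() {
  # Keep only the quoted value from a "Gateway": "..." line, then stop
  # after the first match in case several subnets are configured.
  sed -n 's/.*"Gateway": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Possible usage on the Spark:
#   docker network inspect bridge | gateway_from_inspect
```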
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v ~/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana
Connect both containers to a shared Docker network so Grafana can reach Prometheus by name:
docker network create monitoring
docker network connect monitoring prometheus
docker network connect monitoring grafana
Verify both are healthy:
docker ps
curl -s localhost:9090/-/healthy
curl -s localhost:3000/api/health
Expected responses:
Prometheus Server is Healthy.
{"database":"ok",…}
Docker containers live in the 172.17.x.x subnet. The firewall blocks them from reaching port 9101 on the host by default.
sudo ufw allow from 172.17.0.0/16 to any port 9101
This is the critical rule that allows Prometheus to scrape nv-monitor.
Docker's iptables rules bypass UFW, making direct browser access unreliable. SSH port forwarding is simpler, more secure, and works over any network including Tailscale.
On your Mac, open a new terminal:
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118
With the tunnel open, Prometheus (http://localhost:9090) and Grafana (http://localhost:3000) are reachable from your Mac browser.
Open http://localhost:9090/targets in your browser.
You should see the nv-monitor job with state UP and a scrape duration under 10ms.
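The same check can be made from a terminal through the Prometheus HTTP API, which reports a health field for each active target. This is a rough sketch rather than a full JSON parser: target_health is a hypothetical helper name, and it assumes the response contains "health":"up" or "health":"down" entries as text.

```shell
# Read the JSON from /api/v1/targets on stdin and print one health value
# per target, one per line (for example: up, down, unknown).
target_health() {
  # First isolate each "health":"<value>" pair, then strip it to the value.
  grep -oE '"health": *"[a-z]+"' | sed -E 's/.*"([a-z]+)"$/\1/'
}

# Possible usage on the Spark, or on the Mac while the SSH tunnel is open:
#   curl -s localhost:9090/api/v1/targets | target_health
```

With only the nv-monitor job configured, a single line reading up corresponds to the UP state on the targets page.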
Open http://localhost:3000 in your browser.
Add Prometheus as a data source in Grafana with this URL:
http://prometheus:9090
Both containers are on the same Docker network (monitoring). Docker provides DNS resolution between containers on the same network, so prometheus resolves to the Prometheus container's IP automatically.
nv_gpu_utilization_percent (type: Time series)
nv_gpu_power_watts (type: Time series)
nv_gpu_temperature_celsius (type: Time series)
nv_cpu_usage_percent (type: Time series)
nv_cpu_temperature_celsius (type: Time series)
nv_memory_used_bytes (type: Gauge, unit: bytes (SI))
Save the dashboard as DGX Spark Monitor. Set auto-refresh to 10s.
Build the synthetic load generator:
cd ~/nv-monitor
make demo-load
./demo-load --gpu
This generates sinusoidal CPU and GPU load simultaneously. Watch the Grafana dashboard for live activity.
Verify the GPU is under load:
nvidia-smi
Expected output shows:
./demo-load using ~170MiB GPU memory
Press Ctrl+C to stop the load.
nv-monitor and Docker containers do not auto-restart. To bring everything back:
On spark02:
cd ~/nv-monitor
./nv-monitor -n -p 9101 -t my-secret-token &
docker start prometheus grafana
On your Mac (new terminal):
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118
Then open http://localhost:3000.
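If restarting by hand after every reboot becomes tedious, one option is to let systemd start nv-monitor and give the containers a Docker restart policy. The sketch below only prints a unit file. nv_monitor_unit is a hypothetical helper name, the unit file name nv-monitor.service is an assumption, and the paths assume the repository was cloned to /home/<user>/nv-monitor as in this guide.

```shell
# Print a systemd unit for nv-monitor to stdout.
# $1 = Linux user that owns the checkout, $2 = Bearer token.
nv_monitor_unit() {
  cat <<EOF
[Unit]
Description=nv-monitor Prometheus exporter
After=network-online.target
Wants=network-online.target

[Service]
User=${1}
WorkingDirectory=/home/${1}/nv-monitor
ExecStart=/home/${1}/nv-monitor/nv-monitor -n -p 9101 -t ${2}
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Possible usage on the Spark:
#   nv_monitor_unit swilson my-secret-token \
#     | sudo tee /etc/systemd/system/nv-monitor.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now nv-monitor
#   docker update --restart unless-stopped prometheus grafana
```

Review the printed unit before installing it. The token ends up in a file under /etc/systemd/system, so treat that file like any other place a credential is stored.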
docker stop prometheus grafana
docker rm prometheus grafana
docker network rm monitoring
pkill nv-monitor
rm -rf ~/monitoring
A file named nv-monitor already existed in the home directory (bad previous download).
rm nv-monitor
git clone https://github.com/wentbackward/nv-monitor
cd nv-monitor
make
Two causes, apply both fixes:
Fix 1 — Use the correct target IP in prometheus.yml:
targets: ['172.17.0.1:9101']
Then restart: docker restart prometheus
Fix 2 — Allow Docker bridge through the firewall:
sudo ufw allow from 172.17.0.0/16 to any port 9101
Containers are not on the same Docker network.
docker network create monitoring
docker network connect monitoring prometheus
docker network connect monitoring grafana
Then set Grafana data source URL to http://prometheus:9090.
Docker's own iptables rules bypass UFW, so opening ports in UFW does not give reliable browser access. Use an SSH tunnel instead:
ssh -L 9090:localhost:9090 -L 3000:localhost:3000 swilson@100.91.118.118
The label filter {cpu="total"} does not exist. Remove the filter:
nv_cpu_usage_percent
Also change visualization from Stat to Time series.
No unit set on the panel. Edit panel → Standard options → Unit → bytes (SI).
This is a sysadmin audit policy on the Spark. Commands still execute — this is just a notification to the admin team that sudo was used. It is not an error.