A hands-on command reference for the four major lab domains on the NVIDIA Certified Professional — AI Operations exam. Work through each domain as a sequence: review the concept, run the commands, then test yourself with the troubleshooting scenarios at the end of each section.
BCM (formerly Bright Cluster Manager) is NVIDIA's central control plane for HPC/AI clusters. The two interfaces you must know are cmsh (the cluster management shell) and Base View (the web UI). The exam leans heavily on cmsh.
    # SSH to the head node, then drop into the management shell
    ssh root@head-node
    cmsh

    # Inside cmsh, get help and orient yourself
    [head]% help
    [head]% main                  # return to the top-level prompt
    [head]% status                # overall cluster status (services, license, HA)
    [head]% partition use base
cmsh uses modes. You enter a mode (device, category, softwareimage, user, etc.), then operate on objects inside it. After making changes, you must commit — pending changes show with an asterisk in the prompt.
The typical provisioning flow: pick or create a software image, attach it to a category, assign nodes to that category, then power them on so they PXE-boot and provision.
    cmsh
    [head]% softwareimage
    [head->softwareimage]% list
    [head->softwareimage]% clone default-image gpu-image
    [head->softwareimage]% use gpu-image
    [head->softwareimage[gpu-image]]% set kernelversion 5.15.0-...
    [head->softwareimage[gpu-image]]% commit

    # Create a category that uses this image
    [head]% category
    [head->category]% add gpu-nodes
    [head->category[gpu-nodes*]]% set softwareimage gpu-image
    [head->category[gpu-nodes*]]% commit

    # Assign a node to the category and provision it
    [head]% device
    [head->device]% use node001
    [head->device[node001]]% set category gpu-nodes
    [head->device[node001]]% commit
    [head->device[node001]]% power on        # or: reset, off
    [head->device[node001]]% status          # watch state: INSTALLING -> UP

    # Useful one-liners (run from cmsh prompt)
    [head]% device list                          # all devices and state
    [head]% device status -c gpu-nodes           # status filtered by category
    [head]% device foreach -c gpu-nodes (status)
Outside cmsh, the equivalent inspection commands include:
    cmha status                             # HA status if head nodes are paired
    module load shared                      # load the BCM environment module
    pdsh -g category=gpu-nodes uptime       # parallel ssh across a category
“Aligning” means making sure the running nodes match the software image on the head node. After you edit packages or files in the image, you sync them out.
    # Update packages inside an image (chroot-style)
    cm-chroot-sw-img /cm/images/gpu-image
    # ... inside the chroot ...
    yum install -y datacenter-gpu-manager
    exit

    # Push the image to running nodes: choose ONE depending on the situation
    [head]% device use node001
    [head->device[node001]]% imageupdate     # incremental sync (live)
    [head->device[node001]]% reinstall       # full PXE reprovision

    # Verify nodes are in sync with the image
    [head]% device imageupdate -c gpu-nodes --dry-run   # show what would change
Know the difference cold for the exam: imageupdate is a live rsync of the image to a running node (fast, but not all changes apply without reboot). reinstall wipes and PXE-provisions from scratch (clean but slow).
    # Check current firmware across a category
    [head]% device foreach -c gpu-nodes (get bmcsettings)

    # BCM ships cm-update-firmware for BMC/BIOS rollouts
    cm-update-firmware --help
    cm-update-firmware --category gpu-nodes --firmware-package /root/fw/bios-1.2.bin

    # GPU firmware (e.g., VBIOS) is usually handled by nvidia-smi or vendor tools
    nvidia-smi -q | grep -i "vbios"
    # Vendor flashing is typically done in a maintenance window with nodes drained
    [head]% user
    [head->user]% add alice
    [head->user[alice*]]% set commonname "Alice Researcher"
    [head->user[alice*]]% set groupname researchers
    [head->user[alice*]]% set password       # interactive
    [head->user[alice*]]% commit

    # Profiles / RBAC: assign what a user can do in BCM itself
    [head]% profile list
    [head]% profile use readonly
    [head->profile[readonly]]% show
    [head]% user use alice
    [head->user[alice]]% set profile readonly
    [head->user[alice]]% commit
Built-in profiles to remember: admin, readonly, portal, cloudjob. You can clone and customize them with profile clone admin custom-admin.
nvidia-smi is the per-node quick look. dcgmi is the command-line client for DCGM (Data Center GPU Manager), the cluster-grade tool: it talks to the nv-hostengine daemon, which runs as a service, and supports persistent health checks, job-level stats, and policy.
    # nvidia-smi: fast situational awareness
    nvidia-smi                                     # default table
    nvidia-smi -q                                  # exhaustive details
    nvidia-smi -q -d TEMPERATURE,POWER,CLOCK,ECC   # only the sections you need
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,temperature.gpu \
        --format=csv -l 1                          # CSV, refresh every 1s
    nvidia-smi dmon -s pucvmet -d 1                # device monitor (p=power/temp, u=util, c=clocks,
                                                   #   v=violations, m=memory, e=ECC, t=PCIe throughput)
    nvidia-smi pmon -c 5                           # per-process, 5 samples
    nvidia-smi topo -m                             # GPU/NIC topology matrix (NVLink, PIX, SYS)
    nvidia-smi -pm 1                               # persistence mode on (keep driver loaded)
    nvidia-smi -i 0 -ac 1215,1410                  # set memory,graphics application clocks

    # dcgmi: cluster-grade health and diagnostics
    systemctl status nvidia-dcgm                   # the host engine must be running
    dcgmi discovery -l                             # list GPUs + entity IDs
    dcgmi group -c mygroup --default               # create a group with all GPUs
    dcgmi group -l                                 # list groups
    dcgmi health -g <group_id> -s mpi              # set health watches: m=memory, p=PCIe, i=InfoROM
                                                   #   (others: t=thermal/power, n=NVLink, a=all)
    dcgmi health -g <group_id> -c                  # check current health
    dcgmi diag -r 1                                # quick diagnostic (~seconds)
    dcgmi diag -r 2                                # medium (~2 min)
    dcgmi diag -r 3                                # long (~15-30 min, real workload-like)
    dcgmi diag -r 4                                # extra long, stress-test grade
    dcgmi dmon -e 203,204,250,252 -d 1000          # monitor specific field IDs every 1000ms
    dcgmi stats -g <group_id> -e                   # enable job stats collection
    dcgmi stats -g <group_id> -s myjob             # start tracking job "myjob"
    dcgmi stats -x myjob                           # stop tracking
    dcgmi stats -j myjob                           # print the job summary
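Field IDs worth memorizing: 100 SM clock, 101 memory clock, 150 GPU temperature, 155 power usage, 203 GPU utilization, 204 memory copy utilization, 250 framebuffer total, 251 framebuffer free, 252 framebuffer used, and 1001+ for the profiling metrics (DCGM Profiling: SM active, tensor active, etc.).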
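A small lookup table can make it easier to build dcgmi dmon invocations without memorizing every number. The sketch below is illustrative only: the numeric IDs are the values defined in DCGM's dcgm_fields.h, while the short names (sm_clock, fb_used, and so on) and the dmon_command helper are hypothetical conveniences, not part of DCGM.

```python
# Commonly watched DCGM field IDs (numeric values from dcgm_fields.h).
# The short keys are made-up labels for this sketch, not DCGM names.
FIELD_IDS = {
    "sm_clock": 100,
    "mem_clock": 101,
    "gpu_temp": 150,
    "power_usage": 155,
    "gpu_util": 203,
    "mem_copy_util": 204,
    "fb_total": 250,
    "fb_free": 251,
    "fb_used": 252,
    "sm_active": 1002,
    "tensor_active": 1004,
}


def dmon_command(names, interval_ms=1000):
    """Build the argv list for `dcgmi dmon` watching the named fields."""
    ids = ",".join(str(FIELD_IDS[name]) for name in names)
    return ["dcgmi", "dmon", "-e", ids, "-d", str(interval_ms)]


if __name__ == "__main__":
    # Print a command line that could be pasted into a shell.
    print(" ".join(dmon_command(["gpu_util", "mem_copy_util", "fb_used"])))
```

Returning an argv list rather than a string keeps the helper safe to pass to subprocess.run without shell quoting concerns.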
A user reports node005 is in the cluster but jobs aren't landing on it. Walk through the diagnosis.
    cmsh
    [head]% device use node005
    [head->device[node005]]% status              # is it UP / CLOSED / DOWN / INSTALLER_FAILED?
    [head->device[node005]]% get category        # right category?
    [head->device[node005]]% get powerstatus
    [head->device[node005]]% latesthealthdata    # BCM health checks
    [head->device[node005]]% events              # recent events for this node

    # If healthy in BCM but Slurm avoids it, check the scheduler (section 3)
The NCP-AIO exam tests the NVIDIA GPU Operator stack: Node Feature Discovery → Driver → Container Toolkit → Device Plugin → DCGM Exporter → MIG Manager → GPU Feature Discovery. You should be able to install it, inspect each component, and request GPUs in a pod spec.
    # Prereqs: containerd or CRI-O, a running cluster, helm 3+, kubectl
    kubectl create ns gpu-operator
    kubectl label ns gpu-operator pod-security.kubernetes.io/enforce=privileged
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update

    # Most common install: let the operator manage drivers AND toolkit
    helm install --wait gpu-operator nvidia/gpu-operator \
        -n gpu-operator \
        --set driver.enabled=true \
        --set toolkit.enabled=true

    # If drivers are already on the host, disable the operator's driver
    helm install --wait gpu-operator nvidia/gpu-operator \
        -n gpu-operator --set driver.enabled=false
    kubectl get pods -n gpu-operator             # all components Running/Completed
    kubectl get nodes -o json | jq '.items[].status.capacity' | grep nvidia.com
    # Expect: "nvidia.com/gpu": "8" (or similar)
    kubectl describe node <gpu-node> | grep -A5 -i "nvidia.com"
    kubectl get clusterpolicies -n gpu-operator -o yaml | less   # the operator's CR

    # Look at GPU Feature Discovery labels (they drive scheduling)
    kubectl get node <gpu-node> -o json | jq '.metadata.labels' | grep nvidia
    # Examples: nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
    #           nvidia.com/cuda.driver.major=535
    #           nvidia.com/mig.strategy=single
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-smoke-test
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda
          image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1

Apply it and read the output:

    kubectl apply -f cuda-smoke-test.yaml
    kubectl logs cuda-smoke-test                 # should show the GPU table
MIG is managed by the MIG Manager component. Switch a node into a profile by labeling it:
    # List the available MIG configs from the configmap
    kubectl get cm -n gpu-operator default-mig-parted-config -o yaml

    # Apply a profile to a node; the MIG Manager will partition the GPU
    kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite

    # Watch the MIG Manager pod do its work
    kubectl logs -n gpu-operator -l app=nvidia-mig-manager -f

    # After it succeeds, the node advertises sliced resources
    kubectl describe node <gpu-node> | grep nvidia.com/mig
    # e.g. nvidia.com/mig-1g.10gb: 7
In a pod, you request the sliced resource by name:
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
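Two MIG strategies are set at install time (mig.strategy); know both:
- single: every GPU on the node uses the same MIG profile, and the slices are advertised as plain nvidia.com/gpu mapped to slices.
- mixed: slices are advertised as typed resources (nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb).

Time-slicing is the other way to share a GPU. It is configured through a ConfigMap:

    # ConfigMap consumed by the device plugin
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4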
    kubectl apply -f time-slicing-config.yaml

    # Tell the cluster policy to use it
    kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
        -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
Each physical GPU now advertises 4 logical GPUs — useful for inference and dev workloads but not isolated like MIG.
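The capacity arithmetic behind time-slicing is simple enough to sketch. The helper names below are hypothetical, and the model deliberately ignores GPU memory: time-sliced replicas share the whole card's memory with no partitioning, so the pod count it returns is a scheduling ceiling, not a guarantee that the workloads fit.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """nvidia.com/gpu count a node reports once time-slicing is applied."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def max_schedulable_pods(physical_gpus: int, replicas: int, gpus_per_pod: int = 1) -> int:
    """Upper bound on pods that can be scheduled, by resource count only."""
    if gpus_per_pod < 1:
        raise ValueError("gpus_per_pod must be >= 1")
    return advertised_gpus(physical_gpus, replicas) // gpus_per_pod


if __name__ == "__main__":
    # An 8-GPU node with replicas: 4 advertises 32 logical GPUs.
    print(advertised_gpus(8, 4))
```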
The DCGM exporter pod scrapes GPU metrics and exposes them on :9400/metrics for Prometheus.
    kubectl get svc -n gpu-operator nvidia-dcgm-exporter
    kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
    curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
    # Pod stuck Pending: almost always a scheduling/resource issue
    kubectl describe pod <name> | tail -30
    # Look for: "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"

    # Operator components crash-looping
    kubectl get pods -n gpu-operator
    kubectl logs -n gpu-operator <pod>
    kubectl logs -n gpu-operator <pod> --previous    # crashed container

    # Driver pod stuck: check kernel module compile
    kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

    # Toolkit not wired into containerd
    kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset
    # Verify the runtime config on the host:
    cat /etc/containerd/config.toml | grep -A5 nvidia

    # Validation pod runs at the end of install; its logs prove the stack works
    kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c nvidia-operator-validator
Slurm is the dominant HPC scheduler. The exam tests GPU job scheduling via GRES, MIG-aware scheduling, and queue troubleshooting.
    sinfo                                # partition + node state overview
    sinfo -N -l                          # per-node detail
    sinfo -o "%P %N %G %C %t"            # partition, nodes, GRES, CPUs, state
    squeue                               # current queue
    squeue -u alice                      # per user
    squeue --start                       # estimated start times
    scontrol show node nodeA100-01       # full node detail (GRES, state, reason)
    scontrol show job 12345              # full job detail
    scontrol show partition gpu
    sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqTRES,AllocTRES
    sacct -S 2026-05-01 -u alice -X      # accounting since a date
    sreport cluster utilization start=2026-05-01
    # Interactive single-GPU shell
    srun --gres=gpu:1 --pty bash

    # Interactive, target a specific GPU model
    srun --gres=gpu:a100:2 --pty bash
Batch script:
    #!/bin/bash
    #SBATCH --job-name=resnet
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:a100:4
    #SBATCH --cpus-per-task=16
    #SBATCH --mem=128G
    #SBATCH --time=04:00:00
    #SBATCH --output=resnet-%j.out

    module load cuda/12.4
    nvidia-smi
    srun python train.py
sbatch train.sbatch
Cancel and inspect:
    scancel 12345
    scancel -u alice                 # all jobs by alice
    scancel -t PENDING -u alice      # only pending
Two files matter — both on every compute node and the controller.
/etc/slurm/gres.conf (per-node, declares what hardware exists):
    # /etc/slurm/gres.conf on nodeA100-01
    Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-15
    Name=gpu Type=a100 File=/dev/nvidia1 Cores=0-15
    Name=gpu Type=a100 File=/dev/nvidia2 Cores=16-31
    Name=gpu Type=a100 File=/dev/nvidia3 Cores=16-31
/etc/slurm/slurm.conf (cluster-wide, declares GRES is in use):
    GresTypes=gpu,mig
    NodeName=nodeA100-[01-08] Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
    PartitionName=gpu Nodes=nodeA100-[01-08] Default=YES MaxTime=24:00:00 State=UP
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    AccountingStorageTRES=gres/gpu
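A mismatch between gres.conf and the Gres= declaration in slurm.conf is a common source of nodes that never accept typed GPU requests. The sketch below cross-checks the two for one node. It is a simplified parser written for this guide, not a Slurm tool: it assumes one device per File= line (no bracketed ranges such as /dev/nvidia[0-3]), treats a two-part entry like gpu:4 as name:count, and ignores AutoDetect. All function names are hypothetical.

```python
import re
from collections import Counter

# Sample inputs mirroring the configuration fragments in this section.
GRES_CONF = """\
# /etc/slurm/gres.conf on nodeA100-01
Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia1 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia2 Cores=16-31
Name=gpu Type=a100 File=/dev/nvidia3 Cores=16-31
"""
NODE_LINE = "NodeName=nodeA100-[01-08] Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN"


def count_gres_conf(text):
    """Count devices per (name, type) declared in gres.conf text."""
    counts = Counter()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "Name" in fields and "File" in fields:
            counts[(fields["Name"], fields.get("Type", ""))] += 1
    return counts


def parse_node_gres(node_line):
    """Parse Gres=name[:type]:count[,...] from a slurm.conf NodeName line."""
    counts = Counter()
    match = re.search(r"\bGres=(\S+)", node_line)
    if not match:
        return counts
    for item in match.group(1).split(","):
        parts = item.split(":")
        if len(parts) == 3:
            name, gtype, num = parts
        elif len(parts) == 2:
            name, gtype, num = parts[0], "", parts[1]
        else:
            name, gtype, num = parts[0], "", "1"
        counts[(name, gtype)] += int(num)
    return counts


def mismatches(gres_text, node_line):
    """Return {(name, type): (expected, declared)} where the two files disagree."""
    declared = count_gres_conf(gres_text)
    expected = parse_node_gres(node_line)
    keys = set(declared) | set(expected)
    return {k: (expected[k], declared[k]) for k in keys if expected[k] != declared[k]}


if __name__ == "__main__":
    print(mismatches(GRES_CONF, NODE_LINE) or "gres.conf matches slurm.conf")
```

Dropping Type=a100 from the gres.conf lines makes the check report that slurm.conf expects four typed GPUs while gres.conf declares four untyped ones.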
Push and reload:
    # After editing slurm.conf / gres.conf on the controller, sync to nodes,
    # then reconfigure without restart if possible:
    scontrol reconfigure

    # Heavier changes (new partitions, GresTypes) need:
    systemctl restart slurmctld      # on controller
    systemctl restart slurmd         # on each node
When MIG is enabled, each slice should appear as its own GRES type so Slurm can schedule against it.
    # gres.conf: one line per slice
    Name=gpu Type=1g.10gb File=/dev/nvidia-caps/nvidia-cap21
    Name=gpu Type=1g.10gb File=/dev/nvidia-caps/nvidia-cap30
    Name=gpu Type=2g.20gb File=/dev/nvidia-caps/nvidia-cap39
NodeName=mig-node-01 Gres=gpu:1g.10gb:6,gpu:2g.20gb:1 ...
Submit against a slice:
srun --gres=gpu:1g.10gb:1 --pty nvidia-smi -L
The actual /dev/nvidia-caps/ paths come from ls /proc/driver/nvidia/capabilities/mig/ after MIG is enabled with nvidia-smi mig -cgi … -C.
    squeue --format="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
    # The last column is the REASON. Common ones:
    #   Resources           - waiting for nodes (normal queueing)
    #   Priority            - other higher-priority jobs ahead
    #   ReqNodeNotAvail     - requested node is down/drained
    #   AssocGrpGRESLimit   - hit a per-association GRES quota
    #   QOSMaxGRESPerUser   - QoS limit
    #   PartitionTimeLimit  - asked for more time than partition allows
    #   Dependency          - waiting on another job
    #   InvalidQOS          - QoS not granted to user
    #   ReqGresTypeNotAvail - asked for gpu:h100 when only a100 exists
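When triaging many pending jobs it helps to pull the reason out of each squeue line and map it to a next step. The sketch below assumes the eight-column format string shown in this section, job names without spaces, and squeue's habit of wrapping the pending reason in parentheses. The sample lines and the hint wording are illustrative, and both function names are hypothetical.

```python
# Next-step hints keyed by pending reason (wording is this guide's, not Slurm's).
HINTS = {
    "Resources": "normal queueing; wait or check sinfo for free nodes",
    "Priority": "higher-priority jobs are ahead; check squeue --start",
    "ReqNodeNotAvail": "requested node is down or drained; check sinfo -R",
    "AssocGrpGRESLimit": "association GRES quota reached; check sacctmgr show assoc",
    "QOSMaxGRESPerUser": "QoS per-user GRES limit reached; check sacctmgr show qos",
    "PartitionTimeLimit": "requested time exceeds the partition MaxTime",
    "Dependency": "waiting on another job to finish",
    "InvalidQOS": "QoS not granted to this user",
}


def pending_reason(squeue_line):
    """Return the pending reason from one squeue line, or None if not pending."""
    cols = squeue_line.split(None, 7)  # 8 columns; the last may contain spaces
    if len(cols) < 8 or cols[4] != "PD":
        return None
    reason = cols[7].strip().strip("()")
    # Reasons such as "ReqNodeNotAvail, UnavailableNodes:..." carry extra detail.
    return reason.split(",")[0].strip()


def next_step(reason):
    """Map a pending reason to a suggested follow-up."""
    return HINTS.get(reason, "unrecognized reason; run scontrol show job <id>")


if __name__ == "__main__":
    sample = "     12345       gpu               resnet    alice PD       0:00      1 (Resources)"
    reason = pending_reason(sample)
    print(reason, "->", next_step(reason))
```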
When a node is DRAIN or DOWN:
    sinfo -R                                                       # nodes in drain + the reason
    scontrol show node nodeA100-03 | grep -E "State|Reason"
    scontrol update NodeName=nodeA100-03 State=RESUME              # bring it back after fixing
    scontrol update NodeName=nodeA100-03 State=DRAIN Reason="bad GPU"
A 4-GPU H100 job sits in PD with reason Resources forever, even though sinfo shows idle H100 nodes. Diagnose.
    # 1. Confirm the job actually requested the right TYPE
    scontrol show job <id> | grep -E "TresPerNode|Gres"

    # 2. Confirm Slurm knows the nodes have h100 GRES (not just "gpu")
    scontrol show node nodeH100-01 | grep -E "Gres|CfgTRES"
    # If you see Gres=gpu:4 (no type), gres.conf is missing Type=h100

    # 3. Check for QoS / association limits eating the job
    sacctmgr show assoc user=$USER format=Account,User,QOS,GrpTRES,MaxTRES
    sacctmgr show qos format=Name,GrpTRES,MaxTRESPerUser

    # 4. Check for reservations blocking the partition
    scontrol show reservation
The exam puts you in front of a broken cluster and asks for the next command. Build a mental flowchart: GPU → driver → container runtime → scheduler → network.
    # Is the hardware visible?
    lspci | grep -i nvidia

    # Is the driver loaded?
    lsmod | grep nvidia
    nvidia-smi                       # if this fails, no driver / wrong driver

    # Kernel messages: ECC errors, Xid events, fallen-off-bus, thermal
    dmesg -T | grep -iE "nvidia|nvrm|xid"
    journalctl -k --since "1 hour ago" | grep -i nvidia

    # Xid codes you should know on sight:
    #   13    - Graphics Engine Exception (often app bug or bad memory)
    #   31    - GPU memory page fault (illegal address in CUDA code)
    #   43    - Reset channel verif error
    #   48    - Double-bit ECC error (uncorrectable, GPU usually needs reset)
    #   63/64 - Row-remapper recording/failure (HBM page retirement)
    #   74    - NVLink error
    #   79    - GPU fell off the bus (PCIe/power, often hardware)
    #   92    - High single-bit ECC (correctable but worth watching)
    #   94/95 - Contained/uncontained ECC error
    #   119   - GSP RPC timeout

    nvidia-smi -q -d ECC,PAGE_RETIREMENT,ROW_REMAPPER

    # Generate a full bug report (attach to support tickets)
    nvidia-bug-report.sh
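Scanning kernel logs for Xid events by eye gets tedious on a busy node. The sketch below extracts the device and code from lines in the usual "NVRM: Xid (PCI:...): <code>, ..." form and attaches a short meaning taken from the list in this section. The sample log line is made up for illustration, and parse_xid is a hypothetical helper, not an NVIDIA tool.

```python
import re

# Matches kernel log lines such as: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ...
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")

# Short meanings for the codes listed in this guide.
XID_MEANING = {
    13: "Graphics Engine Exception",
    31: "GPU memory page fault",
    43: "Reset channel verification error",
    48: "Double-bit ECC error",
    63: "Row-remapper / page retirement recording event",
    64: "Row-remapper / page retirement failure",
    74: "NVLink error",
    79: "GPU has fallen off the bus",
    92: "High single-bit ECC error rate",
    94: "Contained ECC error",
    95: "Uncontained ECC error",
    119: "GSP RPC timeout",
}


def parse_xid(line):
    """Return (device, code, meaning) for an Xid log line, or None if absent."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    meaning = XID_MEANING.get(code, "not in this table; see NVIDIA's Xid documentation")
    return match.group("dev"), code, meaning


if __name__ == "__main__":
    # Illustrative line, not captured from a real system.
    sample = "[Mon May  4 03:12:45 2026] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
    print(parse_xid(sample))
```

Feeding it the output of dmesg -T line by line gives a quick per-device tally of which codes are firing.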
dcgmi diag is the canonical “is this GPU healthy” tool — run it before blaming software.
    dcgmi diag -r 1      # ~seconds, software checks
    dcgmi diag -r 2      # ~2 min, includes targeted stress
    dcgmi diag -r 3      # ~30 min, sustained workloads (memory bw, SM stress, NVLink)
    dcgmi diag -r 4      # extended, hours

    # Per-group policy alerts (set once, leave running)
    # -P = max power (W), -T = max temperature (C), -M = max retired pages
    dcgmi policy -g 0 --set 0,0 -P 250 -T 95 -M 5    # flag power>250W, temp>95C, retired pages>5
    dcgmi policy --get -g 0
If dcgmi diag flags a GPU, capture the JSON:
dcgmi diag -r 3 -j > diag.json
    # Does the toolkit see the GPU?
    nvidia-container-cli info
    nvidia-container-cli list

    # Containerd config wiring
    crictl info | jq '.config.containerd.runtimes'
    cat /etc/containerd/config.toml | grep -A3 nvidia

    # Run a known-good test container outside of k8s to isolate
    ctr run --rm --gpus 0 \
        docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
For Docker:
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If this fails but nvidia-smi on the host works, the container toolkit (not the driver) is the problem.
When a job runs but is slow, walk the stack:
    # 1. Is the GPU actually busy?
    nvidia-smi dmon -s u -d 1
    # sm/mem util both low -> CPU/IO bound; mem high, sm low -> memory bound

    # 2. Are clocks being throttled? Look at the THROTTLE reasons.
    nvidia-smi -q -d PERFORMANCE
    # Watch for: HW Slowdown=Active, SW Thermal Slowdown, Power Brake

    # 3. Power and thermal headroom
    nvidia-smi --query-gpu=power.draw,power.limit,temperature.gpu --format=csv -l 1

    # 4. Memory pressure / page retirement
    nvidia-smi -q -d PAGE_RETIREMENT,ROW_REMAPPER,ECC

    # 5. NVLink / PCIe link health
    nvidia-smi nvlink -s                       # state of each link
    nvidia-smi nvlink -e                       # error counters
    nvidia-smi -q -d PIDS,CLOCK                # processes and current clocks
    lspci -s <bdf> -vvv | grep -i "lnksta"     # PCIe gen + width: is it at advertised speed?

    # 6. Topology: wrong GPU-to-NIC binding kills multinode perf
    nvidia-smi topo -m
    ibstat                                     # IB ports, rates
    ibdev2netdev                               # IB device <-> netdev mapping
    ibstat                       # link state, rate, port GUID
    ibstatus
    iblinkinfo                   # full fabric link summary
    ibping -S                    # on one node: run as server
    ibping -G <guid>             # on another node: ping the server by port GUID
    perfquery                    # port counters; rerun for deltas
    ibdiagnet                    # subnet diagnosis (run on a single node)

    # NCCL: almost every multi-GPU AI failure shows up here
    NCCL_DEBUG=INFO mpirun -np 8 ./my_nccl_test
    # Important env vars to know:
    #   NCCL_IB_HCA         - which HCAs to use
    #   NCCL_SOCKET_IFNAME  - which Ethernet IFs to allow
    #   NCCL_P2P_DISABLE    - disable peer-to-peer (debug only)
    #   NCCL_IB_DISABLE     - force TCP fallback (debug only)
    #   NCCL_TOPO_DUMP_FILE - dump the topology NCCL discovered

    # Standard health test
    /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
    # K8s: pod won't schedule on GPU node
    kubectl describe pod <name>                          # bottom Events section is the answer
    kubectl get events -A --sort-by='.lastTimestamp' | tail -30

    # Slurm: queue full but nodes "look" idle
    sinfo -R                                             # any drained nodes?
    scontrol show node <node> | grep -E "Reason|State|Gres"
    sdiag                                                # scheduler internals: cycle times, backlog
    cmsh
    [head]% monitoring
    [head->monitoring]% measurable list        # available metrics
    [head->monitoring]% healthcheck list       # built-in checks (mounts, gpu, ib, ...)
    [head->monitoring]% latesthealthdata       # current health snapshot
    [head]% device latesthealthdata -c gpu-nodes
    [head]% events                             # cluster-wide event stream
| Symptom | First command | Then |
|---|---|---|
| nvidia-smi: command not found | which nvidia-smi | Is driver installed? |
| "No devices were found" | lsmod \| grep nvidia | dmesg \| grep -i nvidia |
| GPU shows in lspci, not nvidia-smi | dmesg -T \| grep Xid | Driver / module / fallen off bus |
| Container can't see GPU | nvidia-container-cli info | Toolkit + runtime config |
| Pod Pending | kubectl describe pod | Resources / labels / taints |
| Slurm job PD forever | squeue + scontrol show job | GRES type? QoS? Reservation? |
| Job runs but slow | nvidia-smi dmon | Throttling reasons + topology |
| Multi-node training slow | NCCL_DEBUG=INFO | IB link state, topo, GPU↔NIC |
| ECC errors / Xid 48/63/64 | nvidia-smi -q -d ECC | Drain node, page retirement |
Docker isn't a standalone domain on the NCP-AIO blueprint, but it's woven through Installation/Deployment and Troubleshooting. The exam treats Docker as the substrate — what gets tested is “does this container have GPU access?” and “how do I diagnose it when something's wrong?” Know the runtime layer for standalone GPU servers (Docker) and for Kubernetes nodes (containerd).
The NVIDIA Container Toolkit provides the nvidia runtime shim. Register it with each container runtime:

    # Configure Docker to use the nvidia runtime
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker

    # Configure containerd (for Kubernetes nodes)
    sudo nvidia-ctk runtime configure --runtime=containerd
    sudo systemctl restart containerd

    # Verify the runtime is wired in
    docker info | grep -i runtime
    cat /etc/containerd/config.toml | grep -A5 nvidia
Pulling NVIDIA-curated containers (PyTorch, TensorFlow, CUDA, RAPIDS, NeMo, Triton) from nvcr.io requires auth. This is the classic trip-up question on the exam.
    docker login nvcr.io
    # Username: $oauthtoken           <-- LITERAL string, with the dollar sign
    # Password: <your NGC API key>    <-- from ngc.nvidia.com

    # Or non-interactive (for scripts / cmsh image builds)
    echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

    # Then pull
    docker pull nvcr.io/nvidia/pytorch:24.10-py3
    docker pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    docker pull nvcr.io/nvidia/tensorrt:24.10-py3
The username is literally $oauthtoken as a string — not your NVIDIA email, not a shell variable. Memorize this.
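The same literal username shows up when credentials are written to a file instead of typed at a prompt. Docker's config.json (when no credential helper is in use) and Kubernetes image pull secrets both store an auth value that is the base64 encoding of username:password. The sketch below builds that structure for nvcr.io. The function name is hypothetical and "example-key" is a placeholder, not a real NGC API key.

```python
import base64
import json


def ngc_docker_auth(api_key: str) -> dict:
    """Build a Docker-style auths entry for nvcr.io from an NGC API key."""
    username = "$oauthtoken"  # literal string; Python does not expand the "$"
    token = base64.b64encode(f"{username}:{api_key}".encode("utf-8")).decode("ascii")
    return {"auths": {"nvcr.io": {"auth": token}}}


if __name__ == "__main__":
    # Placeholder key for illustration only.
    print(json.dumps(ngc_docker_auth("example-key"), indent=2))
```

In a shell, the equivalent pitfall is quoting: '$oauthtoken' in single quotes stays literal, while "$oauthtoken" in double quotes is expanded as a (probably empty) variable.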
    # Verify GPU access from a container (the smoke test)
    docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

    # GPU selection flags: know all the forms
    docker run --gpus all ...                        # all GPUs
    docker run --gpus 2 ...                          # any 2 GPUs
    docker run --gpus '"device=0,2"' ...             # specific GPU indices
    docker run --gpus '"device=GPU-abc123..."' ...   # by UUID
    docker run --gpus '"device=MIG-d1..."' ...       # by MIG slice UUID

    # Pass through capabilities and environment
    docker run --gpus all \
        --env NVIDIA_VISIBLE_DEVICES=all \
        --env NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        -v /data:/workspace/data \
        --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
        nvcr.io/nvidia/pytorch:24.10-py3
The --ipc=host and --ulimit flags come up because PyTorch/NCCL multi-process workloads need shared memory and unlocked memory limits to work right.
    # Inventory
    docker ps                                        # running containers
    docker ps -a                                     # all containers, including stopped/exited
    docker images                                    # local image cache
    docker system df                                 # disk usage by images/containers/volumes

    # Inspection
    docker logs <container>                          # stdout/stderr
    docker logs -f <container>                       # follow
    docker logs --tail 100 <container>               # last N lines
    docker logs --since 10m <container>              # time-bounded
    docker stats                                     # live CPU/mem/net/IO per container
    docker stats --no-stream <container>             # one-shot snapshot
    docker inspect <container>                       # full JSON config + state
    docker inspect <container> | jq '.[0].State'     # exit code, OOMKilled, etc.
    docker top <container>                           # processes inside the container
    docker exec -it <container> bash                 # shell into it

    # Daemon-level
    docker events                                    # live event stream from the daemon
    journalctl -u docker -f                          # daemon logs via systemd
    journalctl -u containerd -f                      # containerd logs (k8s nodes)
| Error | Cause |
|---|---|
| could not select device driver "" with capabilities: [[gpu]] | NVIDIA Container Toolkit not installed/configured, or no GPU on host |
| Failed to initialize NVML: Driver/library version mismatch | Host driver was updated, container has stale libnvidia-ml.so loaded |
| nvidia-container-cli: initialization error: nvml error: driver/library version mismatch | Kernel module out of sync with userspace libraries; reboot the host |
| OCI runtime create failed: … unknown capability "nvidia" | Runtime not registered with Docker; rerun nvidia-ctk runtime configure |
| Container starts but nvidia-smi says No devices were found | NVIDIA_VISIBLE_DEVICES was overridden to none or void |
| docker: Error response from daemon: pull access denied for nvcr.io/… | Not logged into NGC, or API key expired |
When a pod can't see a GPU on a node, the Docker-level commands don't apply — you're working with containerd and crictl.
    # Inspect what containerd sees
    crictl info | jq '.config.containerd.runtimes'
    crictl ps -a                         # all containers, including stopped
    crictl logs <container-id>
    crictl inspect <container-id>

    # Smoke-test the runtime directly without k8s in the middle
    ctr image pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    ctr run --rm --gpus 0 \
        nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
If ctr run --gpus works but a pod doesn't, the problem is in the device plugin or k8s config, not the runtime.
A user reports that a TensorFlow container they pulled from NGC starts but immediately exits with no logs when run with docker run --gpus all. Diagnose.
    # 1. Was there any output at all?
    docker logs <container-id>
    docker inspect <container-id> | jq '.[0].State'
    # Look at ExitCode and OOMKilled

    # 2. Was the toolkit even invoked?
    docker info | grep -i runtime
    # Should list nvidia among runtimes

    # 3. Can a minimal container see the GPU?
    docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

    # 4. If the minimal works but TF doesn't, it's app-level. Check the image's entrypoint.
    docker run --rm --gpus all -it --entrypoint bash nvcr.io/nvidia/tensorflow:<tag>
    # Inside, manually run the entrypoint to see what fails

    # 5. Common gotchas:
    #   - Missing --ipc=host (silent crash on first NCCL call)
    #   - --memory limit below model size (OOMKilled, ExitCode=137)
    #   - Mounted dataset path doesn't exist inside the container
Run:ai is the Kubernetes-native GPU orchestration platform NVIDIA acquired in 2024. Full Run:ai requires a license, so lab access is harder to come by — but the conceptual model is highly testable. Memorize the hierarchy, quota model, and CLI cold; you'll get scenario questions even without ever clicking through the UI.
Note: the open-source KAI Scheduler is the scheduling engine extracted from Run:ai (Apache 2.0). You can install it on your BCM cluster's Kubernetes to practice the scheduling concepts (queues, fair-share, gang scheduling) without a Run:ai license.
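The hierarchy, top to bottom: Cluster, Department, Project, Workload. Quotas are assigned to departments and projects, and workloads are submitted into a project.
The quota model (deserved quota, over-quota use, and preemption) comes up constantly on the exam.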
Example: Project A has deserved=4 GPUs, Project B has deserved=4 GPUs, and the cluster has 12 GPUs total. If A is using 10 (4 deserved + 6 over-quota), only 2 GPUs are free. When B suddenly asks for its full 4, Run:ai preempts 2 GPUs' worth of A's over-quota workloads to give B what it's owed.
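The reclaim arithmetic can be written down in a few lines. This is a deliberately simplified model for study purposes, not Run:ai's scheduler: it ignores priorities, node pools, and fairness weights, and only answers "how many GPUs must be taken back from over-quota users to satisfy an in-quota request". The project names and numbers are hypothetical.

```python
def gpus_to_reclaim(total_gpus: int, in_use: dict, request: int) -> int:
    """GPUs that must be preempted so an in-quota request of `request` fits."""
    free = total_gpus - sum(in_use.values())
    return max(0, request - free)


def over_quota(in_use: dict, deserved: dict) -> dict:
    """GPUs each project holds beyond its deserved quota (the reclaimable pool)."""
    return {proj: max(0, used - deserved.get(proj, 0)) for proj, used in in_use.items()}


if __name__ == "__main__":
    # Hypothetical 8-GPU cluster: team-x holds 7 (4 deserved + 3 over-quota).
    deserved = {"team-x": 4, "team-y": 4}
    in_use = {"team-x": 7, "team-y": 0}
    print("over-quota:", over_quota(in_use, deserved))
    # team-y asks for its deserved 4; only 1 GPU is free, so 3 must be reclaimed.
    print("to reclaim:", gpus_to_reclaim(8, in_use, request=4))
```

A request is only satisfiable by preemption if the reclaim count does not exceed the total over-quota pool, which is the check an exam scenario usually hinges on.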
| Type | Use case | Priority | Preemptible |
|---|---|---|---|
| Workspace | Interactive dev (Jupyter, VSCode) | Low | Yes |
| Training | Batch training jobs | Medium | Yes |
| Distributed Training | Multi-node PyTorch/TF/MPI/XGBoost | Medium | Yes (gang) |
| Inference | Model serving with autoscaling | High | No |
Priority order, low → high: Workspace < Training < Inference. Higher-priority workloads preempt lower ones when GPUs are scarce.
Gang scheduling for distributed training: all worker pods start together or none do. No partial-start states.
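The all-or-nothing rule is easy to see in a toy placement function. The sketch below uses a naive first-fit over nodes sorted by name, which is an assumption made for brevity and not how Run:ai or KAI actually picks nodes; the point is only that the function returns a full placement or nothing. The function and node names are hypothetical.

```python
def gang_schedule(workers_needed: int, gpus_per_worker: int, free_gpus_per_node: dict):
    """Place every worker or none: return a list of node names, or None."""
    free = dict(free_gpus_per_node)  # work on a copy; commit only on success
    placement = []
    for _ in range(workers_needed):
        node = next(
            (name for name, gpus in sorted(free.items()) if gpus >= gpus_per_worker),
            None,
        )
        if node is None:
            return None  # one worker cannot fit, so no worker starts
        free[node] -= gpus_per_worker
        placement.append(node)
    return placement


if __name__ == "__main__":
    print(gang_schedule(4, 1, {"node-a": 2, "node-b": 2}))  # full placement
    print(gang_schedule(4, 1, {"node-a": 2, "node-b": 1}))  # None: only 3 GPUs free
```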
    # Authentication and context
    runai login
    runai config project <project-name>
    runai list projects
    runai list clusters

    # Submitting: know the variants
    runai submit-workspace <name> --image jupyter/base-notebook --gpu 1
    runai submit <name> --image pytorch/pytorch --gpu 2 \
        --command -- python train.py
    runai submit-dist pytorch <name> --workers 4 --gpu 1 \
        --image nvcr.io/nvidia/pytorch:24.10-py3

    # Fractional GPU: three different ways
    runai submit <name> --gpu 1                  # whole GPU
    runai submit <name> --gpu-memory 4G          # fraction by memory (4 GB)
    runai submit <name> --gpu-fraction 0.5       # fraction by share (50%)

    # Targeting hardware
    runai submit <name> --node-pools "h100-pool"
    runai submit <name> --node-type "dgx-a100"

    # Inspecting
    runai list jobs
    runai describe job <name>
    runai logs <name>
    runai logs -f <name>
    runai bash <name>                            # exec into the pod
    runai port-forward <name> --port 8888

    # Lifecycle
    runai delete job <name>
    runai suspend job <name>
    runai resume job <name>
The difference between --gpu 1, --gpu-memory 4G, and --gpu-fraction 0.5 is a classic exam question — whole GPU vs fraction by memory vs fraction by share.
| Property | MIG | Run:ai Fractional GPU |
|---|---|---|
| Isolation | Hardware (memory, SMs, L2) | Software (runtime-enforced) |
| Supported GPUs | A100, A30, H100, H200, B200 | Any NVIDIA GPU |
| Slice sizes | Fixed profiles (1g.10gb, etc.) | Arbitrary (GB or fractional share) |
| Reconfiguration | Requires draining the GPU | Per-workload, no GPU reset |
| Memory protection | Hardware-enforced | Process-level via runtime |
If a question asks “tenant needs hardware-isolated GPU partitioning across an A100” → MIG. If it asks “team needs to run lots of small Jupyter notebooks on a V100/T4/L4” → Run:ai Fractional GPU (those GPUs don't support MIG).
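The decision rule can be condensed into a small chooser. The MIG-capable set below is copied from the comparison table in this section, and the function is a hypothetical study aid rather than anything Run:ai or NVIDIA ships; it only encodes "hardware isolation requires MIG, and MIG requires a supported GPU".

```python
# GPU models listed as MIG-capable in the comparison table.
MIG_CAPABLE = {"A100", "A30", "H100", "H200", "B200"}


def sharing_options(gpu_model: str, need_hardware_isolation: bool) -> list:
    """Return the GPU-sharing mechanisms that fit the requirement."""
    model = gpu_model.upper()
    if need_hardware_isolation:
        # Only MIG gives hardware-enforced memory and SM isolation.
        return ["MIG"] if model in MIG_CAPABLE else []
    options = ["Run:ai fractional GPU"]
    if model in MIG_CAPABLE:
        options.append("MIG")
    return options


if __name__ == "__main__":
    print(sharing_options("A100", need_hardware_isolation=True))
    print(sharing_options("T4", need_hardware_isolation=False))
```

An empty list is the interesting case: a tenant that needs hardware isolation on a V100, T4, or L4 cannot get it by partitioning and needs whole GPUs instead.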
Node pools segment the cluster by hardware capability.
    # Typical pools in a heterogeneous cluster:
    #   a100-pool - A100 nodes, MIG-enabled
    #   h100-pool - H100 nodes
    #   l40-pool  - inference-grade GPUs
    #   cpu-pool  - CPU-only nodes for data prep

    # Projects are granted access to specific pools, with separate quotas per pool.
    # Workloads target pools at submission time.
    runai submit train-large --node-pools "h100-pool" --gpu 8
    runai submit prep-data --node-pools "cpu-pool" --cpu 16
Integrations to recognize: SSO via SAML/OIDC, LDAP, Kubeflow, MLflow, Argo Workflows, Ray, OpenShift.
Since full Run:ai requires a license, the open-source KAI Scheduler is the practical way to drill the scheduling concepts on your own cluster.
    # Install on an existing Kubernetes cluster
    helm repo add nvidia-kai-scheduler https://nvidia.github.io/KAI-Scheduler
    helm install kai-scheduler nvidia-kai-scheduler/kai-scheduler \
        -n kai-scheduler --create-namespace

    # Verify
    kubectl get pods -n kai-scheduler
What you can practice with KAI alone (no Run:ai license needed): queues, fair-share, and gang scheduling.
A data science team complains their workspaces keep getting killed at 3 AM when training jobs run. Diagnose.
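This is expected behavior, not a bug. Workspaces are low-priority and preemptible; training jobs are higher priority. When a training job submitted by another project wants GPUs that the data science project is using only as over-quota, Run:ai preempts the workspace. The remedy follows from the cause: keep the workspaces inside the project's deserved quota (raise the quota or request fewer GPUs) so they are not running on reclaimable over-quota GPUs.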
- cmsh: know the modes device, category, softwareimage, user, monitoring, and commit.
- BCM imaging: know the difference between imageupdate (live sync) and reinstall (full PXE); that's a classic question.
- MIG: know the strategies (single vs mixed) and how to request slices in both Kubernetes (nvidia.com/mig-1g.10gb) and Slurm (--gres=gpu:1g.10gb:1).
- DCGM: know the dcgmi diag -r {1,2,3,4} levels; the time and depth differences come up.
- Slurm pending reasons: Resources, Priority, ReqNodeNotAvail, AssocGrpGRESLimit, QOSMaxGRESPerUser, ReqGresTypeNotAvail.
- NCCL environment variables: NCCL_DEBUG, NCCL_IB_HCA, NCCL_SOCKET_IFNAME.
- NGC login: the username is $oauthtoken, the password is the NGC API key. This trips people up every exam cycle.
- Docker --gpus flag variants: all, 2, "device=0,2", by UUID, by MIG UUID.
- Run:ai GPU requests: --gpu 1 (whole), --gpu-memory 4G (fraction by memory), --gpu-fraction 0.5 (fraction by share).

Good luck: build the labs, break them on purpose, fix them, repeat.