====== NCP-AIO Lab Walkthrough Study Guide ======
A hands-on command reference for the four major lab domains on the NVIDIA Certified Professional — AI Operations exam. Work through each domain as a sequence: review the concept, run the commands, then test yourself with the troubleshooting scenarios at the end of each section.
----
===== 1. Base Command Manager (BCM) & Cluster Administration =====
BCM (formerly Bright Cluster Manager) is NVIDIA's central control plane for HPC/AI clusters. The two interfaces you must know are **''cmsh''** (the cluster management shell) and **Base View** (the web UI). The exam leans heavily on ''cmsh''.
==== 1.1 Getting into BCM ====
# SSH to the head node, then drop into the management shell
ssh root@head-node
cmsh
# Inside cmsh, get help and orient yourself
[head]% help
[head]% main # return to the top-level prompt
[head]% status # overall cluster status (services, license, HA)
[head]% partition use base
cmsh uses **modes**. You enter a mode (''device'', ''category'', ''softwareimage'', ''user'', etc.), then operate on objects inside it. After making changes, you **must** ''commit'' — pending changes show with an asterisk in the prompt.
==== 1.2 Deploying a system (provisioning a compute node) ====
The typical provisioning flow: pick or create a **software image**, attach it to a **category**, assign nodes to that category, then power them on so they PXE-boot and provision.
cmsh
[head]% softwareimage
[head->softwareimage]% list
[head->softwareimage]% clone default-image gpu-image
[head->softwareimage]% use gpu-image
[head->softwareimage[gpu-image]]% set kernelversion 5.15.0-...
[head->softwareimage[gpu-image]]% commit
# Create a category that uses this image
[head]% category
[head->category]% add gpu-nodes
[head->category[gpu-nodes*]]% set softwareimage gpu-image
[head->category[gpu-nodes*]]% commit
# Assign a node to the category and provision it
[head]% device
[head->device]% use node001
[head->device[node001]]% set category gpu-nodes
[head->device[node001]]% commit
[head->device[node001]]% power on # or: reset, off
[head->device[node001]]% status # watch state: INSTALLING -> UP
# Useful one-liners (run from cmsh prompt)
[head]% device list # all devices and state
[head]% device status -c gpu-nodes # status filtered by category
[head]% device foreach -c gpu-nodes (status)
Outside cmsh, the equivalent inspection commands include:
cmha status # HA status if head nodes are paired
module load shared # load the BCM environment module
pdsh -g category=gpu-nodes uptime # parallel ssh across a category
==== 1.3 Aligning baseline images ====
"Aligning" means making sure the running nodes match the software image on the head node. After you edit packages or files in the image, you sync them out.
# Update packages inside an image (chroot-style)
cm-chroot-sw-img /cm/images/gpu-image
# ... inside the chroot ...
yum install -y datacenter-gpu-manager
exit
# Push the image to running nodes — choose ONE depending on the situation
[head]% device use node001
[head->device[node001]]% imageupdate -w # incremental sync (live); without -w cmsh only performs a dry run
[head->device[node001]]% reinstall # full PXE reprovision
# Verify nodes are in sync with the image
[head]% device imageupdate -c gpu-nodes # dry run by default (no -w): shows what would change, review with synclog
Know the difference cold for the exam: **''imageupdate''** is a live rsync of the image to a running node (fast, but not all changes apply without reboot). **''reinstall''** wipes and PXE-provisions from scratch (clean but slow).
==== 1.4 Firmware updates ====
# Check current firmware across a category
[head]% device foreach -c gpu-nodes (get bmcsettings)
# BCM ships cm-update-firmware for BMC/BIOS rollouts
cm-update-firmware --help
cm-update-firmware --category gpu-nodes --firmware-package /root/fw/bios-1.2.bin
# GPU firmware (e.g., VBIOS) is usually handled by nvidia-smi or vendor tools
nvidia-smi -q | grep -i "vbios"
# Vendor flashing is typically done in a maintenance window with nodes drained
==== 1.5 User and permission management ====
[head]% user
[head->user]% add alice
[head->user[alice*]]% set commonname "Alice Researcher"
[head->user[alice*]]% set groupname researchers
[head->user[alice*]]% set password # interactive
[head->user[alice*]]% commit
# Profiles / RBAC — assign what a user can do in BCM itself
[head]% profile list
[head]% profile use readonly
[head->profile[readonly]]% show
[head]% user use alice
[head->user[alice]]% set profile readonly
[head->user[alice]]% commit
Built-in profiles to remember: **admin**, **readonly**, **portal**, **cloudjob**. You can clone and customize them with ''profile clone admin custom-admin''.
==== 1.6 Reading baseline metrics with nvidia-smi and dcgmi ====
''nvidia-smi'' is the per-node quick look. DCGM (Data Center GPU Manager) is the cluster-grade tool: its ''dcgmi'' CLI talks to the ''nv-hostengine'' service and supports persistent health checks, job-level stats, and policy.
# nvidia-smi — fast situational awareness
nvidia-smi # default table
nvidia-smi -q # exhaustive details
nvidia-smi -q -d TEMPERATURE,POWER,CLOCK,ECC # only the sections you need
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,temperature.gpu \
--format=csv -l 1 # CSV, refresh every 1s
nvidia-smi dmon -s pucvmet -d 1 # device monitor: p=power/temp, u=util, c=clocks, v=violations, m=memory, e=ECC/PCIe replay errors, t=PCIe throughput
nvidia-smi pmon -c 5 # per-process, 5 samples
nvidia-smi topo -m # GPU/NIC topology matrix (NVLink, PIX, SYS)
nvidia-smi -pm 1 # persistence mode on (keep driver loaded)
nvidia-smi -i 0 -ac 1215,1410 # set memory,graphics application clocks
# dcgmi — cluster-grade health and diagnostics
systemctl status nvidia-dcgm # the host engine must be running
dcgmi discovery -l # list GPUs + entity IDs
dcgmi group -c mygroup --default # create a group with all GPUs
dcgmi group -l # list groups
dcgmi health -g GROUP_ID -s mpi # set health watches: m=memory, p=PCIe, i=InfoROM (also t=thermal/power, n=NVLink, a=all)
dcgmi health -g GROUP_ID -c # check current health
dcgmi diag -r 1 # quick diagnostic (~seconds)
dcgmi diag -r 2 # medium (~2 min)
dcgmi diag -r 3 # long (~15-30 min, real workload-like)
dcgmi diag -r 4 # extra long, stress-test grade
dcgmi dmon -e 203,204,250,252 -d 1000 # monitor specific field IDs every 1000ms
dcgmi stats -g GROUP_ID -e # enable job stats collection for the group
dcgmi stats -g GROUP_ID -s myjob # start tracking job "myjob"
dcgmi stats -x myjob # stop tracking
dcgmi stats -j myjob # print the job summary (add -v for per-GPU detail)
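Field IDs worth memorizing: **100** SM clock, **101** memory clock, **150** GPU temperature, **155** power usage, **203** GPU utilization, **204** memory copy utilization, **250** framebuffer total, **251** framebuffer free, **252** framebuffer used, **1001+** profiling metrics (DCGM Profiling: graphics engine active, SM active, tensor active, etc.).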
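The CSV query form of ''nvidia-smi'' shown earlier in this section is easy to post-process. The following is a minimal sketch, assuming the column order ''index,name,utilization.gpu,memory.used,temperature.gpu''; the helper name ''hot_gpus'' and the sample rows are hypothetical, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvidia-smi): print the index of every GPU
# whose temperature exceeds a threshold. Expects CSV rows on stdin in the
# column order index,name,utilization.gpu,memory.used,temperature.gpu.
hot_gpus() {
  awk -F', *' -v limit="$1" '$5 + 0 > limit + 0 { print $1 }'
}

# Made-up sample rows standing in for real nvidia-smi output
sample_csv='0, NVIDIA A100-SXM4-80GB, 97, 40536, 83
1, NVIDIA A100-SXM4-80GB, 12, 1024, 41'

# Prints the indices of GPUs hotter than 80 C
printf '%s\n' "$sample_csv" | hot_gpus 80
```

On a real node the same function could be fed by ''%%nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,temperature.gpu --format=csv,noheader,nounits%%''.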
==== 1.7 Practice scenario ====
> A user reports node005 is in the cluster but jobs aren't landing on it. Walk through the diagnosis.
cmsh
[head]% device use node005
[head->device[node005]]% status # is it UP / CLOSED / DOWN / INSTALLER_FAILED?
[head->device[node005]]% get category # right category?
[head->device[node005]]% get powerstatus
[head->device[node005]]% latesthealthdata # BCM health checks
[head->device[node005]]% events # recent events for this node
# If healthy in BCM but Slurm avoids it, check the scheduler (section 3)
----
===== 2. Kubernetes for AI Workloads =====
The NCP-AIO exam tests the NVIDIA GPU Operator stack: **Node Feature Discovery → Driver → Container Toolkit → Device Plugin → DCGM Exporter → MIG Manager → GPU Feature Discovery**. You should be able to install it, inspect each component, and request GPUs in a pod spec.
==== 2.1 Install the GPU Operator ====
# Prereqs: containerd or CRI-O, a running cluster, helm 3+, kubectl
kubectl create ns gpu-operator
kubectl label ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Most common install: let the operator manage drivers AND toolkit
helm install --wait gpu-operator nvidia/gpu-operator \
-n gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true
# If drivers are already on the host, disable the operator's driver
helm install --wait gpu-operator nvidia/gpu-operator \
-n gpu-operator --set driver.enabled=false
==== 2.2 Verify every layer is healthy ====
kubectl get pods -n gpu-operator # all components Running/Completed
kubectl get nodes -o json | jq '.items[].status.capacity' | grep nvidia.com
# Expect: "nvidia.com/gpu": "8" (or similar)
kubectl describe node | grep -A5 -i "nvidia.com"
kubectl get clusterpolicies -n gpu-operator -o yaml | less # the operator's CR
# Look at GPU Feature Discovery labels (they drive scheduling)
kubectl get node NODE_NAME -o json | jq '.metadata.labels' | grep nvidia
# Examples: nvidia.com/gpu.product=A100-SXM4-80GB
# nvidia.com/cuda.driver.major=535
# nvidia.com/mig.strategy=single
==== 2.3 Request GPUs in a Pod ====
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
kubectl apply -f cuda-smoke-test.yaml
kubectl logs cuda-smoke-test # should show the GPU table
==== 2.4 MIG with the GPU Operator ====
MIG is managed by the **MIG Manager** component. Switch a node into a profile by labeling it:
# List the available MIG configs from the configmap
kubectl get cm -n gpu-operator default-mig-parted-config -o yaml
# Apply a profile to a node — the MIG Manager will partition the GPU
kubectl label node NODE_NAME nvidia.com/mig.config=all-1g.10gb --overwrite
# Watch the MIG Manager pod do its work
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -f
# After it succeeds, the node advertises sliced resources
kubectl describe node | grep nvidia.com/mig
# e.g. nvidia.com/mig-1g.10gb: 7
In a pod, you request the sliced resource by name:
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
Two **MIG strategies** are set at install time; know both:
* **single** — node exposes one resource type, e.g. ''nvidia.com/gpu'' mapped to slices.
* **mixed** — node exposes each profile as its own resource (e.g. ''nvidia.com/mig-1g.10gb'', ''nvidia.com/mig-2g.20gb'').
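The two strategies differ only in the resource name a pod has to request. A minimal sketch of that mapping, using a hypothetical helper ''mig_resource_name'' that simply encodes the two bullets above:

```shell
#!/bin/sh
# Hypothetical helper: the resource name a pod would request for a MIG
# profile under each strategy described above.
mig_resource_name() {
  strategy="$1"
  profile="$2"
  case "$strategy" in
    single) echo "nvidia.com/gpu" ;;              # one generic resource type
    mixed)  echo "nvidia.com/mig-$profile" ;;     # one resource per profile
    *)      echo "unknown strategy: $strategy" >&2; return 1 ;;
  esac
}

mig_resource_name single 1g.10gb
mig_resource_name mixed 1g.10gb
```

The printed name is what goes under ''resources.limits'' in the pod spec.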
==== 2.5 Time-slicing (oversubscribe a GPU without MIG) ====
# ConfigMap consumed by the device plugin
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
namespace: gpu-operator
data:
any: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
kubectl apply -f time-slicing-config.yaml
# Tell the cluster policy to use it
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
Each physical GPU now advertises 4 logical GPUs — useful for inference and dev workloads but **not** isolated like MIG.
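The capacity arithmetic is worth having at your fingertips. A small sketch, assuming an 8-GPU node (the node size is an assumption; ''replicas: 4'' comes from the ConfigMap above) and a hypothetical helper name:

```shell
#!/bin/sh
# Advertised nvidia.com/gpu count = physical GPUs x time-slicing replicas.
# Time-slicing shares the whole GPU; it does not partition memory or SMs.
advertised_gpus() {
  physical="$1"
  replicas="$2"
  echo $(( physical * replicas ))
}

advertised_gpus 8 4   # assumed 8-GPU node with replicas: 4
```

Comparing this figure with the ''nvidia.com/gpu'' capacity reported by ''kubectl describe node'' is a quick way to confirm that the ConfigMap was picked up.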
==== 2.6 DCGM Exporter and metrics ====
The DCGM exporter pod scrapes GPU metrics and exposes them on ''%%:9400/metrics%%'' for Prometheus.
kubectl get svc -n gpu-operator nvidia-dcgm-exporter
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
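The exporter output is plain Prometheus text, so a per-GPU utilization list can be pulled out with ''sed''. This is a sketch under assumptions: the helper name ''gpu_util_from_metrics'' is hypothetical, and the sample lines (label names and values) are made up to resemble exporter output rather than copied from a real system.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text on stdin and print
# "<gpu index> <utilization>" for every DCGM_FI_DEV_GPU_UTIL sample.
gpu_util_from_metrics() {
  sed -n 's/^DCGM_FI_DEV_GPU_UTIL{.*gpu="\([0-9][0-9]*\)".*} \([0-9.][0-9.]*\)$/\1 \2/p'
}

# Made-up sample standing in for `curl localhost:9400/metrics`
cat <<'EOF' | gpu_util_from_metrics
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",device="nvidia0",Hostname="node001"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb",device="nvidia1",Hostname="node001"} 3
EOF
```

With the port-forward from above running, ''%%curl -s localhost:9400/metrics | gpu_util_from_metrics%%'' would give the same view of live data.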
==== 2.7 Troubleshooting checklist ====
# Pod stuck Pending — almost always a scheduling/resource issue
kubectl describe pod POD_NAME | tail -30
# Look for: "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"
# Operator components crash-looping
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator POD_NAME
kubectl logs -n gpu-operator POD_NAME --previous # crashed container
# Driver pod stuck — check kernel module compile
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
# Toolkit not wired into containerd
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset
# Verify the runtime config on the host:
cat /etc/containerd/config.toml | grep -A5 nvidia
# Validation pod runs at the end of install — its logs prove the stack works
kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c nvidia-operator-validator
----
===== 3. Slurm & Workload Management =====
Slurm is the dominant HPC scheduler. The exam tests GPU job scheduling via **GRES**, MIG-aware scheduling, and queue troubleshooting.
==== 3.1 Core commands ====
sinfo # partition + node state overview
sinfo -N -l # per-node detail
sinfo -o "%P %N %G %C %t" # partition, nodes, GRES, CPUs, state
squeue # current queue
squeue -u alice # per user
squeue --start # estimated start times
scontrol show node nodeA100-01 # full node detail (GRES, state, reason)
scontrol show job 12345 # full job detail
scontrol show partition gpu
sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqTRES,AllocTRES
sacct -S 2026-05-01 -u alice -X # accounting since a date
sreport cluster utilization start=2026-05-01
==== 3.2 Submitting GPU jobs ====
# Interactive single-GPU shell
srun --gres=gpu:1 --pty bash
# Interactive, target a specific GPU model
srun --gres=gpu:a100:2 --pty bash
Batch script:
#!/bin/bash
#SBATCH --job-name=resnet
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=04:00:00
#SBATCH --output=resnet-%j.out
module load cuda/12.4
nvidia-smi
srun python train.py
sbatch train.sbatch
Cancel and inspect:
scancel 12345
scancel -u alice # all jobs by alice
scancel -t PENDING -u alice # only pending
==== 3.3 Configuring GRES for GPUs ====
Two files matter — both on every compute node and the controller.
**''/etc/slurm/gres.conf''** (per-node, declares what hardware exists):
# /etc/slurm/gres.conf on nodeA100-01
Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia1 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia2 Cores=16-31
Name=gpu Type=a100 File=/dev/nvidia3 Cores=16-31
**''/etc/slurm/slurm.conf''** (cluster-wide, declares GRES is in use):
GresTypes=gpu,mig
NodeName=nodeA100-[01-08] Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
PartitionName=gpu Nodes=nodeA100-[01-08] Default=YES MaxTime=24:00:00 State=UP
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageTRES=gres/gpu
Push and reload:
# After editing slurm.conf / gres.conf on the controller, sync to nodes,
# then reconfigure without restart if possible:
scontrol reconfigure
# Heavier changes (new partitions, GresTypes) need:
systemctl restart slurmctld # on controller
systemctl restart slurmd # on each node
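A frequent cause of nodes going to an invalid or drained state after a reconfigure is a mismatch between the number of device lines in ''gres.conf'' and the count in the ''Gres='' field of ''slurm.conf''. The sketch below checks that consistency on the example values from this section; both helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count GPU device lines in a gres.conf read from stdin.
count_gres_gpus() {
  grep -c '^Name=gpu '
}

# Hypothetical helper: extract the count from a Gres= value such as gpu:a100:4
gres_count() {
  echo "${1##*:}"
}

# The gres.conf lines from this section
gres_conf='Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia1 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia2 Cores=16-31
Name=gpu Type=a100 File=/dev/nvidia3 Cores=16-31'

declared=$(printf '%s\n' "$gres_conf" | count_gres_gpus)
expected=$(gres_count "gpu:a100:4")

if [ "$declared" -eq "$expected" ]; then
  echo "consistent: $declared GPUs"
else
  echo "mismatch: gres.conf=$declared slurm.conf=$expected"
fi
```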
==== 3.4 MIG + Slurm ====
When MIG is enabled, each slice should appear as its own GRES type so Slurm can schedule against it.
# gres.conf — one line per slice
Name=gpu Type=1g.10gb File=/dev/nvidia-caps/nvidia-cap21
Name=gpu Type=1g.10gb File=/dev/nvidia-caps/nvidia-cap30
Name=gpu Type=2g.20gb File=/dev/nvidia-caps/nvidia-cap39
# slurm.conf: the matching node definition
NodeName=mig-node-01 Gres=gpu:1g.10gb:6,gpu:2g.20gb:1 ...
Submit against a slice:
srun --gres=gpu:1g.10gb:1 --pty nvidia-smi -L
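The minor numbers behind the ''/dev/nvidia-caps/'' paths are listed in ''/proc/driver/nvidia-caps/mig-minors'' (and in the per-instance ''access'' files under ''/proc/driver/nvidia/capabilities/'') once MIG mode is on (''nvidia-smi -mig 1'') and the instances have been created with ''nvidia-smi mig -cgi ... -C''.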
==== 3.5 Why is my job stuck? The reason codes ====
squeue --format="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
# The last column is the REASON. Common ones:
# Resources - waiting for nodes (normal queueing)
# Priority - other higher-priority jobs ahead
# ReqNodeNotAvail - requested node is down/drained
# AssocGrpGRES - hit a per-association GRES quota
# QOSMaxGRESPerUser - QoS limit
# PartitionTimeLimit - asked for more time than partition allows
# Dependency - waiting on another job
# InvalidQOS - QoS not granted to user
# ReqGresTypeNotAvail - asked for gpu:h100 when only a100 exists
When a node is DRAIN or DOWN:
sinfo -R # nodes in drain + the reason
scontrol show node nodeA100-03 | grep -E "State|Reason"
scontrol update NodeName=nodeA100-03 State=RESUME # bring it back after fixing
scontrol update NodeName=nodeA100-03 State=DRAIN Reason="bad GPU"
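For drilling, the reason codes can be turned into a small lookup. This is a study aid only, with a hypothetical function name; it covers a subset of the reasons listed above and is not a Slurm tool.

```shell
#!/bin/sh
# Hypothetical study aid: map a squeue REASON value to its meaning.
explain_reason() {
  case "$1" in
    Resources)          echo "waiting for nodes (normal queueing)" ;;
    Priority)           echo "higher-priority jobs are ahead" ;;
    ReqNodeNotAvail*)   echo "a requested node is down or drained" ;;
    Dependency)         echo "waiting on another job" ;;
    PartitionTimeLimit) echo "requested time exceeds the partition limit" ;;
    InvalidQOS)         echo "QoS not granted to the user" ;;
    *)                  echo "not in this table; check the squeue man page" ;;
  esac
}

explain_reason Priority
explain_reason "ReqNodeNotAvail, UnavailableNodes:nodeA100-03"
```

The glob on ''ReqNodeNotAvail'' is there because ''squeue'' commonly appends the list of unavailable nodes to that reason.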
==== 3.6 Practice scenario ====
> A 4-GPU H100 job sits in PD with reason ''Resources'' forever, even though ''sinfo'' shows idle H100 nodes. Diagnose.
# 1. Confirm the job actually requested the right TYPE
scontrol show job JOB_ID | grep -E "TresPerNode|Gres"
# 2. Confirm Slurm knows the nodes have h100 GRES (not just "gpu")
scontrol show node nodeH100-01 | grep -E "Gres|CfgTRES"
# If you see Gres=gpu:4 (no type), gres.conf is missing Type=h100
# 3. Check for QoS / association limits eating the job
sacctmgr show assoc user=$USER format=Account,User,QOS,GrpTRES,MaxTRES
sacctmgr show qos format=Name,GrpTRES,MaxTRESPerUser
# 4. Check for reservations blocking the partition
scontrol show reservation
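Step 2 is the one that most often explains this scenario. A minimal sketch of the typed-versus-untyped check, assuming the ''Gres='' value has already been cut out of ''scontrol show node''; the helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Gres= value carries a GPU type
# (gpu:h100:4), fail if it is untyped (gpu:4).
gres_is_typed() {
  gres="${1%%"("*}"        # drop a trailing socket hint such as (S:0-1)
  case "$gres" in
    gpu:*:*) return 0 ;;
    *)       return 1 ;;
  esac
}

for g in "gpu:4" "gpu:h100:4" "gpu:4(S:0-1)"; do
  if gres_is_typed "$g"; then
    echo "$g -> typed"
  else
    echo "$g -> untyped, a request for gpu:h100 cannot match"
  fi
done
```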
----
===== 4. System Troubleshooting & Optimization =====
The exam puts you in front of a broken cluster and asks for the **next command**. Build a mental flowchart: GPU → driver → container runtime → scheduler → network.
==== 4.1 GPU and driver layer ====
# Is the hardware visible?
lspci | grep -i nvidia
# Is the driver loaded?
lsmod | grep nvidia
nvidia-smi # if this fails, no driver / wrong driver
# Kernel messages — ECC errors, Xid events, fallen-off-bus, thermal
dmesg -T | grep -iE "nvidia|nvrm|xid"
journalctl -k --since "1 hour ago" | grep -i nvidia
# Xid codes you should know on sight:
# 13 - Graphics Engine Exception (often app bug or bad memory)
# 31 - GPU memory page fault (illegal address in CUDA code)
# 43 - Reset channel verif error
# 48 - Double-bit ECC error (uncorrectable, GPU usually needs reset)
# 63/64 - Row-remapper recording/failure (HBM page retirement)
# 74 - NVLink error
# 79 - GPU fell off the bus (PCIe/power — often hardware)
# 92 - High single-bit ECC (correctable but worth watching)
# 94/95 - Contained/uncontained ECC error
# 119 - GSP RPC timeout
nvidia-smi -q -d ECC,PAGE_RETIREMENT,ROW_REMAPPER
# Generate a full bug report (attach to support tickets)
nvidia-bug-report.sh
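When a node has a long kernel log, it helps to reduce it to the distinct Xid codes first and then look each one up in the list above. A sketch, assuming the usual ''NVRM: Xid (PCI:...): <code>, ...'' line shape; the helper name and the sample log lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct Xid codes found in kernel log
# lines read from stdin, sorted numerically.
xid_codes() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' | sort -nu
}

# Made-up sample standing in for `dmesg -T`
cat <<'EOF' | xid_codes
[Mon May  4 10:00:01 2026] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
[Mon May  4 10:00:05 2026] NVRM: Xid (PCI:0000:af:00): 48, pid=4321, An uncorrectable double bit error (DBE) has been detected
[Mon May  4 10:01:00 2026] usb 1-1: new high-speed USB device
EOF
```

On a live node, ''%%dmesg -T | xid_codes%%'' would produce the same kind of list.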
==== 4.2 DCGM diagnostics ====
''dcgmi diag'' is the canonical "is this GPU healthy" tool — run it before blaming software.
dcgmi diag -r 1 # ~seconds, software checks
dcgmi diag -r 2 # ~2 min, includes targeted stress
dcgmi diag -r 3 # ~30 min, sustained workloads (memory bw, SM stress, NVLink)
dcgmi diag -r 4 # extended, hours
# Per-GPU policy alerts (set once, leave running)
dcgmi policy --set 0,0 -P 250 -T 95 -M 5 # violation on power>250W, temp>95C, or more than 5 retired pages
dcgmi policy --get -g 0
If ''dcgmi diag'' flags a GPU, capture the JSON:
dcgmi diag -r 3 -j > diag.json
==== 4.3 Container runtime problems ====
# Does the toolkit see the GPU?
nvidia-container-cli info
nvidia-container-cli list
# Containerd config wiring
crictl info | jq '.config.containerd.runtimes'
cat /etc/containerd/config.toml | grep -A3 nvidia
# Run a known-good test container outside of k8s to isolate
ctr run --rm --gpus 0 \
docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
For Docker:
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If this fails but ''nvidia-smi'' on the host works, the **container toolkit** (not the driver) is the problem.
==== 4.4 Performance bottleneck triage ====
When a job runs but is slow, walk the stack:
# 1. Is the GPU actually busy?
nvidia-smi dmon -s u -d 1
# sm/mem util both low -> CPU/IO bound; mem high, sm low -> memory bound
# 2. Are clocks being throttled? Look at the THROTTLE reasons.
nvidia-smi -q -d PERFORMANCE
# Watch for: HW Slowdown=Active, SW Thermal Slowdown, Power Brake
# 3. Power and thermal headroom
nvidia-smi --query-gpu=power.draw,power.limit,temperature.gpu --format=csv -l 1
# 4. Memory pressure / page retirement
nvidia-smi -q -d PAGE_RETIREMENT,ROW_REMAPPER,ECC
# 5. NVLink / PCIe link health
nvidia-smi nvlink -s # state of each link
nvidia-smi nvlink -e # error counters
nvidia-smi -q -d PIDS,CLOCK # processes and current clocks
lspci -s BUS_ID -vvv | grep -i "lnksta" # PCIe gen + width: is the link at its advertised speed?
# 6. Topology — wrong GPU-to-NIC binding kills multinode perf
nvidia-smi topo -m
ibstat # IB ports, rates
ibdev2netdev # IB device <-> netdev mapping
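The rule of thumb from step 1 can be written down as a tiny classifier. The thresholds (30% and 70%) are assumptions chosen for illustration, not NVIDIA guidance, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical classifier for the sm/mem columns of `nvidia-smi dmon -s u`.
# Thresholds are illustrative assumptions.
classify_util() {
  sm="$1"
  mem="$2"
  low=30
  high=70
  if [ "$sm" -lt "$low" ] && [ "$mem" -lt "$low" ]; then
    echo "likely CPU/IO bound"
  elif [ "$sm" -lt "$low" ] && [ "$mem" -ge "$high" ]; then
    echo "likely memory bound"
  else
    echo "no obvious starvation; check throttling and topology"
  fi
}

classify_util 5 8
classify_util 20 85
classify_util 95 60
```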
==== 4.5 Network layer (RDMA / InfiniBand / NCCL) ====
ibstat # link state, rate, port GUID
ibstatus
iblinkinfo # full fabric link summary
ibping -S # on one node: run as the ping server
ibping -G PORT_GUID # on another node: ping the server's port GUID
perfquery # port counters; rerun for deltas
ibdiagnet # subnet diagnosis (run on a single node)
# NCCL — almost every multi-GPU AI failure shows up here
NCCL_DEBUG=INFO mpirun -np 8 ./my_nccl_test
# Important env vars to know:
# NCCL_IB_HCA - which HCAs to use
# NCCL_SOCKET_IFNAME - which Ethernet IFs to allow
# NCCL_P2P_DISABLE - disable peer-to-peer (debug only)
# NCCL_IB_DISABLE - force TCP fallback (debug only)
# NCCL_TOPO_DUMP_FILE - dump the topology NCCL discovered
# Standard health test
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
==== 4.6 Scheduling and capacity failures ====
# K8s: pod won't schedule on GPU node
kubectl describe pod POD_NAME # bottom Events section is the answer
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Slurm: queue full but nodes "look" idle
sinfo -R # any drained nodes?
scontrol show node | grep -E "Reason|State|Gres"
sdiag # scheduler internals - cycle times, backlog
==== 4.7 BCM-level health checks ====
cmsh
[head]% monitoring
[head->monitoring]% measurable list # available metrics
[head->monitoring]% healthcheck list # built-in checks (mounts, gpu, ib, ...)
[head->monitoring]% latesthealthdata # current health snapshot
[head]% device latesthealthdata -c gpu-nodes
[head]% events # cluster-wide event stream
==== 4.8 The triage flowchart to memorize ====
^ Symptom ^ First command ^ Then ^
| ''nvidia-smi: command not found'' | ''which nvidia-smi'' | Is driver installed? |
| "No devices were found" | ''lsmod %%|%% grep nvidia'' | ''dmesg %%|%% grep -i nvidia'' |
| GPU shows in lspci, not nvidia-smi | ''dmesg -T %%|%% grep Xid'' | Driver / module / fallen off bus |
| Container can't see GPU | ''nvidia-container-cli info'' | Toolkit + runtime config |
| Pod Pending | ''kubectl describe pod'' | Resources / labels / taints |
| Slurm job PD forever | ''squeue'' + ''scontrol show job'' | GRES type? QoS? Reservation? |
| Job runs but slow | ''nvidia-smi dmon'' | Throttling reasons + topology |
| Multi-node training slow | ''NCCL_DEBUG=INFO'' | IB link state, topo, GPU<->NIC |
| ECC errors / Xid 48/63/64 | ''nvidia-smi -q -d ECC'' | Drain node, page retirement |
----
===== 5. Docker & Container Runtimes =====
Docker isn't a standalone domain on the NCP-AIO blueprint, but it's woven through Installation/Deployment and Troubleshooting. The exam treats Docker as the **substrate** — what gets tested is "does this container have GPU access?" and "how do I diagnose it when something's wrong?" Know the runtime layer for standalone GPU servers (Docker) and for Kubernetes nodes (containerd).
==== 5.1 The install order — memorize it ====
- **NVIDIA driver** (host kernel module)
- **Docker** (or containerd for k8s nodes)
- **NVIDIA Container Toolkit** (provides the ''nvidia'' runtime shim)
- **Configure the runtime** so Docker or containerd registers the ''nvidia'' runtime
# Configure Docker to use the nvidia runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Configure containerd (for Kubernetes nodes)
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
# Verify the runtime is wired in
docker info | grep -i runtime
cat /etc/containerd/config.toml | grep -A5 nvidia
==== 5.2 NGC registry authentication ====
Pulling NVIDIA-curated containers (PyTorch, TensorFlow, CUDA, RAPIDS, NeMo, Triton) from ''nvcr.io'' requires auth. **This is the classic trip-up question on the exam.**
docker login nvcr.io
# Username: $oauthtoken <-- LITERAL string, with the dollar sign
# Password: YOUR_NGC_API_KEY <-- generated at ngc.nvidia.com
# Or non-interactive (for scripts / cmsh image builds)
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
# Then pull
docker pull nvcr.io/nvidia/pytorch:24.10-py3
docker pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
docker pull nvcr.io/nvidia/tensorrt:24.10-py3
The username is literally ''%%$oauthtoken%%'' as a string — not your NVIDIA email, not a shell variable. Memorize this.
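The quoting matters because the shell would otherwise try to expand ''%%$oauthtoken%%'' as a variable. A short demonstration, assuming no variable of that name is set in your shell:

```shell
#!/bin/sh
unset oauthtoken                 # make sure no such shell variable exists
single='$oauthtoken'             # single quotes: kept literally
double="$oauthtoken"             # double quotes: expanded (to nothing here)

printf 'single-quoted -> [%s]\n' "$single"
printf 'double-quoted -> [%s]\n' "$double"
```

This is why the scripted login above passes the username in single quotes; with double quotes Docker would receive an empty username.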
==== 5.3 Running GPU containers ====
# Verify GPU access from a container (the smoke test)
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# GPU selection flags — know all the forms
docker run --gpus all ... # all GPUs
docker run --gpus 2 ... # any 2 GPUs
docker run --gpus '"device=0,2"' ... # specific GPU indices
docker run --gpus '"device=GPU-abc123..."' ... # by UUID
docker run --gpus '"device=MIG-d1..."' ... # by MIG slice UUID
# Pass through capabilities and environment
docker run --gpus all \
--env NVIDIA_VISIBLE_DEVICES=all \
--env NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-v /data:/workspace/data \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
nvcr.io/nvidia/pytorch:24.10-py3
The ''--ipc=host'' and ''ulimit'' flags come up because PyTorch/NCCL multi-process workloads need shared memory and unlocked memory limits to work right.
==== 5.4 Troubleshooting containers ====
# Inventory
docker ps # running containers
docker ps -a # all containers, including stopped/exited
docker images # local image cache
docker system df # disk usage by images/containers/volumes
# Inspection
docker logs CONTAINER # stdout/stderr
docker logs -f CONTAINER # follow
docker logs --tail 100 CONTAINER # last N lines
docker logs --since 10m CONTAINER # time-bounded
docker stats # live CPU/mem/net/IO per container
docker stats --no-stream # one-shot snapshot
docker inspect CONTAINER # full JSON config + state
docker inspect CONTAINER | jq '.[0].State' # exit code, OOMKilled, etc.
docker top CONTAINER # processes inside the container
docker exec -it CONTAINER bash # shell into it
# Daemon-level
docker events # live event stream from the daemon
journalctl -u docker -f # daemon logs via systemd
journalctl -u containerd -f # containerd logs (k8s nodes)
==== 5.5 Error messages to recognize on sight ====
^ Error ^ Cause ^
| ''%%could not select device driver "" with capabilities: [[gpu]]%%'' | NVIDIA Container Toolkit not installed/configured, or no GPU on host |
| ''Failed to initialize NVML: Driver/library version mismatch'' | Host driver was updated, container has stale ''libnvidia-ml.so'' loaded |
| ''nvidia-container-cli: initialization error: nvml error: driver/library version mismatch'' | Reboot the host — kernel module out of sync with userspace |
| ''OCI runtime create failed: ... unknown capability "nvidia"'' | Runtime not registered with Docker — rerun ''nvidia-ctk runtime configure'' |
| Container starts but ''nvidia-smi'' says ''No devices were found'' | ''NVIDIA_VISIBLE_DEVICES'' was overridden to ''none'' or ''void'' |
| ''docker: Error response from daemon: pull access denied for nvcr.io/...'' | Not logged into NGC, or API key expired |
==== 5.6 containerd specifics for Kubernetes ====
When a pod can't see a GPU on a node, the Docker-level commands don't apply — you're working with containerd and ''crictl''.
# Inspect what containerd sees
crictl info | jq '.config.containerd.runtimes'
crictl ps -a # all containers, including stopped
crictl logs CONTAINER_ID
crictl inspect CONTAINER_ID
# Smoke-test the runtime directly without k8s in the middle
ctr image pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
ctr run --rm --gpus 0 \
nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
If ''ctr run --gpus'' works but a pod doesn't, the problem is in the device plugin or k8s config — not the runtime.
==== 5.7 Practice scenario ====
> A user reports that a TensorFlow container they pulled from NGC starts but immediately exits with no logs when run with ''%%docker run --gpus all%%''. Diagnose.
# 1. Was there any output at all?
docker logs CONTAINER
docker inspect CONTAINER | jq '.[0].State'
# Look at ExitCode and OOMKilled
# 2. Was the toolkit even invoked?
docker info | grep -i runtime
# Should list nvidia among runtimes
# 3. Can a minimal container see the GPU?
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# 4. If the minimal works but TF doesn't, it's app-level. Check the image's entrypoint.
docker run --rm --gpus all -it --entrypoint bash nvcr.io/nvidia/tensorflow:TAG
# Inside, manually run the entrypoint to see what fails
# 5. Common gotchas:
# - Missing --ipc=host (silent crash on first NCCL call)
# - --memory limit below model size (OOMKilled, ExitCode=137)
# - Mounted dataset path doesn't exist inside the container
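Exit codes above 128 encode the signal that killed the process (code minus 128), which is how 137 points at SIGKILL and therefore usually at the OOM killer. A small sketch with a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: interpret a container exit code.
explain_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

explain_exit 137   # 137 - 128 = 9, i.e. SIGKILL
explain_exit 1
```

Pair the result with the ''OOMKilled'' field from ''docker inspect'' before concluding that memory was the cause, since other things can send SIGKILL too.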
----
===== 6. Run:ai (NVIDIA Run:ai) =====
Run:ai is the Kubernetes-native GPU orchestration platform NVIDIA acquired in 2024. Full Run:ai requires a license, so lab access is harder to come by — but the **conceptual model is highly testable**. Memorize the hierarchy, quota model, and CLI cold; you'll get scenario questions even without ever clicking through the UI.
Note: the open-source **KAI Scheduler** is the scheduling engine extracted from Run:ai (Apache 2.0). You can install it on your BCM cluster's Kubernetes to practice the scheduling concepts (queues, fair-share, gang scheduling) without a Run:ai license.
==== 6.1 The conceptual model ====
The hierarchy, top to bottom:
- **Tenant** — top-level isolation (usually one per organization)
- **Cluster** — a Kubernetes cluster managed by Run:ai
- **Department** — group of projects, gets a slice of cluster quota
- **Project** — the multi-tenancy unit; users belong to projects, workloads run inside them
- **Workload** — the actual thing running (workspace, training, inference, etc.)
==== 6.2 Quota model ====
This comes up constantly on the exam.
* **Deserved quota** (a.k.a. //guaranteed//) — GPUs your project is guaranteed when it asks. **Cannot be preempted.**
* **Over-quota** — opportunistic use of idle GPUs beyond your deserved quota. **Can be preempted** when another project wants its deserved share back.
* **Over-quota priority** — which project gets leftovers first when multiple projects compete for over-quota.
* **Over-quota weight** — proportional fair-share for over-quota allocation across projects with equal priority.
Example: Project A has deserved=4 GPUs, Project B has deserved=4 GPUs, and the cluster has 12 GPUs in total. If A is using 10 (4 deserved + 6 over-quota), only 2 GPUs are idle. When B suddenly asks for its full 4, Run:ai preempts 2 GPUs' worth of A's over-quota workloads to give B what it is owed.
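The arithmetic behind a preemption question is simply "requested minus idle, floored at zero". The sketch below is a deliberate simplification with a hypothetical function name and hypothetical cluster numbers; the real scheduler works on whole workloads and also weighs priority and over-quota weight.

```shell
#!/bin/sh
# Hypothetical helper: how many GPUs must be reclaimed from over-quota
# workloads so that a request within deserved quota can be satisfied.
gpus_to_preempt() {
  total="$1"
  in_use="$2"
  requested="$3"
  idle=$(( total - in_use ))
  short=$(( requested - idle ))
  if [ "$short" -gt 0 ]; then
    echo "$short"
  else
    echo 0
  fi
}

gpus_to_preempt 12 10 4   # 2 idle, 4 requested
gpus_to_preempt 12 6 4    # 6 idle, nothing to reclaim
```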
==== 6.3 Workload types ====
^ Type ^ Use case ^ Priority ^ Preemptible ^
| **Workspace** | Interactive dev (Jupyter, VSCode) | Low | Yes |
| **Training** | Batch training jobs | Medium | Yes |
| **Distributed Training** | Multi-node PyTorch/TF/MPI/XGBoost | Medium | Yes (gang) |
| **Inference** | Model serving with autoscaling | High | **No** |
**Priority order, low → high:** Workspace < Training < Inference. Higher-priority workloads preempt lower ones when GPUs are scarce.
**Gang scheduling** for distributed training: all worker pods start together or none do. No partial-start states.
==== 6.4 The runai CLI ====
# Authentication and context
runai login
runai config project PROJECT_NAME
runai list projects
runai list clusters
# Submitting — know the variants
runai submit-workspace --image jupyter/base-notebook --gpu 1
runai submit --image pytorch/pytorch --gpu 2 \
--command -- python train.py
runai submit-dist pytorch --workers 4 --gpu 1 \
--image nvcr.io/nvidia/pytorch:24.10-py3
# Fractional GPU — three different ways
runai submit --gpu 1 # whole GPU
runai submit --gpu-memory 4G # fraction by memory (4 GB)
runai submit --gpu-fraction 0.5 # fraction by share (50%)
# Targeting hardware
runai submit --node-pools "h100-pool"
runai submit --node-type "dgx-a100"
# Inspecting
runai list jobs
runai describe job JOB_NAME
runai logs JOB_NAME
runai logs JOB_NAME -f
runai bash JOB_NAME # exec into the pod
runai port-forward JOB_NAME --port 8888
# Lifecycle
runai delete job JOB_NAME
runai suspend job JOB_NAME
runai resume job JOB_NAME
The difference between ''%%--gpu 1%%'', ''%%--gpu-memory 4G%%'', and ''%%--gpu-fraction 0.5%%'' is a **classic exam question** — whole GPU vs fraction by memory vs fraction by share.
==== 6.5 Fractional GPU is NOT MIG ====
^ Property ^ MIG ^ Run:ai Fractional GPU ^
| Isolation | Hardware (memory, SMs, L2) | Software (runtime-enforced) |
| Supported GPUs | A100, A30, H100, H200, B200 | Any NVIDIA GPU |
| Slice sizes | Fixed profiles (1g.10gb, etc.) | Arbitrary (GB or fractional share) |
| Reconfiguration | Requires draining the GPU | Per-workload, no GPU reset |
| Memory protection | Hardware-enforced | Process-level via runtime |
If a question asks "tenant needs hardware-isolated GPU partitioning across an A100" → **MIG**. If it asks "team needs to run lots of small Jupyter notebooks on a V100/T4/L4" → **Run:ai Fractional GPU** (those GPUs don't support MIG).
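The decision rule can be written down as a short function to drill against. This is a study aid that encodes only the table above; the function name and its yes/no argument are hypothetical, and the MIG-capable list is the one given in this guide.

```shell
# recommend_partitioning GPU_MODEL NEEDS_HW_ISOLATION
# GPU_MODEL is a short name such as A100 or T4; NEEDS_HW_ISOLATION is yes or no.
recommend_partitioning() {
  local model="$1" hw_isolation="$2" mig_capable
  case "$model" in
    A100|A30|H100|H200|B200) mig_capable=yes ;;  # list taken from the table above
    *)                       mig_capable=no  ;;
  esac

  if [ "$hw_isolation" = yes ] && [ "$mig_capable" = yes ]; then
    echo "MIG"
  elif [ "$hw_isolation" = yes ]; then
    echo "no hardware-isolated option on $model"
  else
    echo "Run:ai Fractional GPU"
  fi
}

recommend_partitioning A100 yes   # prints MIG
recommend_partitioning T4 no      # prints Run:ai Fractional GPU
recommend_partitioning V100 yes   # prints no hardware-isolated option on V100
```

Note that on a MIG-capable GPU without an isolation requirement, either approach can work; the function defaults to Fractional GPU only to keep the sketch simple.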
==== 6.6 Node pools ====
Node pools segment the cluster by hardware capability.
# Typical pools in a heterogeneous cluster:
# a100-pool - A100 nodes, MIG-enabled
# h100-pool - H100 nodes
# l40-pool - inference-grade GPUs
# cpu-pool - CPU-only nodes for data prep
# Projects are granted access to specific pools, with separate quotas per pool.
# Workloads target pools at submission time.
runai submit train-large --node-pools "h100-pool" --gpu 8
runai submit prep-data --node-pools "cpu-pool" --cpu 16
==== 6.7 Architecture ====
* **Run:ai Control Plane** — the management layer. Can be **SaaS** (hosted by NVIDIA) or **self-hosted**. One Control Plane can manage many clusters.
* **Run:ai Cluster components** — installed via Helm onto each managed Kubernetes cluster:
* The scheduler (now based on KAI Scheduler)
* The Run:ai operator
* The fractional GPU runtime
* Monitoring agents
* Workload controllers (one per workload type)
Integrations to recognize: SSO via SAML/OIDC, LDAP, Kubeflow, MLflow, Argo Workflows, Ray, OpenShift.
==== 6.8 KAI Scheduler — the open-source path ====
Since full Run:ai requires a license, the open-source **KAI Scheduler** is the practical way to drill the scheduling concepts on your own cluster.
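# Install on an existing Kubernetes cluster
# (confirm the chart source against the KAI Scheduler README for the release
# you install; the published location may differ from the one shown here)
helm repo add nvidia-kai-scheduler https://nvidia.github.io/KAI-Scheduler
helm install kai-scheduler nvidia-kai-scheduler/kai-scheduler \
    -n kai-scheduler --create-namespace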
# Verify
kubectl get pods -n kai-scheduler
What you can practice with KAI alone (no Run:ai license needed):
* Queue hierarchies and weighted fair-share
* Gang scheduling for distributed pods
* Pod priority and preemption
* Resource quotas across queues
* All of the above on a cluster without GPUs: CPU-only pods exercise the same scheduler logic
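A minimal sketch of a CPU-only practice pod follows. It assumes a queue named ''test'' already exists, that the scheduler is registered as ''kai-scheduler'', and that pods are assigned to a queue with the ''kai.scheduler/queue'' label; the label key in particular may differ between KAI releases, so check the quickstart for the version you installed. The helper name ''render_pod'' is hypothetical.

```shell
# render_pod NAME QUEUE
# Prints a CPU-only pod manifest that asks the KAI scheduler to place it.
render_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
  labels:
    kai.scheduler/queue: $2   # assumed label key; verify for your release
spec:
  schedulerName: kai-scheduler
  containers:
    - name: main
      image: ubuntu
      args: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 100m
          memory: 250M
EOF
}

# Print the manifest; nothing is applied to a cluster here.
render_pod cpu-only-pod test
```

To submit it, pipe the output to ''kubectl apply -f -'' and watch the result with ''kubectl get pod cpu-only-pod -o wide''. Rendering several pods into queues with different quotas is a cheap way to observe fair-share and preemption without any GPUs.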
==== 6.9 Practice scenario ====
> A data science team complains their workspaces keep getting killed at 3 AM when training jobs run. Diagnose.
This is **expected behavior**, not a bug. Workspaces are low-priority and preemptible; training jobs have higher priority. When another project's training job needs GPUs that the data science project holds only as over-quota (beyond its deserved quota), Run:ai preempts the workspaces to reclaim them. Options:
- Increase the data science project's **deserved quota** so the workspaces are no longer in over-quota territory.
- Move long-running data science work into **training-type workloads** (higher priority than workspaces).
- Add **node pool** segregation: a dedicated pool for interactive work that batch training jobs cannot be scheduled onto.
- Schedule batch training jobs to a **different time window** via job priority/scheduling rules.
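The first option can be checked with simple arithmetic: only the GPUs a project holds beyond its deserved quota are exposed to reclamation. The sketch below is a simplified model of that rule, not Run:ai's implementation; the function name and the example numbers are hypothetical.

```shell
# over_quota ALLOCATED DESERVED
# Prints how many of a project's GPUs sit beyond its deserved quota.
over_quota() {
  local allocated="$1" deserved="$2"
  if [ "$allocated" -gt "$deserved" ]; then
    echo $(( $allocated - $deserved ))
  else
    echo 0
  fi
}

over_quota 6 4   # 6 GPUs in use, quota 4 -> prints 2 (reclaimable)
over_quota 6 8   # quota raised to 8      -> prints 0 (nothing over quota)
```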
----
===== Final exam prep tips =====
* **Practice in cmsh blind.** The web UI is not what's being tested. Build muscle memory for ''device'', ''category'', ''softwareimage'', ''user'', ''monitoring'', and ''commit''.
* **Know the difference** between ''imageupdate'' (live sync) and ''reinstall'' (full PXE) — that's a classic question.
* **Memorize the GPU Operator components in order** and what each one does. If one is broken, what symptom appears?
* **Know MIG strategies** (''single'' vs ''mixed'') and how to request slices in both Kubernetes (''nvidia.com/mig-1g.10gb'') and Slurm (''%%--gres=gpu:1g.10gb:1%%'').
* **Recognize Xid codes**, at least 13, 31, 48, 63, 79, and 94/95.
* **''dcgmi diag -r {1,2,3,4}''** levels — the time and depth differences come up.
* **Slurm reason codes** — ''Resources'', ''Priority'', ''ReqNodeNotAvail'', ''AssocGrpGRESLimit'', ''QOSMaxGRESPerUser'', ''ReqGresTypeNotAvail''.
* **NCCL env vars** — ''NCCL_DEBUG'', ''NCCL_IB_HCA'', ''NCCL_SOCKET_IFNAME''.
* **NGC auth — username is literally ''%%$oauthtoken%%''**, password is the NGC API key. This trips people up every exam cycle.
* **Docker ''%%--gpus%%'' flag variants** — ''all'', ''2'', ''%%"device=0,2"%%'', by UUID, by MIG UUID.
* **The "could not select device driver" error** = NVIDIA Container Toolkit missing or no GPU on host.
* **Run:ai quota model** — //deserved// is guaranteed, //over-quota// is preemptible. Inference > Training > Workspace in priority.
* **Fractional GPU ≠ MIG:** Fractional is software-enforced and works on any NVIDIA GPU; MIG is hardware-enforced and available only on MIG-capable GPUs (A100, A30, H100, H200, B200).
* **runai submit flags** — ''%%--gpu 1%%'' (whole), ''%%--gpu-memory 4G%%'' (fraction by memory), ''%%--gpu-fraction 0.5%%'' (fraction by share).
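One practical trap behind the NGC auth tip above: in a shell, the username must be single-quoted, otherwise ''$oauthtoken'' is expanded as a (normally unset) variable and an empty username is sent. A minimal sketch, assuming the API key is held in a hypothetical ''NGC_API_KEY'' environment variable:

```shell
# Single quotes stop the shell from expanding $oauthtoken as a variable.
ngc_user='$oauthtoken'
printf 'username sent to nvcr.io: %s\n' "$ngc_user"

# NGC_API_KEY is a hypothetical environment variable holding your API key.
# Uncomment on a host with Docker installed:
# printf '%s' "$NGC_API_KEY" | docker login nvcr.io --username "$ngc_user" --password-stdin
```

Reading the key from stdin keeps it out of the shell history and the process list.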
Good luck — build the labs, break them on purpose, fix them, repeat.