====== NCP-AIO Lab Walkthrough Study Guide ======

A hands-on command reference for the four major lab domains on the NVIDIA Certified Professional — AI Operations exam. Work through each domain as a sequence: review the concept, run the commands, then test yourself with the troubleshooting scenarios at the end of each section.

----

===== 1. Base Command Manager (BCM) & Cluster Administration =====

BCM (formerly Bright Cluster Manager) is NVIDIA's central control plane for HPC/AI clusters. The two interfaces you must know are **''cmsh''** (the cluster management shell) and **Base View** (the web UI). The exam leans heavily on ''cmsh''.

==== 1.1 Getting into BCM ====

<code bash>
# SSH to the head node, then drop into the management shell
ssh root@head-node
cmsh

# Inside cmsh, get help and orient yourself
[head]% help
[head]% main             # return to the top-level prompt
[head]% status           # overall cluster status (services, license, HA)
[head]% partition use base
</code>

cmsh uses **modes**. You enter a mode (''device'', ''category'', ''softwareimage'', ''user'', etc.), then operate on objects inside it. After making changes, you **must** ''commit'' — pending changes show with an asterisk in the prompt.
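
The same mode/commit pattern can be scripted, because ''cmsh -c'' accepts a semicolon-separated command string. The sketch below only //prints// the command so it can be reviewed before it is run on a head node; the helper name ''cmsh_set_category'' and the node and category names are hypothetical.

<code bash>
#!/usr/bin/env bash
# Build (but do not execute) a one-shot cmsh command that moves a node
# into a category and commits the change.
# $1 = node name, $2 = category name (both illustrative).
cmsh_set_category() {
  printf 'cmsh -c "device; use %s; set category %s; commit"\n' "$1" "$2"
}

cmsh_set_category node001 gpu-nodes
</code>

Forgetting the trailing ''commit'' in a scripted command has the same effect as forgetting it interactively: the change is never applied.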

==== 1.2 Deploying a system (provisioning a compute node) ====

The typical provisioning flow: pick or create a **software image**, attach it to a **category**, assign nodes to that category, then power them on so they PXE-boot and provision.

<code bash>
cmsh
[head]% softwareimage
[head->softwareimage]% list
[head->softwareimage]% clone default-image gpu-image
[head->softwareimage]% use gpu-image
[head->softwareimage[gpu-image]]% set kernelversion 5.15.0-...
[head->softwareimage[gpu-image]]% commit

# Create a category that uses this image
[head]% category
[head->category]% add gpu-nodes
[head->category[gpu-nodes*]]% set softwareimage gpu-image
[head->category[gpu-nodes*]]% commit

# Assign a node to the category and provision it
[head]% device
[head->device]% use node001
[head->device[node001]]% set category gpu-nodes
[head->device[node001]]% commit
[head->device[node001]]% power on        # or: reset, off
[head->device[node001]]% status          # watch state: INSTALLING -> UP

# Useful one-liners (run from cmsh prompt)
[head]% device list                       # all devices and state
[head]% device status -c gpu-nodes        # status filtered by category
[head]% device foreach -c gpu-nodes (status)
</code>

Outside cmsh, the equivalent inspection commands include:

<code bash>
cmha status              # HA status if head nodes are paired
module load shared       # load the BCM environment module
pdsh -g category=gpu-nodes uptime   # parallel ssh across a category
</code>

==== 1.3 Aligning baseline images ====

"Aligning" means making sure the running nodes match the software image on the head node. After you edit packages or files in the image, you sync them out.

<code bash>
# Update packages inside an image (chroot-style)
cm-chroot-sw-img /cm/images/gpu-image
# ... inside the chroot ...
yum install -y datacenter-gpu-manager
exit

# Push the image to running nodes — choose ONE depending on the situation
[head]% device use node001
[head->device[node001]]% imageupdate                # dry run by default; review the result with synclog
[head->device[node001]]% imageupdate -w             # incremental sync (live), actually writes changes
[head->device[node001]]% reinstall                  # full PXE reprovision

# Verify nodes are in sync with the image
[head]% device imageupdate -c gpu-nodes             # dry run: show what would change
</code>

Know the difference cold for the exam: **''imageupdate''** is a live rsync of the image to a running node (fast, but not all changes apply without reboot). **''reinstall''** wipes and PXE-provisions from scratch (clean but slow).

==== 1.4 Firmware updates ====

<code bash>
# Check current firmware across a category
[head]% device foreach -c gpu-nodes (get bmcsettings)

# BCM ships cm-update-firmware for BMC/BIOS rollouts
cm-update-firmware --help
cm-update-firmware --category gpu-nodes --firmware-package /root/fw/bios-1.2.bin

# GPU firmware (e.g., VBIOS) is usually handled by nvidia-smi or vendor tools
nvidia-smi -q | grep -i "vbios"
# Vendor flashing is typically done in a maintenance window with nodes drained
</code>

==== 1.5 User and permission management ====

<code bash>
[head]% user
[head->user]% add alice
[head->user[alice*]]% set commonname "Alice Researcher"
[head->user[alice*]]% set groupname researchers
[head->user[alice*]]% set password           # interactive
[head->user[alice*]]% commit

# Profiles / RBAC — assign what a user can do in BCM itself
[head]% profile list
[head]% profile use readonly
[head->profile[readonly]]% show
[head]% user use alice
[head->user[alice]]% set profile readonly
[head->user[alice]]% commit
</code>

Built-in profiles to remember: **admin**, **readonly**, **portal**, **cloudjob**. You can clone and customize them with ''profile clone admin custom-admin''.

==== 1.6 Reading baseline metrics with nvidia-smi and dcgmi ====

''nvidia-smi'' is the per-node quick look. ''dcgmi'' is the command-line client for DCGM (Data Center GPU Manager), the cluster-grade tool: it talks to the ''nv-hostengine'' service and supports persistent health checks, job-level stats, and policy.

<code bash>
# nvidia-smi — fast situational awareness
nvidia-smi                                      # default table
nvidia-smi -q                                   # exhaustive details
nvidia-smi -q -d TEMPERATURE,POWER,CLOCK,ECC    # only the sections you need
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,temperature.gpu \
           --format=csv -l 1                    # CSV, refresh every 1s
nvidia-smi dmon -s pucvmet -d 1                 # device monitor: p=power/temp u=util c=clocks v=violations m=memory e=ECC t=PCIe throughput
nvidia-smi pmon -c 5                            # per-process, 5 samples
nvidia-smi topo -m                              # GPU/NIC topology matrix (NVLink, PIX, SYS)
nvidia-smi -pm 1                                # persistence mode on (keep driver loaded)
nvidia-smi -i 0 -ac 1215,1410                   # set memory,graphics application clocks

# dcgmi — cluster-grade health and diagnostics
systemctl status nvidia-dcgm                    # the host engine must be running
dcgmi discovery -l                              # list GPUs + entity IDs
dcgmi group -c mygroup --default                # create a group with all GPUs
dcgmi group -l                                  # list groups
dcgmi health -g <group_id> -s mpi               # set health watches: m=memory p=PCIe i=InfoROM (a=all, t=thermal/power, n=NVLink)
dcgmi health -g <group_id> -c                   # check current health
dcgmi diag -r 1                                 # quick diagnostic (~seconds)
dcgmi diag -r 2                                 # medium (~2 min)
dcgmi diag -r 3                                 # long (~15-30 min, real workload-like)
dcgmi diag -r 4                                 # extra long, stress-test grade
dcgmi dmon -e 203,204,250,252 -d 1000           # monitor specific field IDs every 1000ms
dcgmi stats -g <group_id> -e                    # enable job stats collection
dcgmi stats -g <group_id> -s myjob              # start tracking job "myjob"
dcgmi stats -x myjob                            # stop tracking
dcgmi stats -j myjob -v                         # print the job summary (verbose)
</code>

Field IDs worth memorizing: **100** SM clock, **101** memory clock, **150** GPU temperature, **155** power usage, **203** GPU utilization, **204** memory copy utilization, **250** framebuffer total, **251** framebuffer free, **252** framebuffer used, **1001+** profiling metrics (DCGM Profiling: SM active, tensor active, etc.).
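
As a memorization aid, a small lookup table can expand the numeric IDs passed to ''dcgmi dmon -e'' into their symbolic names. This is a sketch covering only a subset; the names follow DCGM's ''dcgm_fields.h'' as best I know them, and the helper ''dcgm_field_name'' is hypothetical. Confirm against ''dcgmi dmon -l'' on your own installation.

<code bash>
#!/usr/bin/env bash
# Map a DCGM field ID to its symbolic name (subset only).
# Returns non-zero for IDs that are not in this table.
dcgm_field_name() {
  case "$1" in
    100)  echo "DCGM_FI_DEV_SM_CLOCK" ;;
    101)  echo "DCGM_FI_DEV_MEM_CLOCK" ;;
    150)  echo "DCGM_FI_DEV_GPU_TEMP" ;;
    155)  echo "DCGM_FI_DEV_POWER_USAGE" ;;
    203)  echo "DCGM_FI_DEV_GPU_UTIL" ;;
    204)  echo "DCGM_FI_DEV_MEM_COPY_UTIL" ;;
    250)  echo "DCGM_FI_DEV_FB_TOTAL" ;;
    251)  echo "DCGM_FI_DEV_FB_FREE" ;;
    252)  echo "DCGM_FI_DEV_FB_USED" ;;
    1002) echo "DCGM_FI_PROF_SM_ACTIVE" ;;
    1004) echo "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE" ;;
    *)    echo "unknown"; return 1 ;;
  esac
}

# Expand the IDs used in the dmon example into names.
for id in 203 204 250 252; do
  printf '%s\t%s\n' "$id" "$(dcgm_field_name "$id")"
done
</code>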

==== 1.7 Practice scenario ====

> A user reports node005 is in the cluster but jobs aren't landing on it. Walk through the diagnosis.

<code bash>
cmsh
[head]% device use node005
[head->device[node005]]% status                 # is it UP / CLOSED / DOWN / INSTALLER_FAILED?
[head->device[node005]]% get category           # right category?
[head->device[node005]]% power status           # power state as reported by the BMC
[head->device[node005]]% latesthealthdata       # BCM health checks
[head->device[node005]]% events                 # recent events for this node
# If healthy in BCM but Slurm avoids it, check the scheduler (section 3)
</code>

----

===== 2. Kubernetes for AI Workloads =====

The NCP-AIO exam tests the NVIDIA GPU Operator stack: **Node Feature Discovery → Driver → Container Toolkit → Device Plugin → DCGM Exporter → MIG Manager → GPU Feature Discovery**. You should be able to install it, inspect each component, and request GPUs in a pod spec.

==== 2.1 Install the GPU Operator ====

<code bash>
# Prereqs: containerd or CRI-O, a running cluster, helm 3+, kubectl
kubectl create ns gpu-operator
kubectl label ns gpu-operator pod-security.kubernetes.io/enforce=privileged

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Most common install: let the operator manage drivers AND toolkit
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true

# If drivers are already on the host, disable the operator's driver
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --set driver.enabled=false
</code>

==== 2.2 Verify every layer is healthy ====

<code bash>
kubectl get pods -n gpu-operator                # all components Running/Completed
kubectl get nodes -o json | jq '.items[].status.capacity' | grep nvidia.com
# Expect: "nvidia.com/gpu": "8" (or similar)

kubectl describe node <gpu-node> | grep -A5 -i "nvidia.com"
kubectl get clusterpolicies -n gpu-operator -o yaml | less   # the operator's CR

# Look at GPU Feature Discovery labels (they drive scheduling)
kubectl get node <gpu-node> -o json | jq '.metadata.labels' | grep nvidia
# Examples: nvidia.com/gpu.product=A100-SXM4-80GB
#           nvidia.com/cuda.driver.major=535
#           nvidia.com/mig.strategy=single
</code>

==== 2.3 Request GPUs in a Pod ====

<code yaml>
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
</code>

<code bash>
kubectl apply -f cuda-smoke-test.yaml
kubectl logs cuda-smoke-test                   # should show the GPU table
</code>

==== 2.4 MIG with the GPU Operator ====

MIG is managed by the **MIG Manager** component. Switch a node into a profile by labeling it:

<code bash>
# List the available MIG configs from the configmap
kubectl get cm -n gpu-operator default-mig-parted-config -o yaml

# Apply a profile to a node — the MIG Manager will partition the GPU
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite

# Watch the MIG Manager pod do its work
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -f

# After it succeeds, the node advertises sliced resources
kubectl describe node <gpu-node> | grep nvidia.com/mig
# e.g. nvidia.com/mig-1g.10gb: 7
</code>

With the **mixed** strategy, a pod requests the sliced resource by name:

<code yaml>
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
</code>

Two **MIG strategies** are set at install time (''mig.strategy''); know both:
  * **single**: every GPU on the node uses the same MIG profile, and the slices are advertised as plain ''nvidia.com/gpu''.
  * **mixed**: each profile is advertised as its own resource (e.g. ''nvidia.com/mig-1g.10gb'', ''nvidia.com/mig-2g.20gb'').
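
The practical consequence of the two strategies is the resource name a pod has to put under ''resources.limits''. A minimal sketch of that mapping follows; the helper ''mig_resource_name'' is hypothetical and only encodes the naming rule described above.

<code bash>
#!/usr/bin/env bash
# Print the Kubernetes resource name a pod must request for a MIG slice.
# $1 = MIG strategy (single|mixed), $2 = MIG profile such as 1g.10gb.
mig_resource_name() {
  case "$1" in
    single) echo "nvidia.com/gpu" ;;          # slices look like ordinary GPUs
    mixed)  echo "nvidia.com/mig-$2" ;;       # one resource per profile
    *)      echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

mig_resource_name single 1g.10gb
mig_resource_name mixed  2g.20gb
</code>

If a pod requests ''nvidia.com/mig-1g.10gb'' on a cluster installed with the single strategy, it stays Pending because no node advertises that resource.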

==== 2.5 Time-slicing (oversubscribe a GPU without MIG) ====

<code yaml>
# ConfigMap consumed by the device plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
</code>

<code bash>
kubectl apply -f time-slicing-config.yaml
# Tell the cluster policy to use it
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
</code>

Each physical GPU now advertises 4 logical GPUs — useful for inference and dev workloads but **not** isolated like MIG.
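
The capacity a node reports after time-slicing is simply physical GPUs multiplied by ''replicas''. A short sketch of that arithmetic, with a hypothetical helper name; it assumes every GPU on the node uses the same replica count.

<code bash>
#!/usr/bin/env bash
# Number of nvidia.com/gpu resources a node advertises under time-slicing.
# $1 = physical GPUs on the node, $2 = replicas from the ConfigMap.
advertised_gpus() {
  echo $(( $1 * $2 ))
}

advertised_gpus 8 4    # an 8-GPU node with replicas: 4
</code>

The number is a scheduling fiction: memory and compute are still shared, so four pods on one GPU can starve each other.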

==== 2.6 DCGM Exporter and metrics ====

The DCGM exporter pod scrapes GPU metrics and exposes them on ''%%:9400/metrics%%'' for Prometheus.

<code bash>
kubectl get svc -n gpu-operator nvidia-dcgm-exporter
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
</code>

==== 2.7 Troubleshooting checklist ====

<code bash>
# Pod stuck Pending — almost always a scheduling/resource issue
kubectl describe pod <name> | tail -30
# Look for: "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"

# Operator components crash-looping
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator <pod>
kubectl logs -n gpu-operator <pod> --previous     # crashed container

# Driver pod stuck — check kernel module compile
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Toolkit not wired into containerd
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset
# Verify the runtime config on the host:
cat /etc/containerd/config.toml | grep -A5 nvidia

# Validation pod runs at the end of install — its logs prove the stack works
kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c nvidia-operator-validator
</code>

----

===== 3. Slurm & Workload Management =====

Slurm is the dominant HPC scheduler. The exam tests GPU job scheduling via **GRES**, MIG-aware scheduling, and queue troubleshooting.

==== 3.1 Core commands ====

<code bash>
sinfo                            # partition + node state overview
sinfo -N -l                      # per-node detail
sinfo -o "%P %N %G %C %t"        # partition, nodes, GRES, CPUs, state

squeue                           # current queue
squeue -u alice                  # per user
squeue --start                   # estimated start times

scontrol show node nodeA100-01   # full node detail (GRES, state, reason)
scontrol show job 12345          # full job detail
scontrol show partition gpu

sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqTRES,AllocTRES
sacct -S 2026-05-01 -u alice -X  # accounting since a date
sreport cluster utilization start=2026-05-01
</code>

==== 3.2 Submitting GPU jobs ====

<code bash>
# Interactive single-GPU shell
srun --gres=gpu:1 --pty bash

# Interactive, target a specific GPU model
srun --gres=gpu:a100:2 --pty bash
</code>
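
The ''--gres'' value follows the pattern ''name[:type][:count]''. The sketch below splits such a string into its parts, which helps when reading job requests during troubleshooting. The helper ''parse_gres'' is hypothetical and handles only the simple forms used in this guide.

<code bash>
#!/usr/bin/env bash
# Split a --gres value such as "gpu:a100:4", "gpu:2" or "gpu:a100".
# The function body is a subshell, so its variables stay local.
parse_gres() (
  spec=$1
  name=${spec%%:*}               # text before the first colon
  rest=${spec#"$name"}           # whatever follows the name
  rest=${rest#:}
  type=any
  count=1                        # Slurm treats a missing count as 1
  case "$rest" in
    "")        ;;                                        # "gpu"
    *:*)       type=${rest%%:*}; count=${rest#*:} ;;     # "a100:4"
    *[!0-9]*)  type=$rest ;;                             # "a100"
    *)         count=$rest ;;                            # "2"
  esac
  printf 'name=%s type=%s count=%s\n' "$name" "$type" "$count"
)

parse_gres gpu:a100:4
parse_gres gpu:1
</code>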

Batch script:

<file bash train.sbatch>
#!/bin/bash
#SBATCH --job-name=resnet
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=04:00:00
#SBATCH --output=resnet-%j.out

module load cuda/12.4
nvidia-smi
srun python train.py
</file>

<code bash>
sbatch train.sbatch
</code>

Cancel and inspect:

<code bash>
scancel 12345
scancel -u alice                 # all jobs by alice
scancel -t PENDING -u alice      # only pending
</code>

==== 3.3 Configuring GRES for GPUs ====

Two files matter, and both must be present on the controller and on every compute node.

**''/etc/slurm/gres.conf''** (per-node, declares what hardware exists):

<file - gres.conf>
# /etc/slurm/gres.conf on nodeA100-01
Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia1 Cores=0-15
Name=gpu Type=a100 File=/dev/nvidia2 Cores=16-31
Name=gpu Type=a100 File=/dev/nvidia3 Cores=16-31
</file>

**''/etc/slurm/slurm.conf''** (cluster-wide, declares GRES is in use):

<file - slurm.conf>
GresTypes=gpu
NodeName=nodeA100-[01-08] Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
PartitionName=gpu Nodes=nodeA100-[01-08] Default=YES MaxTime=24:00:00 State=UP
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageTRES=gres/gpu
</file>

Push and reload:

<code bash>
# After editing slurm.conf / gres.conf on the controller, sync to nodes,
# then reconfigure without restart if possible:
scontrol reconfigure
# Heavier changes (adding/removing nodes, changing a node's Gres=) typically need a restart:
systemctl restart slurmctld         # on controller
systemctl restart slurmd            # on each node
</code>

==== 3.4 MIG + Slurm ====

When MIG is enabled, each slice should appear as its own GRES type so Slurm can schedule against it.

<file - gres.conf (MIG)>
# gres.conf — one line per slice
Name=gpu Type=1g.10gb File=/dev/nvidia-caps/nvidia-cap21
Name=gpu Type=1g.10gb File=/dev/nvidia-caps/nvidia-cap30
Name=gpu Type=2g.20gb File=/dev/nvidia-caps/nvidia-cap39
</file>

<file - slurm.conf (MIG)>
NodeName=mig-node-01 Gres=gpu:1g.10gb:6,gpu:2g.20gb:1 ...
</file>

Submit against a slice:

<code bash>
srun --gres=gpu:1g.10gb:1 --pty nvidia-smi -L
</code>

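The actual ''/dev/nvidia-caps/'' minor numbers come from the ''access'' files under ''%%/proc/driver/nvidia/capabilities/gpu<N>/mig/%%'' once MIG instances have been created with ''nvidia-smi mig -cgi ... -C''. On Slurm 21.08 and later, ''AutoDetect=nvml'' in ''gres.conf'' can discover MIG slices instead of listing them by hand.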

==== 3.5 Why is my job stuck? The reason codes ====

<code bash>
squeue --format="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
# The last column is the REASON. Common ones:
#   Resources             - waiting for nodes (normal queueing)
#   Priority              - other higher-priority jobs ahead
#   ReqNodeNotAvail       - requested node is down/drained
#   AssocGrpGRES          - hit a per-association GRES quota
#   QOSMaxGRESPerUser     - QoS limit
#   PartitionTimeLimit    - asked for more time than partition allows
#   Dependency            - waiting on another job
#   InvalidQOS            - QoS not granted to user
#   (asking for gpu:h100 when only a100 exists is normally rejected at submit
#    time with "Requested node configuration is not available")
</code>
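
One way to drill the reason codes is to pair each with the next command you would run. The mapping below is a study sketch, not Slurm behavior: the helper ''reason_hint'' and the suggested follow-ups are my own choices, built from commands already used in this section plus ''sprio''.

<code bash>
#!/usr/bin/env bash
# Suggest a next step for a squeue REASON value.
# $1 = the reason string from the last squeue column.
reason_hint() {
  case "$1" in
    Resources)          echo "normal queueing: check sinfo for free nodes and GRES" ;;
    Priority)           echo "higher-priority jobs ahead: check sprio" ;;
    Dependency)         echo "waiting on another job: check scontrol show job" ;;
    ReqNodeNotAvail*)   echo "requested node down or drained: check sinfo -R" ;;
    PartitionTimeLimit) echo "time request exceeds partition limit: check scontrol show partition" ;;
    InvalidQOS)         echo "QoS not granted: check sacctmgr show assoc" ;;
    Assoc*|QOS*)        echo "association or QoS limit: check sacctmgr show assoc and qos" ;;
    *)                  echo "unrecognized reason: see the squeue man page" ;;
  esac
}

reason_hint Priority
reason_hint QOSMaxGRESPerUser
</code>

The ''ReqNodeNotAvail*'' pattern uses a wildcard because Slurm often appends the list of unavailable nodes to that reason.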

When a node is DRAIN or DOWN:

<code bash>
sinfo -R                              # nodes in drain + the reason
scontrol show node nodeA100-03 | grep -E "State|Reason"
scontrol update NodeName=nodeA100-03 State=RESUME   # bring it back after fixing
scontrol update NodeName=nodeA100-03 State=DRAIN Reason="bad GPU"
</code>

==== 3.6 Practice scenario ====

> A 4-GPU H100 job sits in PD with reason ''Resources'' forever, even though ''sinfo'' shows idle H100 nodes. Diagnose.

<code bash>
# 1. Confirm the job actually requested the right TYPE
scontrol show job <id> | grep -E "TresPerNode|Gres"

# 2. Confirm Slurm knows the nodes have h100 GRES (not just "gpu")
scontrol show node nodeH100-01 | grep -E "Gres|CfgTRES"
# If you see Gres=gpu:4 (no type), gres.conf is missing Type=h100

# 3. Check for QoS / association limits eating the job
sacctmgr show assoc user=$USER format=Account,User,QOS,GrpTRES,MaxTRES
sacctmgr show qos format=Name,GrpTRES,MaxTRESPerUser

# 4. Check for reservations blocking the partition
scontrol show reservation
</code>

----

===== 4. System Troubleshooting & Optimization =====

The exam puts you in front of a broken cluster and asks for the **next command**. Build a mental flowchart: GPU → driver → container runtime → scheduler → network.

==== 4.1 GPU and driver layer ====

<code bash>
# Is the hardware visible?
lspci | grep -i nvidia

# Is the driver loaded?
lsmod | grep nvidia
nvidia-smi                            # if this fails, no driver / wrong driver

# Kernel messages — ECC errors, Xid events, fallen-off-bus, thermal
dmesg -T | grep -iE "nvidia|nvrm|xid"
journalctl -k --since "1 hour ago" | grep -i nvidia

# Xid codes you should know on sight:
#   13   - Graphics Engine Exception (often app bug or bad memory)
#   31   - GPU memory page fault (illegal address in CUDA code)
#   43   - Reset channel verif error
#   48   - Double-bit ECC error (uncorrectable, GPU usually needs reset)
#   63/64 - Row-remapper recording/failure (HBM page retirement)
#   74   - NVLink error
#   79   - GPU fell off the bus (PCIe/power — often hardware)
#   92   - High single-bit ECC (correctable but worth watching)
#   94/95 - Contained/uncontained ECC error
#   119  - GSP RPC timeout
nvidia-smi -q -d ECC,PAGE_RETIREMENT,ROW_REMAPPER

# Generate a full bug report (attach to support tickets)
nvidia-bug-report.sh
</code>
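
When scanning ''dmesg'' output it helps to pull out just the Xid number and sort it into a rough bucket. The sketch below does that for a single log line. The sample line is made up for illustration and only follows the general ''NVRM: Xid (...): <code>,'' shape; the bucket names and the grouping are my own reading of the list above, not an NVIDIA classification.

<code bash>
#!/usr/bin/env bash
# Extract the Xid code from one kernel log line (prints nothing if absent).
xid_from_line() {
  printf '%s\n' "$1" | sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Sort an Xid code into a rough triage bucket (illustrative grouping).
xid_bucket() {
  case "$1" in
    13|31|43)                 echo "check-application" ;;
    48|63|64|74|79|92|94|95)  echo "hardware-suspect" ;;
    119)                      echo "driver-or-firmware" ;;
    *)                        echo "look-up" ;;
  esac
}

# Hypothetical sample line, not captured from a real system.
line='[Mon Jan  1 00:00:00 2024] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.'
xid=$(xid_from_line "$line")
printf 'Xid %s -> %s\n' "$xid" "$(xid_bucket "$xid")"
</code>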

==== 4.2 DCGM diagnostics ====

''dcgmi diag'' is the canonical "is this GPU healthy" tool — run it before blaming software.

<code bash>
dcgmi diag -r 1            # ~seconds, software checks
dcgmi diag -r 2            # ~2 min, includes targeted stress
dcgmi diag -r 3            # ~30 min, sustained workloads (memory bw, SM stress, NVLink)
dcgmi diag -r 4            # extended, hours

# Per-GPU policy alerts (set once, leave running)
dcgmi policy -g 0 --set 0,0 -P 250 -T 95 -M 5   # violation on power>250W, temp>95C, more than 5 retired pages
dcgmi policy --get -g 0
</code>

If ''dcgmi diag'' flags a GPU, capture the JSON:

<code bash>
dcgmi diag -r 3 -j > diag.json
</code>

==== 4.3 Container runtime problems ====

<code bash>
# Does the toolkit see the GPU?
nvidia-container-cli info
nvidia-container-cli list

# Containerd config wiring
crictl info | jq '.config.containerd.runtimes'
cat /etc/containerd/config.toml | grep -A3 nvidia

# Run a known-good test container outside of k8s to isolate
ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # ctr does not pull implicitly
ctr run --rm --gpus 0 \
  docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
</code>

For Docker:

<code bash>
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
</code>

If this fails but ''nvidia-smi'' on the host works, the **container toolkit** (not the driver) is the problem.

==== 4.4 Performance bottleneck triage ====

When a job runs but is slow, walk the stack:

<code bash>
# 1. Is the GPU actually busy?
nvidia-smi dmon -s u -d 1
# sm/mem util both low -> CPU/IO bound; mem high, sm low -> memory bound

# 2. Are clocks being throttled? Look at the THROTTLE reasons.
nvidia-smi -q -d PERFORMANCE
# Watch for: HW Slowdown=Active, SW Thermal Slowdown, Power Brake

# 3. Power and thermal headroom
nvidia-smi --query-gpu=power.draw,power.limit,temperature.gpu --format=csv -l 1

# 4. Memory pressure / page retirement
nvidia-smi -q -d PAGE_RETIREMENT,ROW_REMAPPER,ECC

# 5. NVLink / PCIe link health
nvidia-smi nvlink -s                 # state of each link
nvidia-smi nvlink -e                 # error counters
nvidia-smi -q -d PIDS,CLOCK          # processes and current clocks
lspci -s <bdf> -vvv | grep -i "lnksta"  # PCIe gen + width — is it at advertised speed?

# 6. Topology — wrong GPU-to-NIC binding kills multinode perf
nvidia-smi topo -m
ibstat                               # IB ports, rates
ibdev2netdev                         # IB device <-> netdev mapping
</code>
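
The rule of thumb in step 1 can be written down as a tiny classifier over the ''sm'' and ''mem'' columns of ''nvidia-smi dmon -s u''. This is only a sketch: the helper name and the 30/70 percent thresholds are assumptions chosen for illustration, so tune them to your workloads.

<code bash>
#!/usr/bin/env bash
# Classify a utilization sample.
# $1 = sm util %, $2 = mem util %, $3/$4 = optional low/high thresholds.
# The function body is a subshell, so its variables stay local.
classify_util() (
  sm=$1
  mem=$2
  low=${3:-30}     # assumed threshold for "low"
  high=${4:-70}    # assumed threshold for "high"
  if [ "$sm" -lt "$low" ] && [ "$mem" -lt "$low" ]; then
    echo "cpu-or-io-bound"
  elif [ "$sm" -lt "$low" ] && [ "$mem" -ge "$high" ]; then
    echo "memory-bound"
  elif [ "$sm" -ge "$high" ]; then
    echo "gpu-busy-check-throttling"
  else
    echo "inconclusive"
  fi
)

classify_util 5 3
classify_util 10 85
</code>

A "gpu-busy" result does not mean the job is healthy; it means the next stop is the throttle reasons in step 2.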

==== 4.5 Network layer (RDMA / InfiniBand / NCCL) ====

<code bash>
ibstat                               # link state, rate, port GUID
ibstatus
iblinkinfo                           # full fabric link summary
ibping -S; ibping -G <guid>          # one node as server, another pings
perfquery                            # port counters; rerun for deltas
ibdiagnet                            # subnet diagnosis (run on a single node)

# NCCL — almost every multi-GPU AI failure shows up here
NCCL_DEBUG=INFO mpirun -np 8 ./my_nccl_test
# Important env vars to know:
#   NCCL_IB_HCA           - which HCAs to use
#   NCCL_SOCKET_IFNAME    - which Ethernet IFs to allow
#   NCCL_P2P_DISABLE      - disable peer-to-peer (debug only)
#   NCCL_IB_DISABLE       - force TCP fallback (debug only)
#   NCCL_TOPO_DUMP_FILE   - dump the topology NCCL discovered

# Standard health test
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
</code>

==== 4.6 Scheduling and capacity failures ====

<code bash>
# K8s: pod won't schedule on GPU node
kubectl describe pod <name>          # bottom Events section is the answer
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Slurm: queue full but nodes "look" idle
sinfo -R                             # any drained nodes?
scontrol show node <node> | grep -E "Reason|State|Gres"
sdiag                                # scheduler internals - cycle times, backlog
</code>

==== 4.7 BCM-level health checks ====

<code bash>
cmsh
[head]% monitoring
[head->monitoring]% measurable list           # available metrics
[head->monitoring]% healthcheck list          # built-in checks (mounts, gpu, ib, ...)
[head->monitoring]% latesthealthdata          # current health snapshot

[head]% device latesthealthdata -c gpu-nodes
[head]% events                                # cluster-wide event stream
</code>

==== 4.8 The triage flowchart to memorize ====

^ Symptom                                ^ First command                            ^ Then                                            ^
| ''nvidia-smi: command not found''      | ''which nvidia-smi''                     | Is driver installed?                            |
| "No devices were found"                | ''lsmod %%|%% grep nvidia''              | ''dmesg %%|%% grep -i nvidia''                  |
| GPU shows in lspci, not nvidia-smi     | ''dmesg -T %%|%% grep Xid''              | Driver / module / fallen off bus                |
| Container can't see GPU                | ''nvidia-container-cli info''            | Toolkit + runtime config                        |
| Pod Pending                            | ''kubectl describe pod''                 | Resources / labels / taints                     |
| Slurm job PD forever                   | ''squeue'' + ''scontrol show job''       | GRES type? QoS? Reservation?                    |
| Job runs but slow                      | ''nvidia-smi dmon''                      | Throttling reasons + topology                   |
| Multi-node training slow               | ''NCCL_DEBUG=INFO''                      | IB link state, topo, GPU<->NIC                  |
| ECC errors / Xid 48/63/64              | ''nvidia-smi -q -d ECC''                 | Drain node, page retirement                     |

----

===== 5. Docker & Container Runtimes =====

Docker isn't a standalone domain on the NCP-AIO blueprint, but it's woven through Installation/Deployment and Troubleshooting. The exam treats Docker as the **substrate** — what gets tested is "does this container have GPU access?" and "how do I diagnose it when something's wrong?" Know the runtime layer for standalone GPU servers (Docker) and for Kubernetes nodes (containerd).

==== 5.1 The install order — memorize it ====

  - **NVIDIA driver** (host kernel module)
  - **Docker** (or containerd for k8s nodes)
  - **NVIDIA Container Toolkit** (provides the ''nvidia'' runtime shim)
  - **Configure the runtime** so Docker or containerd registers the ''nvidia'' runtime

<code bash>
# Configure Docker to use the nvidia runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Configure containerd (for Kubernetes nodes)
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Verify the runtime is wired in
docker info | grep -i runtime
cat /etc/containerd/config.toml | grep -A5 nvidia
</code>

==== 5.2 NGC registry authentication ====

Pulling NVIDIA-curated containers (PyTorch, TensorFlow, CUDA, RAPIDS, NeMo, Triton) from ''nvcr.io'' requires auth. **This is the classic trip-up question on the exam.**

<code bash>
docker login nvcr.io
# Username: $oauthtoken           <-- LITERAL string, with the dollar sign
# Password: <your NGC API key>    <-- from ngc.nvidia.com

# Or non-interactive (for scripts / cmsh image builds)
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

# Then pull
docker pull nvcr.io/nvidia/pytorch:24.10-py3
docker pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
docker pull nvcr.io/nvidia/tensorrt:24.10-py3
</code>

The username is literally ''%%$oauthtoken%%'' as a string — not your NVIDIA email, not a shell variable. Memorize this.
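
The quoting matters in scripts. The short demonstration below shows why the non-interactive login uses single quotes: inside double quotes the shell would try to expand a variable named ''oauthtoken'', which is normally unset, and send an empty username. Nothing here contacts NGC; it only prints the two strings.

<code bash>
#!/usr/bin/env bash
# Make sure no variable of that name exists, as on a typical system.
unset oauthtoken

literal='$oauthtoken'      # single quotes: the shell leaves the text alone
expanded="$oauthtoken"     # double quotes: expands the unset variable to ""

printf 'single-quoted: [%s]\n' "$literal"
printf 'double-quoted: [%s]\n' "$expanded"
</code>

The first line prints the literal string NGC expects; the second prints empty brackets, which is what ''docker login'' would receive as the username.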

==== 5.3 Running GPU containers ====

<code bash>
# Verify GPU access from a container (the smoke test)
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# GPU selection flags — know all the forms
docker run --gpus all ...                            # all GPUs
docker run --gpus 2 ...                              # any 2 GPUs
docker run --gpus '"device=0,2"' ...                 # specific GPU indices
docker run --gpus '"device=GPU-abc123..."' ...       # by UUID
docker run --gpus '"device=MIG-d1..."' ...           # by MIG slice UUID

# Pass through capabilities and environment
docker run --gpus all \
  --env NVIDIA_VISIBLE_DEVICES=all \
  --env NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -v /data:/workspace/data \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/pytorch:24.10-py3
</code>

The ''--ipc=host'' and ''ulimit'' flags come up because PyTorch/NCCL multi-process workloads need shared memory and unlocked memory limits to work right.

==== 5.4 Troubleshooting containers ====

<code bash>
# Inventory
docker ps                            # running containers
docker ps -a                         # all containers, including stopped/exited
docker images                        # local image cache
docker system df                     # disk usage by images/containers/volumes

# Inspection
docker logs <container>              # stdout/stderr
docker logs -f <container>           # follow
docker logs --tail 100 <container>   # last N lines
docker logs --since 10m <container>  # time-bounded
docker stats                         # live CPU/mem/net/IO per container
docker stats --no-stream <container> # one-shot snapshot
docker inspect <container>           # full JSON config + state
docker inspect <container> | jq '.[0].State'   # exit code, OOMKilled, etc.
docker top <container>               # processes inside the container
docker exec -it <container> bash     # shell into it

# Daemon-level
docker events                        # live event stream from the daemon
journalctl -u docker -f              # daemon logs via systemd
journalctl -u containerd -f          # containerd logs (k8s nodes)
</code>

==== 5.5 Error messages to recognize on sight ====

^ Error                                                                  ^ Cause                                                                  ^
| ''%%could not select device driver "" with capabilities: [[gpu]]%%''   | NVIDIA Container Toolkit not installed/configured, or no GPU on host    |
| ''Failed to initialize NVML: Driver/library version mismatch''         | Host driver packages were updated but the old kernel module is still loaded, so userspace ''libnvidia-ml.so'' no longer matches it |
| ''nvidia-container-cli: initialization error: nvml error: driver/library version mismatch'' | Same mismatch seen through the toolkit: kernel module out of sync with userspace; reload the NVIDIA modules or reboot the host |
| ''OCI runtime create failed: ... unknown capability "nvidia"''         | Runtime not registered with Docker — rerun ''nvidia-ctk runtime configure'' |
| Container starts but ''nvidia-smi'' says ''No devices were found''     | ''NVIDIA_VISIBLE_DEVICES'' was overridden to ''none'' or ''void''       |
| ''docker: Error response from daemon: pull access denied for nvcr.io/...'' | Not logged into NGC, or API key expired                            |

==== 5.6 containerd specifics for Kubernetes ====

When a pod can't see a GPU on a node, the Docker-level commands don't apply — you're working with containerd and ''crictl''.

<code bash>
# Inspect what containerd sees
crictl info | jq '.config.containerd.runtimes'
crictl ps -a                         # all containers, including stopped
crictl logs <container-id>
crictl inspect <container-id>

# Smoke-test the runtime directly without k8s in the middle
ctr image pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
ctr run --rm --gpus 0 \
  nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
</code>

If ''ctr run --gpus'' works but a pod doesn't, the problem is in the device plugin or k8s config — not the runtime.

==== 5.7 Practice scenario ====

> A user reports that a TensorFlow container they pulled from NGC starts but immediately exits with no logs when run with ''%%docker run --gpus all%%''. Diagnose.

<code bash>
# 1. Was there any output at all?
docker logs <container-id>
docker inspect <container-id> | jq '.[0].State'
# Look at ExitCode and OOMKilled

# 2. Was the toolkit even invoked?
docker info | grep -i runtime
# Should list nvidia among runtimes

# 3. Can a minimal container see the GPU?
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# 4. If the minimal works but TF doesn't, it's app-level. Check the image's entrypoint.
docker run --rm --gpus all -it --entrypoint bash nvcr.io/nvidia/tensorflow:<tag>
# Inside, manually run the entrypoint to see what fails

# 5. Common gotchas:
#    - Missing --ipc=host (silent crash on first NCCL call)
#    - --memory limit below model size (OOMKilled, ExitCode=137)
#    - Mounted dataset path doesn't exist inside the container
</code>
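
Step 1 hinges on reading the exit code correctly. The helper below is a rough sketch of that triage, not an exhaustive table: ''explain_exit_code'' is a hypothetical name, and the hints are first suspicions only. Codes above 128 follow the usual shell convention of 128 plus the signal number.

<code bash>
#!/usr/bin/env bash
# explain_exit_code (hypothetical helper): map a container exit code to a
# first suspicion. Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  local code=$1
  case "$code" in
    0)   echo "clean exit: entrypoint finished, check the command" ;;
    125) echo "docker run itself failed: check daemon and flags" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in image" ;;
    137) echo "SIGKILL: check OOMKilled and --memory" ;;
    139) echo "SIGSEGV: native crash, rerun interactively" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error: read the logs"
      fi ;;
  esac
}

explain_exit_code 137   # prints: SIGKILL: check OOMKilled and --memory
explain_exit_code 143   # prints: killed by signal 15

# With a real container:
#   explain_exit_code "$(docker inspect -f '{{.State.ExitCode}}' <container-id>)"
</code>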

----

===== 6. Run:ai (NVIDIA Run:ai) =====

Run:ai is the Kubernetes-native GPU orchestration platform NVIDIA acquired in 2024. Full Run:ai requires a license, so lab access is harder to come by — but the **conceptual model is highly testable**. Memorize the hierarchy, quota model, and CLI cold; you'll get scenario questions even without ever clicking through the UI.

Note: the open-source **KAI Scheduler** is the scheduling engine extracted from Run:ai (Apache 2.0). You can install it on your BCM cluster's Kubernetes to practice the scheduling concepts (queues, fair-share, gang scheduling) without a Run:ai license.

==== 6.1 The conceptual model ====

The hierarchy, top to bottom:

  - **Tenant** — top-level isolation (usually one per organization)
  - **Cluster** — a Kubernetes cluster managed by Run:ai
  - **Department** — group of projects, gets a slice of cluster quota
  - **Project** — the multi-tenancy unit; users belong to projects, workloads run inside them
  - **Workload** — the actual thing running (workspace, training, inference, etc.)

==== 6.2 Quota model ====

This comes up constantly on the exam.

  * **Deserved quota** (a.k.a. //guaranteed//) — GPUs your project is guaranteed when it asks. **Cannot be preempted.**
  * **Over-quota** — opportunistic use of idle GPUs beyond your deserved quota. **Can be preempted** when another project wants its deserved share back.
  * **Over-quota priority** — which project gets leftovers first when multiple projects compete for over-quota.
  * **Over-quota weight** — proportional fair-share for over-quota allocation across projects with equal priority.

Example: Project A has deserved=4 GPUs, Project B has deserved=4 GPUs, and the cluster has 12 GPUs total. If A is using 10 (4 deserved + 6 over-quota), only 2 GPUs are idle. When B suddenly asks for its full 4, Run:ai hands over the 2 idle GPUs and preempts 2 GPUs' worth of A's over-quota workloads to give B what it's owed.
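
The reclaim arithmetic can be sketched in a few lines of shell. This is a simplified model for self-testing, not how the scheduler is implemented: it ignores priorities, node pools, and the fact that preemption happens in whole-workload units. Both function names are hypothetical.

<code bash>
#!/usr/bin/env bash
# gpus_to_preempt (hypothetical): GPUs that must be reclaimed from over-quota
# workloads so a project can receive a request that is within its deserved quota.
#   $1 = total GPUs in the cluster, $2 = GPUs currently in use, $3 = GPUs requested
gpus_to_preempt() {
  local total=$1 in_use=$2 request=$3
  local idle=$(( total - in_use ))
  local shortfall=$(( request - idle ))
  echo $(( shortfall > 0 ? shortfall : 0 ))
}

# over_quota_share (hypothetical): one project's share of the idle GPUs, given
# its over-quota weight and the sum of all competing weights (integer division).
over_quota_share() {
  local idle=$1 weight=$2 weight_sum=$3
  echo $(( idle * weight / weight_sum ))
}

gpus_to_preempt 12 10 4   # prints 2: only 2 idle, so 2 must be reclaimed
gpus_to_preempt 16 10 4   # prints 0: 6 idle, the request fits without preemption
over_quota_share 6 2 3    # prints 4: weight 2 out of a total weight of 3
</code>

The second call is the case to watch for in scenario questions: if enough GPUs are idle, nothing is preempted, no matter how far over quota another project is.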

==== 6.3 Workload types ====

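^ Type                       ^ Use case                                  ^ Default priority          ^ Preemptible by default ^
| **Workspace**              | Interactive dev (Jupyter, VSCode)         | High (''build'')          | **No**                 |
| **Training**               | Batch training jobs                       | Low (''train'')           | Yes                    |
| **Distributed Training**   | Multi-node PyTorch/TF/MPI/XGBoost         | Low (''train'')           | Yes (gang)             |
| **Inference**              | Model serving with autoscaling            | Very high (''inference'') | **No**                 |

**Default priority order, low → high:** Training < Workspace < Inference. Training workloads are the ones that can run over quota and be preempted. Workspaces and inference are non-preemptible by default and must fit within the project's deserved quota. A workspace can opt into a preemptible priority to use over-quota GPUs, at the cost of becoming evictable.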

**Gang scheduling** for distributed training: all worker pods start together or none do. No partial-start states.
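
A toy model of the all-or-nothing rule, for intuition only. ''gang_admit'' is a hypothetical function that looks at the total free GPU count and ignores per-node placement, which a real scheduler also has to satisfy.

<code bash>
#!/usr/bin/env bash
# gang_admit (hypothetical): prints how many workers start under gang scheduling.
#   $1 = free GPUs, $2 = number of workers, $3 = GPUs per worker
gang_admit() {
  local free_gpus=$1 workers=$2 gpus_per_worker=$3
  if [ $(( workers * gpus_per_worker )) -le "$free_gpus" ]; then
    echo "$workers"   # the whole gang fits, so every worker starts
  else
    echo 0            # one worker short means nobody starts
  fi
}

gang_admit 8 4 2   # prints 4: 8 GPUs needed, 8 free
gang_admit 6 4 2   # prints 0: 8 GPUs needed, only 6 free
</code>

Without gang scheduling the second case would start 3 workers and leave them holding 6 GPUs while waiting for a fourth that may never arrive.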

==== 6.4 The runai CLI ====

<code bash>
# Authentication and context
runai login
runai config project <project-name>
runai list projects
runai list clusters

# Submitting — know the variants
runai submit <name> --interactive --image jupyter/base-notebook --gpu 1
runai submit <name> --image pytorch/pytorch --gpu 2 \
  --command -- python train.py
runai submit-dist pytorch <name> --workers 4 --gpu 1 \
  --image nvcr.io/nvidia/pytorch:24.10-py3

# GPU request: whole, by memory, or by share
runai submit <name> --gpu 1                  # whole GPU
runai submit <name> --gpu-memory 4G          # fraction by memory (4 GB)
runai submit <name> --gpu 0.5                # fraction by share (50%)

# Targeting hardware
runai submit <name> --node-pools "h100-pool"
runai submit <name> --node-type "dgx-a100"

# Inspecting
runai list jobs
runai describe job <name>
runai logs <name>
runai logs -f <name>
runai bash <name>                            # exec into the pod
runai port-forward <name> --port 8888

# Lifecycle
runai delete job <name>
runai suspend <name>
runai resume <name>
</code>

The difference between ''%%--gpu 1%%'', ''%%--gpu-memory 4G%%'', and ''%%--gpu 0.5%%'' is a **classic exam question**: whole GPU vs fraction by memory vs fraction by share.
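
One way to keep the two fractional forms straight: a fixed memory request is the same size on every GPU, while a share scales with the card. The sketch below assumes a share maps to the same proportion of the GPU's memory. The function names and GPU sizes are illustrative, not Run:ai output.

<code bash>
#!/usr/bin/env bash
# share_to_gb (hypothetical): memory that a fractional share corresponds to
#   $1 = share (0 to 1), $2 = GPU memory in GB
share_to_gb() {
  awk -v f="$1" -v total="$2" 'BEGIN { printf "%.1f\n", f * total }'
}

# gb_to_share (hypothetical): share that a fixed memory request corresponds to
#   $1 = requested memory in GB, $2 = GPU memory in GB
gb_to_share() {
  awk -v m="$1" -v total="$2" 'BEGIN { printf "%.2f\n", m / total }'
}

share_to_gb 0.5 40   # prints 20.0: half of a 40 GB GPU
share_to_gb 0.5 80   # prints 40.0: same share, twice the memory on an 80 GB GPU
gb_to_share 4 40     # prints 0.10: a 4 GB request is a tenth of a 40 GB GPU
</code>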

==== 6.5 Fractional GPU is NOT MIG ====

^ Property              ^ MIG                            ^ Run:ai Fractional GPU                 ^
| Isolation             | Hardware (memory, SMs, L2)     | Software (runtime-enforced)            |
| Supported GPUs        | A100, A30, H100, H200, B200    | Any NVIDIA GPU                         |
| Slice sizes           | Fixed profiles (1g.10gb, etc.) | Arbitrary (GB or fractional share)     |
| Reconfiguration       | Requires draining the GPU      | Per-workload, no GPU reset             |
| Memory protection     | Hardware-enforced              | Process-level via runtime              |

If a question asks "tenant needs hardware-isolated GPU partitioning across an A100" → **MIG**. If it asks "team needs to run lots of small Jupyter notebooks on a V100/T4/L4" → **Run:ai Fractional GPU** (those GPUs don't support MIG).
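
The decision rule can be written down as a small chooser. It is a study aid built only from the GPU list in the table above, so treat the model list as incomplete. ''supports_mig'' and ''pick_partitioning'' are hypothetical names.

<code bash>
#!/usr/bin/env bash
# supports_mig (hypothetical): succeeds for the MIG-capable models listed in the table.
supports_mig() {
  case "$1" in
    A100|A30|H100|H200|B200) return 0 ;;
    *) return 1 ;;
  esac
}

# pick_partitioning (hypothetical): $1 = GPU model, $2 = "yes" if hardware isolation is required
pick_partitioning() {
  local gpu=$1 needs_hw_isolation=$2
  if [ "$needs_hw_isolation" = yes ]; then
    if supports_mig "$gpu"; then
      echo "MIG"
    else
      echo "none: $gpu has no MIG support"
    fi
  else
    echo "Run:ai Fractional GPU"
  fi
}

pick_partitioning A100 yes   # prints: MIG
pick_partitioning T4 no      # prints: Run:ai Fractional GPU
pick_partitioning V100 yes   # prints: none: V100 has no MIG support
</code>

The third case is the trap answer: a requirement for hardware isolation on a GPU without MIG cannot be met by fractional GPU, because its isolation is software-enforced.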

==== 6.6 Node pools ====

Node pools segment the cluster by hardware capability.

<code bash>
# Typical pools in a heterogeneous cluster:
#   a100-pool       - A100 nodes, MIG-enabled
#   h100-pool       - H100 nodes
#   l40-pool        - inference-grade GPUs
#   cpu-pool        - CPU-only nodes for data prep

# Projects are granted access to specific pools, with separate quotas per pool.
# Workloads target pools at submission time.
runai submit train-large --node-pools "h100-pool" --gpu 8
runai submit prep-data --node-pools "cpu-pool" --cpu 16
</code>

==== 6.7 Architecture ====

  * **Run:ai Control Plane** — the management layer. Can be **SaaS** (hosted by NVIDIA) or **self-hosted**. One Control Plane can manage many clusters.
  * **Run:ai Cluster components** — installed via Helm onto each managed Kubernetes cluster:
    * The scheduler (the engine NVIDIA open-sourced as KAI Scheduler)
    * The Run:ai operator
    * The fractional GPU runtime
    * Monitoring agents
    * Workload controllers (one per workload type)

Integrations to recognize: SSO via SAML/OIDC, LDAP, Kubeflow, MLflow, Argo Workflows, Ray, OpenShift.

==== 6.8 KAI Scheduler — the open-source path ====

Since full Run:ai requires a license, the open-source **KAI Scheduler** is the practical way to drill the scheduling concepts on your own cluster.

<code bash>
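# Install on an existing Kubernetes cluster.
# The chart source has moved between releases, so confirm the repo or OCI path
# in the KAI Scheduler README before running this.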
helm repo add nvidia-kai-scheduler https://nvidia.github.io/KAI-Scheduler
helm install kai-scheduler nvidia-kai-scheduler/kai-scheduler \
  -n kai-scheduler --create-namespace

# Verify
kubectl get pods -n kai-scheduler
</code>

What you can practice with KAI alone (no Run:ai license needed):

  * Queue hierarchies and weighted fair-share
  * Gang scheduling for distributed pods
  * Pod priority and preemption
  * Resource quotas across queues
  * All of the above on CPU-only pods, if you have no GPUs to spare

==== 6.9 Practice scenario ====

> A data science team complains their workspaces keep getting killed at 3 AM when training jobs run. Diagnose.

This is usually **expected behavior**, not a bug. Workspaces are non-preemptible by default, so workspaces that get evicted are almost certainly running as preemptible, over-quota workloads. When another project's nightly training jobs claim GPUs that project is owed as deserved quota, Run:ai reclaims them from the over-quota workspaces. Options:

  - Increase the data science project's **deserved quota** so the workspaces fit inside it and can run non-preemptible.
  - Move long-running, unattended work into **training-type workloads** with checkpointing, so a preemption costs a restart rather than lost interactive state.
  - Add **node pool** segregation: a dedicated pool for interactive work, with its own quota, that batch training jobs don't compete for.
  - Schedule the batch training jobs for a **different time window** so they don't collide with the interactive sessions.

----

===== Final exam prep tips =====

  * **Practice in cmsh blind.** The web UI is not what's being tested. Build muscle memory for ''device'', ''category'', ''softwareimage'', ''user'', ''monitoring'', and ''commit''.
  * **Know the difference** between ''imageupdate'' (live sync) and ''reinstall'' (full PXE) — that's a classic question.
  * **Memorize the GPU Operator components in order** and what each one does. If one is broken, what symptom appears?
  * **Know MIG strategies** (''single'' vs ''mixed'') and how to request slices in both Kubernetes (''nvidia.com/mig-1g.10gb'') and Slurm (''%%--gres=gpu:1g.10gb:1%%'').
  * **Recognize Xid codes** at least for 13, 31, 48, 63, 79, 94/95.
  * **''dcgmi diag -r {1,2,3,4}''** levels — the time and depth differences come up.
  * **Slurm reason codes** — ''Resources'', ''Priority'', ''ReqNodeNotAvail'', ''AssocGrpGRESLimit'', ''QOSMaxGRESPerUser'', ''ReqGresTypeNotAvail''.
  * **NCCL env vars** — ''NCCL_DEBUG'', ''NCCL_IB_HCA'', ''NCCL_SOCKET_IFNAME''.
  * **NGC auth — username is literally ''%%$oauthtoken%%''**, password is the NGC API key. This trips people up every exam cycle.
  * **Docker ''%%--gpus%%'' flag variants** — ''all'', ''2'', ''%%"device=0,2"%%'', by UUID, by MIG UUID.
  * **The "could not select device driver" error** = NVIDIA Container Toolkit missing or no GPU on host.
  * **Run:ai quota model**: //deserved// is guaranteed, //over-quota// is preemptible. Default priority is Inference > Workspace > Training, and only training is preemptible by default.
  * **Fractional GPU ≠ MIG** — Fractional is software-enforced and works on any NVIDIA GPU; MIG is hardware-enforced and only on A100/H100/etc.
  * **runai submit flags**: ''%%--gpu 1%%'' (whole), ''%%--gpu-memory 4G%%'' (fraction by memory), ''%%--gpu 0.5%%'' (fraction by share).

Good luck — build the labs, break them on purpose, fix them, repeat.
