Certification: NVIDIA-Certified Professional: AI Operations (NCP-AIO)
Format: 30 multiple-choice questions + 3 hands-on lab exercises, 120 minutes total
Cost: $500 USD, online proctored
Recommended experience: 2–3 years of data center work with NVIDIA hardware.
Plan length: 8 weeks @ ~8–10 hours/week (≈70 hours total)
| Topic Area | % of Exam | Topics Covered |
|---|---|---|
| Installation and Deployment | 31% | * Describe the Mission Control toolkit * Use BCM’s Base View interface to monitor cluster performance, resource utilization, and node health in real time. * Manage job scheduling and resource allocation using BCM’s workload manager (e.g., SLURM or Kubernetes) * Apply patches, update firmware, and synchronize software images across cluster nodes using BCM * Administer user accounts, roles, and permissions to ensure secure access to the cluster using BCM * Configure and monitor network settings for cluster nodes, DPUs, and switches using BCM * Diagnose and resolve cluster issues, such as job failures, node outages, or resource bottlenecks, using BCM. * Use BCM to organize and configure compute nodes into categories based on hardware or workload requirements. * Using BCM, maintain documentation and generate reports on cluster usage, performance, and issues. * Install and initialize Kubernetes on NVIDIA hosts using BCM * Deploy DOCA Services on DPU Arm * Install Run:ai * Install Slurm |
| Administration | 23% | * Administer Slurm cluster. * Describe data center architecture for AI Workloads * Administer Run:ai * Administer Kubernetes * Configure MIG |
| Workload Management | 23% | * Deploy inference workloads with Kubernetes * Deploy inference workloads with Run:ai * Deploy training workloads with Slurm * Deploy training workloads with Run:ai * Use system management tools to troubleshoot issues * Allocate resources between teams with Run:ai, Slurm and Kubernetes * Deploy containers from NGC |
| Troubleshooting and Optimization | 23% | * Troubleshoot Docker * Troubleshoot the fabric manager service for NVLink and NVSwitch systems * Troubleshoot Base Command Manager * Troubleshoot Magnum IO components * Troubleshoot storage performance * Troubleshoot the deployment of a container from NGC |
Each week is built around a tight cluster of related exam tasks. Every module has four parts:
If you do not have hardware, NVIDIA LaunchPad provides free, time-boxed access to real NVIDIA stacks (BCM, DGX, BlueField, Run:ai). Pair it with a small cloud GPU instance (one A100 or L40S is enough for MIG and single-node practice).
The exam's four domains and approximate weights:
| Domain | Weight |
|---|---|
| Installation and Deployment | 31% |
| Administration | 23% |
| Workload Management | 23% |
| Troubleshooting and Optimization | 23% |
Before starting, confirm your baseline. If any of these feel shaky, address them first.
- systemd, journalctl, ssh, scp, package management on Ubuntu and RHEL families.
- ip, ss, tcpdump, basic InfiniBand vocabulary (ibstat, subnet manager).
- docker logs, docker exec.
- nvidia-smi, NVIDIA driver vs CUDA toolkit vs Container Toolkit (the three are not the same).

Pre-work resources:
Set up your environment now:
- commands.md you keep adding to. Re-running your own past commands during the lab portion of the exam is the fastest way to recover under pressure.

If you have DC Lab Sim installed, treat it as a high-value but partial supplement to this plan. Important context:
Where it directly supports this AIO plan:
- sinfo, squeue, scontrol, sbatch, and srun simulators until invocation is reflexive.
- nvidia-smi -mig workflows, docker, and enroot commands.
- nvidia-smi (especially -q, dmon, nvlink -s, topo -m), dcgmi diag, ibstat, iblinkinfo, and ipmitool. Run as many of the 32 scenarios as you can; even those framed for AII bring-up will sharpen your diagnostic instinct.

Where it is silent on AIO content (do not rely on the simulator for these; use LaunchPad, real labs, and docs):
- Run:ai (runai CLI, projects, fractional GPUs, multi-tenant quotas)
- Kubernetes (kubectl, GPU Operator, Triton, NIM)

In the weeks below, look for Simulator drills callouts where it earns its keep.
You have three complementary environments:
Primary: BCM-managed hybrid cluster.
- Kubernetes deployed through BCM (cm-kubernetes-setup).
- AWS-deployed compute nodes (p4d.24xlarge for A100, p5.48xlarge for H100, smaller g5/g6 for cheap practice).

This is where Weeks 2, 3, 4, 5, and 7 happen.
Secondary: External A100 cluster with Run:ai pre-installed.
- Submit workloads (runai submit), use fractional GPU and (potentially) MIG profile requests, observe your own workloads, and navigate whatever project/department structure your role exposes.

This is where Week 6 hands-on workload submission happens, along with as much administration as your role permits.
Tertiary: Two DGX Sparks running MicroK8s, connected back-to-back via QSFP112 in RoCE mode. GB10 Grace Blackwell, 128 GB unified memory each, ConnectX-7 at 200 Gb/s. The Spark's CX-7 is firmware-locked to Ethernet (per the official DGX Spark User Guide: “The DGX Spark CX-7 ports support ethernet configuration only”). The Sparks give you real RDMA, GPUDirect RDMA, and NCCL diagnostics over RoCE — useful for cheap K8s iteration in Week 5 and Week 8 RoCE-flavored troubleshooting. Explicitly not supported by Run:ai (DGX Spark is on the Run:ai unsupported list per official docs).
This covers the bulk of the plan with real hardware. Below is what fits where, and where to substitute.
| Plan content | Run on | Notes |
|---|---|---|
| Week 2–3 BCM (Base View, cmsh, categories, images, users, reports) | BCM head node + AWS-deployed compute nodes | Best possible setup. Use cheap g5.xlarge or g6.xlarge nodes for image and category practice; bring up p4d/p5 only for the workload-heavy exercises. |
| Week 4 Slurm install + multi-node training | BCM + AWS GPU nodes | Provision two GPU nodes, install Slurm with Enroot/Pyxis through BCM's Slurm role. Cross-node NCCL works over AWS EFA. |
| Week 5 Kubernetes admin + inference | Primary: BCM-managed AWS K8s cluster (deployed via cm-kubernetes-setup). Optional alternative: two-Spark MicroK8s for cheap, fast iteration on basic kubectl/GPU Operator concepts. | Do the exam-relevant K8s administration work — BCM-deployed K8s, GPU Operator manual install, Triton/NIM inference deployments, ResourceQuotas — on the BCM AWS cluster. Use Sparks only for tinkering on concepts you can test cheaply. |
| Week 6 Run:ai workload submission, project navigation, fractional GPU, MIG profile requests | External A100 Run:ai cluster | Submit workloads via runai submit, navigate UI within your role's visibility, exercise fractional GPU and (if cluster admin has enabled it) Dynamic MIG profile requests. Request observer/department-admin access from the cluster owner if possible — closes most of the admin-task gap. |
| Week 6 Run:ai install + cluster-level admin (creating departments, projects, quotas, node roles) | LaunchPad Run:ai lab + documentation study | The install task is on the exam. You cannot install Run:ai on your BCM AWS cluster (no license) and you cannot install on the A100 cluster (already installed by someone else). LaunchPad is the only hands-on install venue. Document the install flow as your own runbook from BCM/Run:ai docs first, then execute it once on LaunchPad. |
| Week 7 MIG (low-level nvidia-smi mig, GPU Operator MIG Manager mode switching) | BCM-managed AWS K8s cluster (size at least one worker as p4d.24xlarge for A100 or p5.48xlarge for H100) | Needs shell access to the GPU node and admin access to the K8s cluster, both of which you have on your BCM AWS cluster. |
| Week 7 Run:ai-orchestrated MIG (--mig-profile workload requests) | External A100 Run:ai cluster, only if the cluster admin has enabled Dynamic MIG on a node | Otherwise covered on LaunchPad alongside the Run:ai install lab. |
| Week 7 NGC container deployment | Anywhere | NGC steps work anywhere. The BCM AWS cluster is convenient since you're already there. |
| Week 8 nvidia-smi / dcgmi / Docker troubleshooting | Either Spark or AWS | Both work. Spark is closer at hand. |
| Week 8 RDMA diagnostics, NCCL over RDMA, GPUDirect RDMA | Two-Spark RoCE link for the diagnostic command set; AWS EFA on the BCM cluster for a different RDMA flavor | Sparks: real ibstat/iblinkinfo/ib_write_bw output, NCCL reporting NET/IB over RoCE. AWS EFA: NCCL works (NCCL_PROTO, EFA-specific debug), but ibstat won't show Link layer: InfiniBand because EFA is its own transport. Neither is native IB; for that you need LaunchPad. |
For these, plan to use NVIDIA LaunchPad (request the labs early — they queue), or supplement with the simulator and documentation walkthroughs.
AWS EFA does not show Link layer: InfiniBand in ibstat, has no Subnet Manager, and uses different diagnostic tools (fi_info, aws-efa-installer checks). DGX Spark is RoCE (RDMA over Ethernet), which is also not native IB. For LIDs, OpenSM, ibping, ibhosts, and ibnetdiscover topology discovery, you need LaunchPad DGX environments with real Quantum-2 IB. Cover the IB-specific topics from documentation; for hands-on, LaunchPad is the answer.
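To make the link-layer distinction concrete, here is a minimal Python sketch that reads ibstat-style text and reports what kind of fabric each port is on. The sample output, the device name mlx5_0, and the helper names parse_ibstat and fabric_kind are assumptions made for illustration; real ibstat output carries more fields and may differ by driver version.

```python
import re

# Illustrative ibstat output; device name and values are made up for this sketch.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4129
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 0
        SM lid: 0
        Link layer: Ethernet
"""


def parse_ibstat(text):
    """Return one dict per port, holding the CA name, port number, and key/value fields."""
    ports = []
    ca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"CA '(.+)'", line)
        if match:
            ca = match.group(1)
            current = None
            continue
        match = re.match(r"Port (\d+):", line)
        if match:
            current = {"ca": ca, "port": int(match.group(1))}
            ports.append(current)
            continue
        # Only collect fields that belong to a port section.
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def fabric_kind(port):
    """Classify a port by its Link layer field (a rough heuristic, not a full diagnosis)."""
    layer = port.get("Link layer")
    if layer == "InfiniBand":
        return "native InfiniBand"
    if layer == "Ethernet":
        return "RoCE (RDMA over Ethernet)"
    return "unknown"


for port in parse_ibstat(SAMPLE_IBSTAT):
    print(port["ca"], port["port"], port["State"], fabric_kind(port))
```

On a Spark the expected classification is the RoCE branch; a port that reports InfiniBand is what you would see on a LaunchPad DGX fabric.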
DGX Spark uses an ARM Grace CPU. Most modern NGC containers ship multi-arch images, but verify before each lab — docker manifest inspect <image> shows whether linux/arm64 is present. The symptom of a missing ARM variant is exec format error at container start. When in doubt for x86-only containers, fall back to your AWS-deployed nodes.
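The architecture check can be scripted. The sketch below parses the JSON that docker manifest inspect prints for a multi-arch image and reports whether linux/arm64 is listed. The sample manifest and the helper names platforms and supports are assumptions for illustration; a single-arch image returns a manifest without a platform list, in which case this check is inconclusive rather than negative.

```python
import json

# Illustrative manifest list, trimmed to the fields this sketch reads.
SAMPLE_MANIFEST = json.dumps({
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
        {"platform": {"architecture": "arm64", "os": "linux"}},
    ],
})


def platforms(manifest_json):
    """Return the set of os/architecture strings listed in a manifest list."""
    doc = json.loads(manifest_json)
    return {
        f'{entry["platform"]["os"]}/{entry["platform"]["architecture"]}'
        for entry in doc.get("manifests", [])
        if "platform" in entry
    }


def supports(manifest_json, platform="linux/arm64"):
    """True if the manifest list advertises the given platform."""
    return platform in platforms(manifest_json)


print(sorted(platforms(SAMPLE_MANIFEST)))
print("arm64 available:", supports(SAMPLE_MANIFEST))
```

In practice you would feed it the captured output of docker manifest inspect for the image you plan to run on the Spark.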
Before Week 1, confirm each of these works once so you don't lose a study session to setup:
BCM and AWS K8s cluster (primary environment):
- cmsh on the BCM head node returns a prompt; you can list categories and nodes.
- The cm-kubernetes-setup wizard has run successfully; kubectl get nodes from the K8s control-plane node shows control plane and all GPU workers Ready.
- kube-system and gpu-operator namespace pods are Running. kubectl describe node <gpu-worker> shows nvidia.com/gpu under Capacity and Allocatable.
- A test pod (nvcr.io/nvidia/cuda:12.x-base requesting nvidia.com/gpu: 1) schedules, runs nvidia-smi, and exits cleanly.
- A default StorageClass is set; a test PVC reaches Bound state.
- For MIG work, at least one worker is a p4d.24xlarge (A100) or p5.48xlarge (H100). G-series instances do not support MIG.

External A100 Run:ai cluster (secondary environment):
- runai login succeeds and the CLI is configured to point at the cluster.
- runai list projects shows at least one project you have submit rights on.
- A test workload (runai submit test --image nvcr.io/nvidia/cuda:12.x-base -g 1 -- nvidia-smi) schedules, runs, and completes; you can read the logs with runai logs test.
- You know whether Dynamic MIG is enabled on the cluster, and so whether you can run --mig-profile workload requests there or need LaunchPad for that exercise.

LaunchPad (for install + admin tasks you can't do locally):
Sparks (tertiary environment):
- microk8s status shows ready; microk8s kubectl get nodes shows both nodes; microk8s kubectl describe node shows GPU resources advertised.
- ibstat shows the connected port State: Active, Physical state: LinkUp, Link layer: Ethernet. (Spark CX-7 is locked to Ethernet; that's expected, not a problem.) ibdev2netdev confirms which roce/enp interfaces are in use.
- An ib_write_bw test between the two Sparks completes and reports a sane bandwidth number, confirming RDMA over RoCE is working.
- An nccl-tests all_reduce_perf run between the two Sparks completes with NCCL_DEBUG=INFO showing NET/IB transport selected (NCCL labels RoCE-mode RDMA as IB; that's expected).

General:
- Your NGC API key is configured (ngc config set).

Exam tasks covered:
This week is conceptual. You are building the mental map that every later task hangs from.
This week is mostly reading, but produce two artifacts:
Exam tasks covered:
- Categories (e.g. dgx-a100, cpu-only) and node groups (administrative groupings).
- cmsh (CLI) and Base View (web GUI) both expose the same model.
- ipmi and out-of-band networks for power and console.

In LaunchPad or a local BCM install:
- In cmsh: list categories (category list), show the image bound to your default category (category use default; show), and list nodes assigned to it.
- Create a category named gpu-training, clone the image of an existing category into it, and assign one node to the new category. Reboot the node and confirm it provisions with the new image.
- List the networks (network list). Identify which network the management interface uses vs. the high-speed fabric. If a DPU or BlueField device is present, show its interface objects.

- What is the difference between cmsh and cmgui/Base View? When would you reach for one over the other?
- ssh works. Where do you check next, and what does that ambiguity usually mean?

Exam tasks covered:
This week is where BCM stops feeling like a GUI and starts feeling like an operational platform.
- Updating software images (cm-chroot-sw-img), then rebooting nodes to pick up the new image. Coordinating firmware updates with BCM where applicable (DGX firmware, BMC firmware, NIC firmware): when BCM orchestrates and when you drop to vendor tooling like nvfwupd or mlxfwmanager.
- BCM logs (/var/log/cmdaemon and friends).
- cmd (CMDaemon) on the head node, cmd on compute nodes, the message bus between them. What it means when a node shows INSTALLER_CALLINGINIT vs. UP vs. DOWN vs. CLOSED. The request and event log streams.
- Run cmsh -c "main showprofile" for a real installation; reading a real profile teaches you what the permission model can express.

- Install a small package (e.g. htop) into a software image using cm-chroot-sw-img. Reboot a node assigned to that image and confirm the package is present.
- Stop cmd on a single compute node. Watch how Base View represents the loss. Restart it. Inspect /var/log/cmdaemon on both head and compute to see the reconnect.
- Put a node into the CLOSED state and bring it back. Document the exact steps in your commands.md.

- A node is stuck in INSTALLING. List five things you'd check, in order.
- cmd is healthy, but Base View shows half the cluster as down. What single networking fault explains this most often?

Exam tasks covered:
Slurm is the historical heavyweight in HPC and remains the default choice for multi-node training. The exam expects fluency in submission, scheduling, and GRES configuration for GPUs.
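As a reference point for the submission and GRES fluency described above, the following Python sketch renders a minimal sbatch script. The function name render_sbatch, its defaults, and the example job are hypothetical; the #SBATCH options themselves (--nodes, --ntasks-per-node, --cpus-per-task, --gres, --time, --partition) are standard Slurm options, but the right values depend on your cluster's slurm.conf and gres.conf.

```python
def render_sbatch(job_name, command, nodes=1, gpus_per_node=1,
                  ntasks_per_node=1, cpus_per_task=4,
                  time_limit="00:10:00", partition=None):
    """Build the text of a small sbatch script (illustrative defaults)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        # --gres is a per-node request: gpu:8 means eight GPUs on every node.
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
    ]
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    # srun launches the job step across the allocated tasks.
    lines += ["", f"srun {command}"]
    return "\n".join(lines) + "\n"


print(render_sbatch("gpu-check", "nvidia-smi", gpus_per_node=1, partition="debug"))
```

Writing the rendered text to a file and passing it to sbatch is the intended use; keeping the generator in your notes makes it quick to vary node and GPU counts between labs.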
- slurmctld (controller, on the head node or a dedicated scheduler node), slurmd (one per compute node), slurmdbd (accounting database), munge for auth.
- gres.conf, Gres=gpu:8 in node definitions, requesting with --gres=gpu:N or --gpus=N, and the difference between those flags. CUDA device visibility is set by Slurm via CUDA_VISIBLE_DEVICES.
- Partitions divide the nodes (e.g. train, inference, debug), QOS layers priority/limits on top, accounts/users let you do showback and quota. Fairshare and the multifactor priority plugin.
- How SchedulerParameters knobs affect packing.
- sbatch script structure, srun for interactive and step launches, salloc for interactive allocations, --ntasks-per-node, --cpus-per-task, and how those interact with NCCL and mpirun/srun on multi-node jobs.

- Confirm sinfo, squeue, and scontrol show nodes work end-to-end.
- Submit a job that requests --gres=gpu:1 and prints nvidia-smi from inside the allocation. Confirm only the requested GPUs are visible.
- Create two partitions: interactive (short, high priority) and train (long, lower priority). Configure a QOS that limits the interactive partition to jobs ≤ 1 hour.
- Import an NGC container with enroot import. Submit a single-node training job using srun --container-image (Pyxis) that runs a tiny model for one epoch. Confirm GPU utilization in nvidia-smi.
- Run a multi-node job with nccl-tests. This is the muscle memory you want for the lab portion of the exam.
- Stop slurmd on one node, observe sinfo showing the node as down, inspect slurmctld logs, restart, verify recovery.

Before moving to the next week, run these in DC Lab Sim until each feels automatic:
- sinfo: read partition state, identify drained vs. down vs. allocated nodes.
- squeue and squeue -u <user>: interpret job state codes (PD, R, CG, CD, F).
- scontrol show node <nodename>, scontrol show job <jobid>, scontrol update.
- sbatch and srun invocations with --gres=gpu:N, --ntasks-per-node, --time.
- Reading pending reasons from squeue -l and translating them to fixes.

- A job requests --gres=gpu:8 and stays PENDING forever. List four causes and the diagnostic command for each.
- Why is scontrol reconfigure not always sufficient after editing slurm.conf, and when do you need a full systemctl restart slurmctld?

Exam tasks covered:
For inference and most modern serving stacks, Kubernetes is the default. NVIDIA's GPU Operator is the connective tissue between vanilla K8s and a GPU node.
- How BCM deploys kubeadm-based clusters, and how nodes are flipped between Slurm and K8s roles (or run both via co-existence).
- The nvidia.com/gpu resource, requesting whole GPUs, requesting MIG slices via the MIG manager, the nvidia.com/mig-1g.5gb-style resources.
- kubectl muscle memory (get, describe, logs, exec, top), namespaces and RBAC, quotas, taints and tolerations to dedicate GPU nodes.
- docs.nvidia.com.

Your environment: Run the exam-relevant work (BCM-deployed K8s, manual GPU Operator install, Triton/NIM, ResourceQuota) on your BCM-managed AWS K8s cluster (deployed via cm-kubernetes-setup). Your Sparks/MicroK8s are still useful for cheap iteration on basic concepts (kubectl reps, simple pod manifests), but the AWS cluster is what matches the exam's BCM-centric framing. The same AWS cluster will host the MIG work in Week 7, so investing in it here pays off.
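For the GPU pod and quota exercises this week, it helps to have the two manifests in a form you can regenerate quickly. The sketch below builds them as Python dictionaries and prints JSON, which kubectl apply -f accepts. The helper names, the object names cuda-smoke-test and gpu-quota, and the image tag (the plan's own 12.x placeholder, which you must replace with a real tag) are assumptions for illustration.

```python
import json


def gpu_pod(name, namespace, image, gpus=1):
    """Pod that requests whole GPUs through the nvidia.com/gpu extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "cuda",
                "image": image,
                "command": ["nvidia-smi"],
                # Extended resources are requested via limits.
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }],
        },
    }


def gpu_quota(namespace, max_gpus, name="gpu-quota"):
    """ResourceQuota capping the total GPUs requested in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        # Quota on an extended resource uses the requests.<resource> key.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


pod = gpu_pod("cuda-smoke-test", "inference-team-a",
              "nvcr.io/nvidia/cuda:12.x-base")  # placeholder tag from the plan
quota = gpu_quota("inference-team-a", 4)
print(json.dumps(quota, indent=2))
print(json.dumps(pod, indent=2))
```

With the quota applied, a pod built with gpus=5 in the same namespace should be rejected at admission, which is the behaviour the lab asks you to confirm.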
- Confirm kubectl get nodes shows both Ready.
- Install the GPU Operator into the gpu-operator namespace. Verify with kubectl describe node that GPU resources are advertised.
- Run a test pod (nvidia/cuda:12.x-base) requesting nvidia.com/gpu: 1. Confirm nvidia-smi works from inside.
- Create a namespace inference-team-a, set a ResourceQuota capping it at 4 GPUs, and verify a pod requesting 5 is rejected.
- Cordon a GPU node, then uncordon it.

- A pod requests nvidia.com/gpu: 1 and stays Pending. Walk through the diagnostic order: scheduler events → node resources → device plugin health → driver health.

Exam tasks covered:
Run:ai is the orchestration layer that NVIDIA acquired in 2024 and integrated tightly into the Mission Control stack. It sits on top of Kubernetes and provides the multi-tenant GPU sharing and fractional GPU features that vanilla K8s does not.
- The CLI (runai submit) and the web UI both produce K8s objects via Run:ai's CRDs.
- run-ai-docs.nvidia.com: start with the Workloads, CLI Reference, and (separately) the Install Using Base Command Manager page.
- cm-kubernetes-setup wizard improvements that streamline Run:ai control plane installation.
- KAI Scheduler (github.com/NVIDIA/KAI-Scheduler): optional supplementary reading for the underlying scheduling concepts (gang, fairshare, preemption, reclaim). Not a substitute for Run:ai itself but useful for the why behind Run:ai's scheduling behavior.

Your environment for this week is split:
- External A100 Run:ai cluster (user-level access): workload submission, project navigation, fractional GPU, possibly Dynamic MIG profile requests.
- LaunchPad Run:ai lab: install task end-to-end, plus any admin tasks (creating departments, projects, quotas, configuring node roles) that your role on the A100 cluster doesn't allow.
- Documentation study: read the BCM-Run:ai install flow as a runbook before you touch LaunchPad, so the LaunchPad session is execution rather than first-time learning.
Track 1 — Workload submission and use (on the external A100 Run:ai cluster):
- Run runai login and confirm runai list projects shows your assigned projects. Submit a test training job (runai submit hello --image nvcr.io/nvidia/pytorch:24.x -g 1 -- python -c "import torch; print(torch.cuda.is_available())"). Read runai logs hello.
- Launch a notebook with runai submit --interactive --jupyter. Connect to it via the URL Run:ai prints. Confirm GPU access from inside the notebook.
- Submit two workloads, each with --gpu-memory 20G, against a single 80GB A100. Confirm both run concurrently. Inspect with runai describe job <name> to see how Run:ai represents the partial allocation.
- Submit a workload with --mig-profile 2g.20gb and confirm Run:ai configures the slice. If Dynamic MIG isn't enabled on this cluster, skip this step here and do it on LaunchPad in Track 2.
- Deploy an inference workload (runai submit inference-test --image <triton-image> --port 8000:8000 --service-type loadbalancer). Send a sample inference request. Observe the Run:ai inference workload type's lifecycle (deployment vs. job).

Track 2 — Install and admin (on LaunchPad):
- Write the install runbook first: the cm-kubernetes-setup wizard's Run:ai installer step, the SaaS tenant configuration (URL, client ID, client secret), post-install verification (kubectl -n runai get pods, cluster registration in the SaaS UI). The act of writing this from documentation is most of the learning.
- Create two departments (research, production) and two projects under each. Set guaranteed GPU quotas and over-quota allowances per project.
- Submit a workload in research/team-a that exceeds its guaranteed quota; have production/team-b claim its own quota; observe the over-quota job preempted. This is the "Allocate resources between teams with Run:ai" exam task in concrete form.
- Enable Dynamic MIG on a node with runai-adm set node-role --dynamic-mig-enabled <node-name>. Confirm GPU Operator MIG strategy is mixed. Submit MIG-profile workloads.

Track 3 — Cross-scheduler comparison:
Cost note: The BCM AWS GPU worker isn't strictly needed during Week 6 unless you want to do step 12(b). Consider keeping it down to control cost; bring it up for Week 7's MIG mechanics work.
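The quota and preemption behaviour exercised in Track 2 is easier to predict with a toy model in hand. The Python sketch below is not Run:ai's scheduler; it is a simplified assumption that a project claiming GPUs inside its guaranteed quota first uses free GPUs and then reclaims from projects running over their own quota, largest excess first. The function name and the 8-GPU example are hypothetical.

```python
def gpus_to_reclaim(total_gpus, quotas, in_use, project, request):
    """Toy model: GPUs to reclaim from over-quota projects for an in-quota claim.

    Returns a dict of victim project -> number of GPUs reclaimed.
    """
    if in_use.get(project, 0) + request > quotas[project]:
        raise ValueError("this toy model only handles claims within guaranteed quota")
    free = total_gpus - sum(in_use.values())
    shortfall = max(0, request - free)
    # Projects currently above their guaranteed quota, largest excess first.
    over_quota = sorted(
        ((used - quotas[name], name) for name, used in in_use.items()
         if used > quotas[name]),
        reverse=True,
    )
    reclaimed = {}
    for excess, victim in over_quota:
        if shortfall == 0:
            break
        take = min(excess, shortfall)
        reclaimed[victim] = take
        shortfall -= take
    return reclaimed


quotas = {"research/team-a": 4, "production/team-b": 4}
in_use = {"research/team-a": 7, "production/team-b": 0}  # team-a is 3 over quota
print(gpus_to_reclaim(8, quotas, in_use, "production/team-b", 4))
```

In this example one GPU is free, so three of team-a's over-quota GPUs must be reclaimed before team-b's job can start; compare that prediction with what the Run:ai UI shows during the lab.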
- (runai-adm set node-role --dynamic-mig-enabled and the GPU Operator MIG strategy.)
- What inputs does cm-kubernetes-setup need (FQDN, tenant, secret), what does it deploy, what should you verify post-install? If you can't answer this fluently, your runbook needs more time before LaunchPad.

Exam tasks covered:
This week covers three skills that show up across multiple exam domains.
- MIG profiles such as 1g.5gb, 2g.10gb, 3g.20gb, 7g.40gb. Enabling MIG (nvidia-smi -mig 1), creating GPU instances and compute instances, and the exposure of MIG slices to Kubernetes via the GPU Operator MIG Manager (single-strategy vs. mixed-strategy).
- ngc CLI for authentication (API key), pulling images, and pushing to private registries. NGC private registries vs. the public catalog. Image signing and the NVIDIA AI Enterprise variants of common containers (validated, supported).
- nvidia.com/gpu resources, under Run:ai via job specs.

Your environment for MIG: Same BCM-managed AWS K8s cluster you set up in Week 5, with at least one worker sized as p4d.24xlarge (A100) or p5.48xlarge (H100). This week's MIG lab focuses on the lower-level mechanics: nvidia-smi mig directly, GPU Operator MIG Manager, non-Run:ai MIG-profile pods. The Run:ai-orchestrated MIG flow (if applicable) was covered in Week 6 on the external A100 cluster or LaunchPad. Step 3 (DOCA) requires a BlueField environment via LaunchPad. Steps 4–6 (NGC) work anywhere; the BCM AWS cluster is convenient since you're already there.
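Before creating instances, it is worth sanity-checking that a planned set of MIG profiles can fit on the GPU at all. The Python sketch below applies two necessary conditions, assuming the A100/H100 layout of seven compute slices per GPU: total compute slices and total memory must not exceed what the GPU has. It ignores placement rules, so a layout that passes can still be rejected by the driver. The function names are hypothetical, and the authoritative profile list for your GPU is whatever nvidia-smi mig -lgip prints.

```python
import re

COMPUTE_SLICES = 7  # assumption: A100/H100-style GPUs expose seven compute slices


def parse_profile(name):
    """Split a profile name like '2g.20gb' into (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if not match:
        raise ValueError(f"unrecognised MIG profile name: {name}")
    return int(match.group(1)), int(match.group(2))


def fits(profiles, gpu_memory_gb):
    """Necessary (not sufficient) check that the profiles fit on one GPU."""
    compute = sum(parse_profile(p)[0] for p in profiles)
    memory = sum(parse_profile(p)[1] for p in profiles)
    return compute <= COMPUTE_SLICES and memory <= gpu_memory_gb


print(fits(["2g.20gb"] * 3, 80))  # three 2g.20gb instances on an 80 GB GPU
print(fits(["2g.20gb"] * 3, 40))  # the same plan on a 40 GB GPU
print(fits(["2g.10gb"] * 3, 40))  # the 40 GB equivalent
```

The profile names scale with GPU memory, so the 2g.20gb plan used in this week's lab assumes an 80 GB GPU; on a 40 GB A100 the corresponding profile is 2g.10gb.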
- Configure MIG with nvidia-smi directly. Enable MIG (nvidia-smi -mig 1), list profiles (nvidia-smi mig -lgip), create three 2g.20gb GPU instances, then create compute instances inside them. Verify with nvidia-smi. Disable MIG and reset.
- Label the GPU node for the MIG Manager (nvidia.com/mig.config=all-2g.20gb or similar). Watch the MIG Manager pod reconfigure the GPU. Submit a non-Run:ai pod requesting nvidia.com/mig-2g.20gb and confirm it lands.
- If no BlueField environment is available, walk through the DOCA service deployment documentation on docs.nvidia.com end-to-end as a paper exercise.
- Configure the ngc CLI with your API key. Pull a PyTorch container, run it under Docker on a BCM-managed AWS node with --gpus all, and confirm GPU access.

- nvidia-smi -mig 1 / -mig 0 enable/disable cycle, nvidia-smi mig -lgip (list GPU instance profiles), nvidia-smi mig -cgi and -cci to create instances.
- docker run --gpus all and --gpus device=0,1 patterns.
- enroot import, enroot create, enroot start end-to-end, including importing a docker image.
- ngc CLI scenarios, drill the auth and pull flow.

Exam tasks covered:
The exam's troubleshooting block expects fluent, decision-tree-style diagnosis. This week is about converting all the previous weeks' tooling into reflexive diagnostic patterns.
- nvidia-smi: the basics, plus the underused flags: -q, dmon, pmon, nvlink -s, topo -m, --query-gpu for scripting.
- DCGM: dcgmi diag, health checks, the DCGM exporter for Prometheus. Distinguish it from nvidia-smi (DCGM is the supported tool for cluster-scale telemetry and validation).
- nvidia-bug-report.sh: when escalating, NVIDIA support will ask for this output.
- nvidia-smi nvlink -s.
- NCCL_DEBUG=INFO to surface topology and transport choice, ibstat and ibdiagnet for InfiniBand health, GPUDirect Storage probe utilities for storage path validation.
- fio for raw bandwidth, gdscheck and gds_stats for GPUDirect Storage path, file system-specific tools (Lustre lfs check, Weka diagnostics, etc.).
- docker logs, docker inspect, docker events, container runtime logs (journalctl -u containerd), the NVIDIA Container Toolkit logs at /var/log/nvidia-container-toolkit.log, and common failures: missing toolkit, missing driver mounts, capability mismatches.
- dcgmi diag and the diagnostics levels.

Your environment: You have two venues. The BCM-managed AWS cluster (your MIG/inference home) is best for cluster-scale troubleshooting: multi-node NCCL with EFA, BCM-detected node faults, GPU Operator pod failures, real Xid errors on A100/H100. The two-Spark RoCE link is best for the IB-style diagnostic command set: ibstat, iblinkinfo, ib_write_bw/ib_read_bw, ibdev2netdev. Pick the venue per scenario:
- Scenario 1 (NCCL hang): try on both. Sparks give you ibstat/iblinkinfo output to read; AWS gives you a multi-node EFA-flavored hang to diagnose with NCCL debug environment variables.
- Scenarios 2 and 4 (nvidia-smi ERR!, container No devices found): either venue.
- Scenario 3 (NGC pull failures): AWS — you can manipulate VPC routes, security groups, and registry endpoints.
- Scenario 5 (storage benchmarks): AWS — variety of backends (EFS, FSx for Lustre, EBS) gives you real IO-path problems to solve.
- Scenario 6 (Fabric Manager): LaunchPad DGX only. Neither of your environments has NVSwitch.
Bonus drills your RoCE-connected Sparks support (do these — they're high-yield):
- Run nccl-tests with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET,GRAPH. Read the topology output, identify which transport NCCL chose, and force the alternative with NCCL_IB_DISABLE=1 to see the bandwidth difference between RDMA and TCP.
- Take the RoCE link down with ip link set <iface> down. Watch NCCL behavior, recover, document the symptom-to-cause path.
- Run ib_write_bw and ib_read_bw from the perftest package between the two Sparks. These are the same tools you'd use on a true IB fabric for storage and RDMA bandwidth troubleshooting.
- GPUDirect RDMA check: nvidia-smi plus the NCCL debug output should confirm GPU memory is reaching the NIC without a host bounce.
What these drills do not cover (study from documentation): Subnet Manager workflows, LID assignment and routing, ibping/ibhosts/ibnetdiscover topology discovery, partition keys, IB-specific QoS. The DGX Spark cannot be configured for native IB, so these topics are paper-only on this hardware.
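For the NCCL transport drill on the Sparks, a small helper can pull the chosen transport out of a debug log so the RDMA and TCP runs are easy to compare. In the Python sketch below, the environment variable names come from the drill itself; the function names and the sample log lines are illustrative assumptions, and the exact wording of NCCL's log lines varies between versions.

```python
import re


def nccl_debug_env(disable_ib=False):
    """Environment variables for a run with NCCL network and topology logging."""
    env = {"NCCL_DEBUG": "INFO", "NCCL_DEBUG_SUBSYS": "NET,GRAPH"}
    if disable_ib:
        # Forces NCCL off the RDMA path so it falls back to TCP sockets.
        env["NCCL_IB_DISABLE"] = "1"
    return env


def transports(log_text):
    """Return the set of NET/<transport> names mentioned in an NCCL debug log."""
    return set(re.findall(r"NCCL INFO NET/(\w+)", log_text))


# Illustrative log excerpts; host names, interfaces, and addresses are made up.
SAMPLE_RDMA = (
    "spark-1:1234:1250 [0] NCCL INFO NET/IB : Using [0]rocep1s0f0:1/RoCE [RO]\n"
    "spark-1:1234:1250 [0] NCCL INFO Using network IB\n"
)
SAMPLE_TCP = (
    "spark-1:1234:1250 [0] NCCL INFO NET/Socket : Using [0]enp1s0f0:192.168.100.1<0>\n"
    "spark-1:1234:1250 [0] NCCL INFO Using network Socket\n"
)

print(nccl_debug_env(disable_ib=True))
print(transports(SAMPLE_RDMA), transports(SAMPLE_TCP))
```

Record the transport alongside the all_reduce_perf bandwidth for each run in your commands.md so the RDMA versus TCP difference is documented.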
Pick each scenario below and write a one-page runbook. Then practice executing it.
- Scenario 1 (NCCL hang): start with NCCL_DEBUG=INFO, then ibstat/ibdiagnet, then Fabric Manager status, then NVLink status, then check whether one node is stuck.
- Scenario 2: "nvidia-smi returns ERR! on one GPU." Walk through Xid in dmesg, GPU reset, driver reinstall, hardware replacement.
- Scenario 4: "Container starts, but nvidia-smi inside reports No devices found." Check Toolkit configuration, runtime selection (containerd default runtime), --gpus/device plugin annotation, driver visibility.

This is the week where DC Lab Sim earns its keep. Drill until each command's flags are reflex:
- nvidia-smi deep dive: nvidia-smi, nvidia-smi -q, nvidia-smi -q -d ECC, nvidia-smi dmon, nvidia-smi pmon, nvidia-smi nvlink -s, nvidia-smi nvlink -e, nvidia-smi topo -m, nvidia-smi --query-gpu=… for scripted output.
- dcgmi: dcgmi discovery -l, dcgmi diag -r 1 (quick), dcgmi diag -r 2 (medium), dcgmi diag -r 3 (long), dcgmi health, dcgmi stats.
- ibstat, iblinkinfo, ibhosts, ibnetdiscover. Practice reading link state, rate, and counters.
- ipmitool: ipmitool sel list (reading the system event log is a foundational troubleshooting skill), ipmitool sensor, ipmitool chassis status. Even though IPMI leans more AII, the diagnostic mindset transfers.
Caveat for AIO: the simulator's troubleshooting scenarios are framed around AII's domains (cabling, transceivers, burn-in failures). For AIO-specific troubleshooting like Run:ai job preemption, K8s scheduler events, or BCM cmd daemon failures, you'll need real LaunchPad time or a self-built lab — the simulator won't cover those.
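The scripted-output form of nvidia-smi mentioned in the drills is worth practising with a parser, since that is how it gets used in health checks. The Python sketch below queries a handful of standard --query-gpu fields in CSV form and flags GPUs above a temperature threshold. The helper names, the sample output, and the 80 °C threshold are assumptions for illustration; query_gpus needs a node with the NVIDIA driver installed, so the example runs against canned text instead.

```python
import csv
import io
import subprocess

FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu",
          "memory.used", "memory.total"]


def query_gpus():
    """Run nvidia-smi and return its CSV text (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
           "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpus(csv_text):
    """Turn the CSV text into one dict per GPU, keyed by field name."""
    rows = []
    for values in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if values:
            rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows


def hot_gpus(rows, limit_c=80):
    """Indices of GPUs at or above limit_c; non-numeric readings are skipped."""
    hot = []
    for row in rows:
        temp = row["temperature.gpu"]
        if temp.isdigit() and int(temp) >= limit_c:
            hot.append(int(row["index"]))
    return hot


# Illustrative output for two GPUs; values are made up.
SAMPLE = (
    "0, NVIDIA A100-SXM4-40GB, 34, 0, 0, 40960\n"
    "1, NVIDIA A100-SXM4-40GB, 83, 97, 38211, 40960\n"
)

rows = parse_gpus(SAMPLE)
print(hot_gpus(rows))
```

On a real node, replace SAMPLE with query_gpus() and extend FIELDS with whatever nvidia-smi --help-query-gpu lists that matters for your scenario.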
- When do you reach for each of nvidia-smi, dcgmi, nvsm, nvidia-bug-report.sh?

By this point you've covered every task on the blueprint. The last stretch is about reflexes, gaps, and exam mechanics.
- Rehearse from your commands.md. The exam includes hands-on labs in a real cluster environment; muscle memory on cmsh, sbatch, kubectl, and runai submit is what saves you when nerves hit.
- Bring a node from DOWN back to UP in BCM (assume image is fine, network is fine, cmd needs a kick).
- Review easily confused pairs (nvidia-smi vs. dcgmi, GPU Operator vs. Container Toolkit).

If your mock-exam scores are consistently below 75%, push the exam date out two weeks rather than schedule and hope. A retake costs the full exam fee again.
If you ever lose track of where a topic lives in this plan:
| Exam task | Week |
|---|---|
| Describe Mission Control toolkit | 1 |
| Describe data center architecture for AI workloads | 1 |
| BCM Base View monitoring | 2 |
| BCM network config (nodes, DPUs, switches) | 2 |
| BCM compute node categories | 2 |
| BCM patches, firmware, image sync | 3 |
| BCM users, roles, permissions | 3 |
| BCM cluster issue diagnosis | 3 |
| BCM documentation and reports | 3 |
| Troubleshoot BCM | 3 |
| Install Slurm | 4 |
| Administer Slurm cluster | 4 |
| Manage scheduling/allocation via BCM (Slurm) | 4 |
| Deploy training workloads with Slurm | 4 |
| Install/initialize Kubernetes via BCM | 5 |
| Administer Kubernetes | 5 |
| Manage scheduling/allocation via BCM (K8s) | 5 |
| Deploy inference workloads with Kubernetes | 5 |
| Install Run:ai | 6 |
| Administer Run:ai | 6 |
| Deploy training workloads with Run:ai | 6 |
| Deploy inference workloads with Run:ai | 6 |
| Resource allocation across Run:ai/Slurm/K8s | 6 |
| Configure MIG | 7 |
| Deploy DOCA Services on DPU Arm | 7 |
| Deploy containers from NGC | 7 |
| Use system management tools to troubleshoot | 8 |
| Troubleshoot Docker | 8 |
| Troubleshoot Fabric Manager (NVLink/NVSwitch) | 8 |
| Troubleshoot Magnum IO | 8 |
| Troubleshoot storage performance | 8 |
| Troubleshoot NGC container deployment | 8 |
- docs.nvidia.com/base-command-manager
- docs.nvidia.com/ngc
- docs.nvidia.com/datacenter/dcgm
- docs.nvidia.com/doca
- docs.run.ai
- slurm.schedmd.com/documentation.html

Good luck. The exam is hands-on by design: if you've actually done every lab in this plan, the test will feel familiar rather than scary.