NVIDIA NCP-AIO Training Plan

Certification: NVIDIA-Certified Professional: AI Operations (NCP-AIO)
Format: 30 multiple-choice questions + 3 hands-on lab exercises, 120 minutes total
Cost: $500 USD, online proctored
Recommended experience: 2–3 years of data center work with NVIDIA hardware.
Plan length: 8 weeks @ ~8–10 hours/week (≈70 hours total)

Topic Areas % of Exam Topics Covered
Installation and Deployment 31% * Describe the Mission Control toolkit
* Use BCM’s Base View interface to monitor cluster performance, resource utilization, and node health in real time.
* Manage job scheduling and resource allocation using BCM’s workload manager (e.g., SLURM or Kubernetes)
* Apply patches, update firmware, and synchronize software images across cluster nodes using BCM
* Administer user accounts, roles, and permissions to ensure secure access to the cluster using BCM
* Configure and monitor network settings for cluster nodes, DPUs, and switches using BCM
* Diagnose and resolve cluster issues, such as job failures, node outages, or resource bottlenecks, using BCM.
* Use BCM to organize and configure compute nodes into categories based on hardware or workload requirements.
* Using BCM, maintain documentation and generate reports on cluster usage, performance, and issues.
* Install and initialize Kubernetes on NVIDIA hosts using BCM
* Deploy DOCA Services on DPU Arm
* Install Run:ai
* Install Slurm
Administration 23% * Administer Slurm cluster.
* Describe data center architecture for AI Workloads
* Administer Run:ai
* Administer Kubernetes
* Configure MIG
Workload Management 23% * Deploy inference workloads with Kubernetes
* Deploy inference workloads with Run:ai
* Deploy training workloads with Slurm
* Deploy training workloads with Run:ai
* Use system management tools to troubleshoot issues
* Allocate resources between teams with Run:ai, Slurm and Kubernetes
* Deploy containers from NGC
Troubleshooting and Optimization 23% * Troubleshoot Docker
* Troubleshoot the fabric manager service for NVLink and NVSwitch systems
* Troubleshoot Base Command Manager
* Troubleshoot Magnum IO components
* Troubleshoot storage performance
* Troubleshoot the deployment of a container from NGC

How to Use This Plan

Each week is built around a tight cluster of related exam tasks. Every module has five parts:

  1. Objectives — the exact NCP-AIO tasks the week maps to.
  2. Concepts to master — the technical content you need fluency in.
  3. Resources — official NVIDIA docs, courses, and open-source references.
  4. Hands-on lab — a practical exercise. The exam has a live cluster lab section, so command-line muscle memory matters more than reading.
  5. Self-check — questions and tasks to confirm you can perform, not just recognize.

If you do not have hardware, NVIDIA LaunchPad provides free, time-boxed access to real NVIDIA stacks (BCM, DGX, BlueField, Run:ai). Pair it with a small cloud GPU instance (one A100 or L40S is enough for MIG and single-node practice).

The exam's four domains and approximate weights:

Domain Weight
Installation and Deployment 31%
Administration 23%
Workload Management 23%
Troubleshooting and Optimization 23%

Pre-Work (Days 1–3, before Week 1)

Before starting, confirm your baseline. If any of these feel shaky, address them first.

Pre-work resources:

Set up your environment now:

Note on DC Lab Sim (dclabsim.com / Seanbo5386/dc-lab-sim)

If you have DC Lab Sim installed, treat it as a high-value but partial supplement to this plan. Important context:

Where it directly supports this AIO plan:

Where it is silent on AIO content (do not rely on the simulator for these — use LaunchPad, real labs, and docs):

In the weeks below, look for Simulator drills callouts where it earns its keep.


Your Lab Environment Mapping

You have three complementary environments:

Primary: BCM-managed hybrid cluster.

This is where Weeks 2, 3, 4, 5, and 7 happen.

Secondary: External A100 cluster with Run:ai pre-installed.

This is where Week 6 hands-on workload submission and as much administration as your role permits happen.

Tertiary: Two DGX Sparks running MicroK8s, connected back-to-back via QSFP112 in RoCE mode. GB10 Grace Blackwell, 128 GB unified memory each, ConnectX-7 at 200 Gb/s. The Spark's CX-7 is firmware-locked to Ethernet (per the official DGX Spark User Guide: “The DGX Spark CX-7 ports support ethernet configuration only”). The Sparks give you real RDMA, GPUDirect RDMA, and NCCL diagnostics over RoCE — useful for cheap K8s iteration in Week 5 and Week 8 RoCE-flavored troubleshooting. Explicitly not supported by Run:ai (DGX Spark is on the Run:ai unsupported list per official docs).

This covers the bulk of the plan with real hardware. Below is what fits where, and where to substitute.

What to run where

Plan content Run on Notes
Week 2–3 BCM (Base View, cmsh, categories, images, users, reports) BCM head node + AWS-deployed compute nodes Best possible setup. Use cheap g5.xlarge or g6.xlarge nodes for image and category practice; bring up p4d/p5 only for the workload-heavy exercises.
Week 4 Slurm install + multi-node training BCM + AWS GPU nodes Provision two GPU nodes, install Slurm with Enroot/Pyxis through BCM's Slurm role. Cross-node NCCL works over AWS EFA.
Week 5 Kubernetes admin + inference Primary: BCM-managed AWS K8s cluster (deployed via cm-kubernetes-setup). Optional alternative: two-Spark MicroK8s for cheap, fast iteration on basic kubectl/GPU Operator concepts. Do the exam-relevant K8s administration work — BCM-deployed K8s, GPU Operator manual install, Triton/NIM inference deployments, ResourceQuotas — on the BCM AWS cluster. Use Sparks only for tinkering on concepts you can test cheaply.
Week 6 Run:ai workload submission, project navigation, fractional GPU, MIG profile requests External A100 Run:ai cluster Submit workloads via runai submit, navigate UI within your role's visibility, exercise fractional GPU and (if cluster admin has enabled it) Dynamic MIG profile requests. Request observer/department-admin access from the cluster owner if possible — closes most of the admin-task gap.
Week 6 Run:ai install + cluster-level admin (creating departments, projects, quotas, node roles) LaunchPad Run:ai lab + documentation study The install task is on the exam. You cannot install Run:ai on your BCM AWS cluster (no license) and you cannot install on the A100 cluster (already installed by someone else). LaunchPad is the only hands-on install venue. Document the install flow as your own runbook from BCM/Run:ai docs first, then execute it once on LaunchPad.
Week 7 MIG (low-level nvidia-smi mig, GPU Operator MIG Manager mode switching) BCM-managed AWS K8s cluster (size at least one worker as p4d.24xlarge for A100 or p5.48xlarge for H100) Needs shell access to the GPU node and admin access to the K8s cluster — both of which you have on your BCM AWS cluster.
Week 7 Run:ai-orchestrated MIG (--mig-profile workload requests) External A100 Run:ai cluster — only if the cluster admin has enabled Dynamic MIG on a node Otherwise covered on LaunchPad alongside the Run:ai install lab.
Week 7 NGC container deployment Anywhere NGC steps work anywhere. The BCM AWS cluster is convenient since you're already there.
Week 8 nvidia-smi / dcgmi / Docker troubleshooting Either Spark or AWS Both work. Spark is closer at hand.
Week 8 RDMA diagnostics, NCCL over RDMA, GPUDirect RDMA Two-Spark RoCE link for the diagnostic command set; AWS EFA on the BCM cluster for a different RDMA flavor Sparks: real ibstat/iblinkinfo/ib_write_bw output, NCCL reporting NET/IB over RoCE. AWS EFA: NCCL works (NCCL_PROTO, EFA-specific debug), but ibstat won't show Link layer: InfiniBand because EFA is its own transport. Neither is native IB; for that you need LaunchPad.

Gaps your environment cannot cover

For these, plan to use NVIDIA LaunchPad (request the labs early — they queue), or supplement with the simulator and documentation walkthroughs.

  1. DOCA Services on DPU Arm (Week 7). No BlueField DPU in Spark or in standard AWS GPU instances. LaunchPad's BlueField environment is the realistic option — request it in advance. Without LaunchPad, this becomes a paper exercise from the DOCA documentation; the DC Lab Sim simulator will not help here either.
  2. Native InfiniBand experience (Weeks 4 and 8). Neither environment gives you a true switched IB fabric. AWS uses EFA — a different RDMA transport that NCCL works over but that doesn't surface Link layer: InfiniBand in ibstat, has no Subnet Manager, and uses different diagnostic tools (fi_info, aws-efa-installer checks). DGX Spark is RoCE — RDMA over Ethernet, also not native IB. For LIDs, OpenSM, ibping, ibhosts, and ibnetdiscover topology discovery, you need LaunchPad DGX environments with real Quantum-2 IB. Cover the IB-specific topics from documentation; for hands-on, LaunchPad is the answer.
  3. Fabric Manager (Week 8). Fabric Manager is the NVLink/NVSwitch daemon. Neither AWS GPU nodes nor Sparks have NVSwitch — Spark has on-die NVLink C2C between Grace and Blackwell (no switched fabric), and standard AWS GPU instances aren't NVSwitch-equipped (DGX Cloud SuperPOD-class instances are, but those aren't what you're spinning up). For real Fabric Manager troubleshooting drills you need a DGX A100/H100/H200/GB200 — LaunchPad.

ARM caveat for Spark labs

DGX Spark uses an ARM Grace CPU. Most modern NGC containers ship multi-arch images, but verify before each lab — docker manifest inspect <image> shows whether linux/arm64 is present. The symptom of a missing ARM variant is exec format error at container start. When in doubt for x86-only containers, fall back to your AWS-deployed nodes.
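
The architecture check can be scripted so it runs before each lab. The following is a minimal sketch, assuming the JSON printed by docker manifest inspect for a multi-arch image contains a "manifests" array with a "platform" object per entry. The SAMPLE text and the has_platform name are illustrative, not taken from any NVIDIA tool.

```python
import json


def has_platform(manifest_json: str, os_name: str = "linux", arch: str = "arm64") -> bool:
    """Return True if a manifest list advertises the given OS/architecture pair."""
    doc = json.loads(manifest_json)
    # Multi-arch images carry a "manifests" array; single-arch images do not,
    # in which case this sketch conservatively reports False.
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return True
    return False


# Hypothetical, trimmed sample of `docker manifest inspect <image>` output.
SAMPLE = """
{
  "schemaVersion": 2,
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux"}}
  ]
}
"""

if __name__ == "__main__":
    print(has_platform(SAMPLE))
```

In practice you would feed it the captured output of the real command, and fall back to the AWS-deployed nodes when it returns False.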

One-time prep checklist for your environment

Before Week 1, confirm each of these works once so you don't lose a study session to setup:

BCM and AWS K8s cluster (primary environment):

External A100 Run:ai cluster (secondary environment):

LaunchPad (for install + admin tasks you can't do locally):

Sparks (tertiary environment):

General:


Week 1 — Foundations: Data Center Architecture & Mission Control

Exam tasks covered:

This week is conceptual. You are building the mental map that every later task hangs from.

Concepts to master

Resources

Hands-on lab

This week is mostly reading, but produce two artifacts:

  1. A one-page diagram of a generic AI factory showing all four planes, with the names of the NVIDIA components that live on each.
  2. A table mapping each Mission Control component to the operational task it owns (provisioning, scheduling, observability, lifecycle).

Self-check


Week 2 — BCM Fundamentals: Base View, Nodes, and Networks

Exam tasks covered:

Concepts to master

Resources

Hands-on lab

In LaunchPad or a local BCM install:

  1. Open Base View. Locate the cluster overview, drill into a single compute node, and find: GPU utilization, memory pressure, NIC throughput, and the most recent failed health check.
  2. In cmsh: list categories (category list), show the image bound to your default category (category use default; show), and list nodes assigned to it.
  3. Create a new category called gpu-training, clone the image of an existing category into it, and assign one node to the new category. Reboot the node and confirm it provisions with the new image.
  4. Add a custom metric threshold (e.g., GPU temperature > 80°C raises a warning) and verify the alert appears in Base View.
  5. Inspect the network objects (network list). Identify which network the management interface uses vs. the high-speed fabric. If a DPU or BlueField device is present, show its interface objects.
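
To sanity-check the threshold from step 4 outside Base View, the same comparison can be run against raw nvidia-smi output. This is a sketch only: it assumes the text comes from nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits, and the SAMPLE readings are made up. It does not configure anything in BCM.

```python
import csv
import io


def hot_gpus(csv_text: str, limit_c: float = 80.0) -> list[int]:
    """Return indices of GPUs whose reported temperature exceeds limit_c."""
    hot = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        index, temp = row[0].strip(), row[1].strip()
        try:
            if float(temp) > limit_c:
                hot.append(int(index))
        except ValueError:
            continue  # non-numeric readings such as "[N/A]"
    return hot


# Hypothetical readings in "index, temperature" form.
SAMPLE = "0, 64\n1, 83\n2, [N/A]\n3, 80\n"

if __name__ == "__main__":
    print(hot_gpus(SAMPLE))
```

Note that the comparison is strictly greater-than, matching the "> 80°C" wording of the lab step, so a GPU at exactly 80 is not flagged.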

Self-check


Week 3 — BCM Advanced: Lifecycle, Users, Reports, and Troubleshooting

Exam tasks covered:

This week is where BCM stops feeling like a GUI and starts feeling like an operational platform.

Concepts to master

Resources

Hands-on lab

  1. Inside a software image, install a new package (e.g., htop) using cm-chroot-sw-img. Reboot a node assigned to that image and confirm the package is present.
  2. Create a new BCM admin user with a custom profile that can reboot nodes and view monitoring data, but cannot modify networks or images. Log in as that user and verify the restrictions hold.
  3. Generate a cluster utilization report covering the last 7 days. Export it as PDF or CSV.
  4. Stop cmd on a single compute node. Watch how Base View represents the loss. Restart it. Inspect /var/log/cmdaemon on both head and compute to see the reconnect.
  5. Force a node into CLOSED state and bring it back. Document the exact steps in your commands.md.

Self-check


Week 4 — Slurm: Install, Administer, and Run Training Workloads

Exam tasks covered:

Slurm is the historical heavyweight in HPC and remains the default choice for multi-node training. The exam expects fluency in submission, scheduling, and GRES configuration for GPUs.

Concepts to master

Resources

Hands-on lab

  1. Install Slurm into your BCM cluster. Confirm sinfo, squeue, and scontrol show nodes work end-to-end.
  2. Configure GRES on at least one GPU node. Submit a job that requests --gres=gpu:1 and prints nvidia-smi from inside the allocation. Confirm only the requested GPUs are visible.
  3. Create two partitions, interactive (short, high priority) and train (long, lower priority). Configure a QOS that limits the interactive partition to jobs ≤ 1 hour.
  4. Pull an NGC PyTorch container with enroot import. Submit a single-node training job using srun --container-image (Pyxis) that runs a tiny model for one epoch. Confirm GPU utilization in nvidia-smi.
  5. Run a 2-node NCCL all-reduce test under Slurm. Confirm bandwidth numbers via nccl-tests. This is the muscle memory you want for the lab portion of the exam.
  6. Practice troubleshooting: kill slurmd on one node, observe sinfo showing the node as down, inspect slurmctld logs, restart, verify recovery.
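
When reading nccl-tests output in step 5, it helps to know how the two bandwidth columns relate. The sketch below uses the all-reduce relation described in the nccl-tests performance notes: algorithm bandwidth is size divided by time, and bus bandwidth scales it by 2(n-1)/n for n ranks. The function name and the example numbers are illustrative only.

```python
def allreduce_bandwidth(size_bytes: int, seconds: float, ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-reduce of size_bytes across ranks."""
    if ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    algbw = size_bytes / seconds / 1e9        # data size over wall time
    busbw = algbw * 2 * (ranks - 1) / ranks   # correction factor for all-reduce
    return algbw, busbw


if __name__ == "__main__":
    # Illustrative numbers: 8 GB reduced in 0.1 s across 16 ranks.
    algbw, busbw = allreduce_bandwidth(8_000_000_000, 0.1, 16)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure to compare against link speed, because it stays roughly constant as the rank count changes.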

Simulator drills (DC Lab Sim)

Before moving to the next week, run these in DC Lab Sim until each feels automatic:

Self-check


Week 5 — Kubernetes: Install via BCM, Administer, Run Inference

Exam tasks covered:

For inference and most modern serving stacks, Kubernetes is the default. NVIDIA's GPU Operator is the connective tissue between vanilla K8s and a GPU node.

Concepts to master

Resources

Hands-on lab

Your environment: Run the exam-relevant work — BCM-deployed K8s, manual GPU Operator install, Triton/NIM, ResourceQuota — on your BCM-managed AWS K8s cluster (deployed via cm-kubernetes-setup). Your Sparks/MicroK8s are still useful for cheap iteration on basic concepts (kubectl reps, simple pod manifests), but the AWS cluster is what matches the exam's BCM-centric framing. The same AWS cluster will host the plain-Kubernetes leg of the Week 6 scheduler comparison and the MIG work in Week 7, so investing in it here pays off.
  1. Use BCM to bring up a K8s control plane on one node and a worker on another. Confirm kubectl get nodes shows both Ready.
  2. Install the GPU Operator via Helm. Watch all the pods come up in gpu-operator namespace. Verify with kubectl describe node that GPU resources are advertised.
  3. Deploy a sample CUDA pod (nvidia/cuda:12.x-base) requesting nvidia.com/gpu: 1. Confirm nvidia-smi works from inside.
  4. Deploy Triton Inference Server with a sample model. Issue an inference request from a client pod. Inspect Triton's metrics endpoint.
  5. Create a namespace inference-team-a, set a ResourceQuota capping it at 4 GPUs, and verify a pod requesting 5 is rejected.
  6. Drain a GPU node, confirm the workloads reschedule, then uncordon it.
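
The quota in step 5 can be sketched as a manifest plus the admission arithmetic behind the rejection. This is a minimal illustration: the object name gpu-quota is an assumption, and would_admit only mirrors the "used + requested must not exceed hard" comparison, not the full Kubernetes admission path. Extended resources such as nvidia.com/gpu are quota-limited through the requests. prefix.

```python
import json


def gpu_quota_manifest(namespace: str, gpus: int) -> dict:
    """Build a ResourceQuota object capping GPU requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # "gpu-quota" is a hypothetical name chosen for this example.
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpus)}},
    }


def would_admit(hard: int, used: int, requested: int) -> bool:
    """Mirror the quota comparison: a request is admitted only if it fits."""
    return used + requested <= hard


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output can be piped to
    # `kubectl apply -f -`.
    print(json.dumps(gpu_quota_manifest("inference-team-a", 4), indent=2))
    print(would_admit(hard=4, used=0, requested=5))
```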

Self-check


Week 6 — Run:ai: Install, Administer, and Multi-Tenant Resource Allocation

Exam tasks covered:

Run:ai is the orchestration layer that NVIDIA acquired in 2024 and integrated tightly into the Mission Control stack. It sits on top of Kubernetes and provides the multi-tenant GPU sharing and fractional GPU features that vanilla K8s does not.

Concepts to master

Resources

Hands-on lab

Your environment for this week is split:
- External A100 Run:ai cluster (user-level access): workload submission, project navigation, fractional GPU, possibly Dynamic MIG profile requests.
- LaunchPad Run:ai lab: install task end-to-end, plus any admin tasks (creating departments, projects, quotas, configuring node roles) that your role on the A100 cluster doesn't allow.
- Documentation study: read the BCM-Run:ai install flow as a runbook before you touch LaunchPad, so the LaunchPad session is execution rather than first-time learning.

Track 1 — Workload submission and use (on the external A100 Run:ai cluster):

  1. runai login and confirm runai list projects shows your assigned projects. Submit a test training job (runai submit hello --image nvcr.io/nvidia/pytorch:24.x -g 1 -- python -c "import torch; print(torch.cuda.is_available())"). Read runai logs hello.
  2. Submit an interactive workspace (Jupyter) using runai submit --interactive --jupyter. Connect to it via the URL Run:ai prints. Confirm GPU access from inside the notebook.
  3. Fractional GPU exercise. Submit two workloads each requesting --gpu-memory 20G against a single 80GB A100. Confirm both run concurrently. Inspect with runai describe job <name> to see how Run:ai represents the partial allocation.
  4. Dynamic MIG exercise (if available). If the cluster admin has enabled Dynamic MIG on a node, submit a workload with --mig-profile 2g.20gb and confirm Run:ai configures the slice. If Dynamic MIG isn't enabled on this cluster, skip this step here and do it on LaunchPad in Track 2.
  5. Submit an inference workload (runai submit inference-test --image <triton-image> --port 8000:8000 --service-type loadbalancer). Send a sample inference request. Observe the Run:ai inference workload type's lifecycle (deployment vs. job).
  6. Navigate the Run:ai UI within your role's visibility: your projects, your job history, your quota usage. Note what you can't see — that's your gap to fill in Track 2.

Track 2 — Install and admin (on LaunchPad):

  1. Before the LaunchPad session, write a runbook from the BCM and Run:ai docs covering: prerequisites (K8s version, GPU Operator version, Ingress, Prometheus, cert-manager), the cm-kubernetes-setup wizard's Run:ai installer step, the SaaS tenant configuration (URL, client ID, client secret), post-install verification (kubectl -n runai get pods, cluster registration in the SaaS UI). The act of writing this from documentation is most of the learning.
  2. In the LaunchPad Run:ai lab, execute your runbook end-to-end. Where the lab environment differs from BCM, note the differences in your runbook. The exam tests the BCM-installer flow specifically.
  3. In LaunchPad, create two departments (research, production) and two projects under each. Set guaranteed GPU quotas and over-quota allowances per project.
  4. Cross-team allocation exercise. Submit a job under research/team-a that exceeds its guaranteed quota; have production/team-b claim its own quota; observe the over-quota job preempted. This is the “Allocate resources between teams with Run:ai” exam task in concrete form.
  5. Configure a node for Dynamic MIG: runai-adm set node-role --dynamic-mig-enabled <node-name>. Confirm GPU Operator MIG strategy is mixed. Submit MIG-profile workloads.
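
The preemption you should see in step 4 can be reasoned about with a toy model before running it. The sketch below is an assumption-laden simplification, not Run:ai's scheduler: it treats GPUs as whole units, ignores priorities and fairness weights, and takes over-quota GPUs from the project furthest above its guarantee first. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    guaranteed: int  # GPUs the project is always entitled to
    running: int     # GPUs it currently holds


def reclaim_plan(projects: list[Project], total_gpus: int,
                 claimant: str, request: int) -> dict[str, int]:
    """Return {project: GPUs to preempt} so claimant can start its in-quota share."""
    by_name = {p.name: p for p in projects}
    me = by_name[claimant]
    free = total_gpus - sum(p.running for p in projects)
    # Only the part of the request inside the guarantee may trigger preemption.
    in_quota = max(0, min(request, me.guaranteed - me.running))
    needed = max(0, in_quota - free)
    plan: dict[str, int] = {}
    donors = sorted((p for p in projects if p.name != claimant),
                    key=lambda p: p.running - p.guaranteed, reverse=True)
    for p in donors:
        if needed == 0:
            break
        take = min(max(0, p.running - p.guaranteed), needed)
        if take:
            plan[p.name] = take
            needed -= take
    return plan


if __name__ == "__main__":
    cluster = [Project("research/team-a", guaranteed=4, running=8),
               Project("production/team-b", guaranteed=4, running=0)]
    print(reclaim_plan(cluster, total_gpus=8, claimant="production/team-b", request=4))
```

Compare the model's prediction with what the Run:ai UI actually shows; the differences are where the real scheduler's policies come in.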

Track 3 — Cross-scheduler comparison:

  1. Take a single training job spec and submit equivalent versions to: (a) Slurm on your BCM-managed AWS Slurm cluster from Week 4, (b) plain Kubernetes on the BCM K8s cluster (as a non-Run:ai pod), (c) Run:ai on the external A100 cluster. Document the resource allocation flow in each. The exam directly tests “Allocate resources between teams with Run:ai, Slurm, and Kubernetes.”
Cost note: The BCM AWS GPU worker isn't strictly needed during Week 6 unless you want to do Track 3, step 1(b). Consider keeping it down to control cost; bring it up for Week 7's MIG mechanics work.
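
For the Track 3 comparison, it can help to keep the three GPU-request forms side by side. The sketch below only covers the GPU count, using the forms that already appear in this plan (--gres=gpu:N for Slurm, nvidia.com/gpu for Kubernetes, -g N for runai submit). Image handling, partitions, namespaces, and projects are deliberately left out, and the function name is illustrative.

```python
def gpu_request(gpus: int) -> dict:
    """Express the same whole-GPU request for each of the three schedulers."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        # Slurm: generic resource flag for sbatch/srun.
        "slurm": f"--gres=gpu:{gpus}",
        # Kubernetes: extended resource under the container's limits.
        "kubernetes": {"resources": {"limits": {"nvidia.com/gpu": gpus}}},
        # Run:ai CLI: short GPU flag on `runai submit`.
        "runai": f"-g {gpus}",
    }


if __name__ == "__main__":
    for scheduler, form in gpu_request(2).items():
        print(f"{scheduler}: {form}")
```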

Self-check


Week 7 — GPU, DPU, and Container Specifics: MIG, DOCA, and NGC

Exam tasks covered:

This week covers three skills that show up across multiple exam domains.

Concepts to master

Resources

Hands-on lab

Your environment for MIG: Same BCM-managed AWS K8s cluster you set up in Week 5, with at least one worker sized as p4d.24xlarge (A100) or p5.48xlarge (H100). This week's MIG lab focuses on the lower-level mechanics — nvidia-smi mig directly, GPU Operator MIG Manager, non-Run:ai MIG-profile pods. The Run:ai-orchestrated MIG flow (if applicable) was covered in Week 6 on the external A100 cluster or LaunchPad. Step 3 (DOCA) requires a BlueField environment via LaunchPad. Steps 4–6 (NGC) work anywhere; the BCM AWS cluster is convenient since you're already there.
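  1. On the A100 or H100 worker, drop down to nvidia-smi directly. Enable MIG (nvidia-smi -mig 1), list profiles (nvidia-smi mig -lgip), create three 2g.20gb GPU instances, then create compute instances inside them. Profile names depend on GPU memory: on the 40 GB A100s in p4d.24xlarge the equivalent profile is 2g.10gb, so use whatever nvidia-smi mig -lgip lists. Verify with nvidia-smi. Disable MIG and reset.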
  2. With the GPU Operator MIG Manager installed, switch the node from full-GPU mode to MIG mixed-strategy by labeling the node (nvidia.com/mig.config=all-2g.20gb or similar). Watch the MIG Manager pod reconfigure the GPU. Submit a non-Run:ai pod requesting nvidia.com/mig-2g.20gb and confirm it lands.
  3. If LaunchPad has a BlueField environment available, log into the DPU's Arm OS, list installed DOCA services, and deploy a sample DOCA service container. If no DPU is available, work through the DOCA Services tutorial on docs.nvidia.com end-to-end as a paper exercise.
  4. Configure the ngc CLI with your API key. Pull a PyTorch container, run it under Docker on a BCM-managed AWS node with --gpus all, and confirm GPU access.
  5. Pull an NGC Helm chart for Triton, customize values, and deploy to your BCM-managed K8s cluster.
  6. Practice the same NGC container running under all three runtimes: Docker (on a BCM AWS node), Enroot/Slurm (on your Week 4 Slurm cluster), Kubernetes (on the BCM K8s cluster). Note differences in invocation.
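
The "three 2g.20gb instances" figure in step 1 follows from simple slice arithmetic. The sketch below is a simplified upper bound for a single profile repeated on one GPU, assuming 7 compute slices per GPU and plain NgMgb profile names. Real placement rules are stricter in mixed layouts, so treat the output of nvidia-smi mig -lgip as authoritative.

```python
import re

# Matches plain profile names such as "2g.20gb"; suffixed variants are not handled.
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def max_homogeneous_instances(profile: str, gpu_memory_gb: int = 80,
                              compute_slices: int = 7) -> int:
    """Upper bound on how many identical MIG instances of `profile` fit on one GPU."""
    match = PROFILE_RE.match(profile)
    if not match:
        raise ValueError(f"unrecognised MIG profile name: {profile!r}")
    slices, memory_gb = int(match.group(1)), int(match.group(2))
    # Limited by whichever runs out first: compute slices or memory.
    return min(compute_slices // slices, gpu_memory_gb // memory_gb)


if __name__ == "__main__":
    for name in ("1g.10gb", "2g.20gb", "3g.40gb", "7g.80gb"):
        print(name, max_homogeneous_instances(name))
```

For a 40 GB GPU, pass gpu_memory_gb=40 together with the matching profile name, for example 2g.10gb.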

Simulator drills (DC Lab Sim)

Self-check


Week 8 — Troubleshooting Deep Dive

Exam tasks covered:

The exam's troubleshooting block expects fluent, decision-tree-style diagnosis. This week is about converting all the previous weeks' tooling into reflexive diagnostic patterns.

Concepts to master

Resources

Hands-on lab — build your decision trees

Your environment: You have two venues. The BCM-managed AWS cluster (your MIG/inference home) is best for cluster-scale troubleshooting — multi-node NCCL with EFA, BCM-detected node faults, GPU Operator pod failures, real Xid errors on A100/H100. The two-Spark RoCE link is best for the IB-style diagnostic command set — ibstat, iblinkinfo, ib_write_bw/ib_read_bw, ibdev2netdev. Pick the venue per scenario:

- Scenario 1 (NCCL hang): try on both. Sparks give you ibstat/iblinkinfo output to read; AWS gives you a multi-node EFA-flavored hang to diagnose with NCCL debug environment variables.
- Scenarios 2 and 4 (nvidia-smi ERR!, container No devices found): either venue.
- Scenario 3 (NGC pull failures): AWS — you can manipulate VPC routes, security groups, and registry endpoints.
- Scenario 5 (storage benchmarks): AWS — variety of backends (EFS, FSx for Lustre, EBS) gives you real IO-path problems to solve.
- Scenario 6 (Fabric Manager): LaunchPad DGX only. Neither of your environments has NVSwitch.

Bonus drills your RoCE-connected Sparks support (do these — they're high-yield):

What these drills do not cover (study from documentation): Subnet Manager workflows, LID assignment and routing, ibping/ibhosts/ibnetdiscover topology discovery, partition keys, IB-specific QoS. The DGX Spark cannot be configured for native IB, so these topics are paper-only on this hardware.

Pick each scenario below and write a one-page runbook. Then practice executing it.

  1. “Distributed training hangs on AllReduce.” Where do you look first? NCCL_DEBUG=INFO, then ibstat/ibdiagnet, then Fabric Manager status, then NVLink status, then check whether one node is stuck.
  2. “nvidia-smi returns ERR! on one GPU.” Walk through Xid in dmesg, GPU reset, driver reinstall, hardware replacement.
  3. “NGC container fails to pull on three nodes but succeeds on five.” Auth, network, NGC org/team, image tag pinning.
  4. “Container starts but nvidia-smi inside reports No devices found.” Toolkit configuration, runtime selection (containerd default runtime), --gpus/device plugin annotation, driver visibility.
  5. “Storage benchmark shows 1 GB/s when the spec says 50 GB/s.” Single client vs. parallel, GPUDirect Storage enabled, mount options, file system config, network path.
  6. “Fabric Manager fails to start after a driver upgrade.” Version pinning, package state, log inspection, service order on boot.
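
The first step of scenario 2 (finding Xid events in dmesg) is easy to script for your runbook. The sketch below assumes the common kernel log form "NVRM: Xid (PCI:<address>): <code>, ..."; the SAMPLE lines are fabricated for illustration, and the exact message text varies by driver version.

```python
import re

# Captures the PCI address and the numeric Xid code from an NVRM log line.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def xid_events(dmesg_text: str) -> list[tuple[str, int]]:
    """Return (pci_address, xid_code) pairs in the order they appear."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(dmesg_text)]


# Hypothetical dmesg excerpt, not captured from a real system.
SAMPLE = (
    "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4021, "
    "GPU has fallen off the bus.\n"
    "[ 1240.000001] usb 1-1: new high-speed USB device number 2 using xhci_hcd\n"
    "[ 1300.123456] NVRM: Xid (PCI:0000:af:00): 48, pid=5120, "
    "An uncorrectable double bit error (DBE) has been detected on GPU\n"
)

if __name__ == "__main__":
    for address, code in xid_events(SAMPLE):
        print(address, code)
```

Mapping each code to an action (reset, driver reinstall, hardware replacement) is the part to write by hand from NVIDIA's Xid documentation.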

Simulator drills (DC Lab Sim) — heaviest overlap of the plan

This is the week where DC Lab Sim earns its keep. Drill until each command's flags are reflex:

Caveat for AIO: the simulator's troubleshooting scenarios are framed around AII's domains (cabling, transceivers, burn-in failures). For AIO-specific troubleshooting like Run:ai job preemption, K8s scheduler events, or BCM cmd daemon failures, you'll need real LaunchPad time or a self-built lab — the simulator won't cover those.

Self-check


Final Week — Mock Exam, Lab Drills, and Test-Day Logistics

By this point you've covered every task on the blueprint. The last stretch is about reflexes, gaps, and exam mechanics.

Mock exam regimen

Lab drill targets (do all of these in under 15 minutes each)

Test-day logistics

Two-week buffer rule

If your mock-exam scores are consistently below 75%, push the exam date out two weeks rather than schedule and hope. A retake costs the full exam fee again.


Quick-Reference: Task → Week Map

If you ever lose track of where a topic lives in this plan:

Exam task Week
Describe Mission Control toolkit 1
Describe data center architecture for AI workloads 1
BCM Base View monitoring 2
BCM network config (nodes, DPUs, switches) 2
BCM compute node categories 2
BCM patches, firmware, image sync 3
BCM users, roles, permissions 3
BCM cluster issue diagnosis 3
BCM documentation and reports 3
Troubleshoot BCM 3
Install Slurm 4
Administer Slurm cluster 4
Manage scheduling/allocation via BCM (Slurm) 4
Deploy training workloads with Slurm 4
Install/initialize Kubernetes via BCM 5
Administer Kubernetes 5
Manage scheduling/allocation via BCM (K8s) 5
Deploy inference workloads with Kubernetes 5
Install Run:ai 6
Administer Run:ai 6
Deploy training workloads with Run:ai 6
Deploy inference workloads with Run:ai 6
Resource allocation across Run:ai/Slurm/K8s 6
Configure MIG 7
Deploy DOCA Services on DPU Arm 7
Deploy containers from NGC 7
Use system management tools to troubleshoot 8
Troubleshoot Docker 8
Troubleshoot Fabric Manager (NVLink/NVSwitch) 8
Troubleshoot Magnum IO 8
Troubleshoot storage performance 8
Troubleshoot NGC container deployment 8

Core Resources at a Glance

Good luck. The exam is hands-on by design — if you've actually done every lab in this plan, the test will feel familiar rather than scary.
