Differences

This shows you the differences between two versions of the page.

--- wiki:ai:ncp-aio-lab-training [2026/05/18 16:08] – [1.1 Getting into BCM] ddehamer
+++ wiki:ai:ncp-aio-lab-training [2026/05/18 16:32] (current) – ddehamer
@@ Line 19: / Line 19: @@
 [head]% help
 [head]% main             # return to the top-level prompt
-[head->main]% licenseinfo
+[head]% status           # overall cluster status (services, license, HA)
-[head->main]% partition use base
+[head]% partition use base
 </code>
@@ Line 678: / Line 678: @@
 | Multi-node training slow               | ''NCCL_DEBUG=INFO''                      | IB link state, topo, GPU<->NIC                  |
 | ECC errors / Xid 48/63/64              | ''nvidia-smi -q -d ECC''                 | Drain node, page retirement                     |
+----
+===== 5. Docker & Container Runtimes =====
+Docker isn't a standalone domain on the NCP-AIO blueprint, but it's woven through Installation/Deployment and Troubleshooting. The exam treats Docker as the **substrate** — what gets tested is "does this container have GPU access?" and "how do I diagnose it when something's wrong?" Know the runtime layer for standalone GPU servers (Docker) and for Kubernetes nodes (containerd).
+==== 5.1 The install order — memorize it ====
+  - **NVIDIA driver** (host kernel module)
+  - **Docker** (or containerd for k8s nodes)
+  - **NVIDIA Container Toolkit** (provides the ''nvidia'' runtime shim)
+  - **Configure the runtime** so that Docker or containerd registers the ''nvidia'' runtime
+<code bash>
+# Configure Docker to use the nvidia runtime
+sudo nvidia-ctk runtime configure --runtime=docker
+sudo systemctl restart docker
+# Configure containerd (for Kubernetes nodes)
+sudo nvidia-ctk runtime configure --runtime=containerd
+sudo systemctl restart containerd
+# Verify the runtime is wired in
+docker info | grep -i runtime
+grep -A5 nvidia /etc/containerd/config.toml
+</code>
+==== 5.2 NGC registry authentication ====
+Pulling NVIDIA-curated containers (PyTorch, TensorFlow, CUDA, RAPIDS, NeMo, Triton) from ''nvcr.io'' requires auth. **This is the classic trip-up question on the exam.**
+<code bash>
+docker login nvcr.io
+# Username: $oauthtoken           <-- LITERAL string, with the dollar sign
+# Password: <your NGC API key>    <-- from ngc.nvidia.com
+# Or non-interactive (for scripts / cmsh image builds)
+echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
+# Then pull
+docker pull nvcr.io/nvidia/pytorch:24.10-py3
+docker pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
+docker pull nvcr.io/nvidia/tensorrt:24.10-py3
+</code>
+The username is the literal string ''%%$oauthtoken%%'', not your NVIDIA email and not a shell variable, which is why the scripted form above wraps it in single quotes. Memorize this.
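+The quoting point can be checked without contacting any registry. The sketch below is plain shell; the variable names ''literal'' and ''expanded'' are illustrative only and are not part of any NGC tooling.

```shell
#!/usr/bin/env bash
# Make sure no variable named oauthtoken exists, as in a fresh login shell.
unset oauthtoken

# Single quotes keep the dollar sign: this is the username NGC expects.
literal='$oauthtoken'

# Double quotes let the shell expand the (unset) variable to an empty string.
expanded="$oauthtoken"

printf 'single-quoted username: [%s]\n' "$literal"
printf 'double-quoted username: [%s]\n' "$expanded"
```

+With double quotes the username that reaches ''docker login'' would be empty, so the login fails even though the API key is valid.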
+==== 5.3 Running GPU containers ====
+<code bash>
+# Verify GPU access from a container (the smoke test)
+docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
+# GPU selection flags — know all the forms
+docker run --gpus all ...                            # all GPUs
+docker run --gpus 2 ...                              # any 2 GPUs
+docker run --gpus '"device=0,2"' ...                 # specific GPU indices
+docker run --gpus '"device=GPU-abc123..."' ...       # by UUID
+docker run --gpus '"device=MIG-d1..."' ...           # by MIG slice UUID
+# Pass through capabilities and environment
+docker run --gpus all \
+  --env NVIDIA_VISIBLE_DEVICES=all \
+  --env NVIDIA_DRIVER_CAPABILITIES=compute,utility \
+  -v /data:/workspace/data \
+  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
+  nvcr.io/nvidia/pytorch:24.10-py3
+</code>
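+''--ipc=host'' and the ''ulimit'' flags come up because multi-process PyTorch/NCCL workloads exchange data through shared memory and lock (pin) host memory: a container's default ''/dev/shm'' is only 64 MB, and the default locked-memory limit is often too low for pinned buffers.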
+==== 5.4 Troubleshooting containers ====
+<code bash>
+# Inventory
+docker ps                            # running containers
+docker ps -a                         # all containers, including stopped/exited
+docker images                        # local image cache
+docker system df                     # disk usage by images/containers/volumes
+# Inspection
+docker logs <container>              # stdout/stderr
+docker logs -f <container>           # follow
+docker logs --tail 100 <container>   # last N lines
+docker logs --since 10m <container>  # time-bounded
+docker stats                         # live CPU/mem/net/IO per container
+docker stats --no-stream <container> # one-shot snapshot
+docker inspect <container>           # full JSON config + state
+docker inspect <container> | jq '.[0].State'   # exit code, OOMKilled, etc.
+docker top <container>               # processes inside the container
+docker exec -it <container> bash     # shell into it
+# Daemon-level
+docker events                        # live event stream from the daemon
+journalctl -u docker -f              # daemon logs via systemd
+journalctl -u containerd -f          # containerd logs (k8s nodes)
+</code>
+==== 5.5 Error messages to recognize on sight ====
+^ Error                                                                  ^ Cause / fix                                                            ^
+| ''%%could not select device driver "" with capabilities: [[gpu]]%%''   | NVIDIA Container Toolkit not installed/configured, or no GPU on host    |
+| ''Failed to initialize NVML: Driver/library version mismatch''         | Host driver was updated, container has stale ''libnvidia-ml.so'' loaded |
+| ''nvidia-container-cli: initialization error: nvml error: driver/library version mismatch'' | Kernel module out of sync with userspace libraries; reboot the host |
+| ''OCI runtime create failed: ... unknown capability "nvidia"''         | Runtime not registered with Docker — rerun ''nvidia-ctk runtime configure'' |
+| Container starts but ''nvidia-smi'' says ''No devices were found''     | ''NVIDIA_VISIBLE_DEVICES'' was overridden to ''none'' or ''void''       |
+| ''docker: Error response from daemon: pull access denied for nvcr.io/...'' | Not logged into NGC, or API key expired                            |
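+For drilling, the table can be turned into a small lookup. This is a minimal sketch, not an NVIDIA tool: ''classify_gpu_error'' is a hypothetical helper, and its patterns cover only a subset of the rows above.

```shell
#!/usr/bin/env bash
# classify_gpu_error MESSAGE -> prints a one-line first guess at the cause.
# Hypothetical helper; the patterns are copied from the error table above.
classify_gpu_error() {
  local msg="$1"
  case "$msg" in
    *"could not select device driver"*)
      echo "toolkit missing or unconfigured, or no GPU on host" ;;
    *"Driver/library version mismatch"*|*"driver/library version mismatch"*)
      echo "driver and loaded libraries out of sync; reboot the host" ;;
    *"No devices were found"*)
      echo "NVIDIA_VISIBLE_DEVICES overridden to none or void" ;;
    *"pull access denied"*)
      echo "not logged in to nvcr.io, or NGC API key expired" ;;
    *)
      echo "unrecognized; check docker logs and journalctl -u docker" ;;
  esac
}

classify_gpu_error 'could not select device driver "" with capabilities: [[gpu]]'
classify_gpu_error 'Failed to initialize NVML: Driver/library version mismatch'
```

+Piping real ''docker run'' stderr into such a function is left out on purpose, so the sketch runs on a machine without Docker or a GPU.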
+==== 5.6 containerd specifics for Kubernetes ====
+When a pod can't see a GPU on a node, the Docker-level commands don't apply — you're working with containerd and ''crictl''.
+<code bash>
+# Inspect what containerd sees
+crictl info | jq '.config.containerd.runtimes'
+crictl ps -a                         # all containers, including stopped
+crictl logs <container-id>
+crictl inspect <container-id>
+# Smoke-test the runtime directly without k8s in the middle
+ctr image pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
+ctr run --rm --gpus 0 \
+  nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
+</code>
+If ''ctr run --gpus'' works but a pod doesn't, the problem is in the device plugin or k8s config — not the runtime.
+==== 5.7 Practice scenario ====
+> A user reports that a TensorFlow container they pulled from NGC starts but immediately exits with no logs when run with ''%%docker run --gpus all%%''. Diagnose.
+<code bash>
+# 1. Was there any output at all?
+docker logs <container-id>
+docker inspect <container-id> | jq '.[0].State'
+# Look at ExitCode and OOMKilled
+# 2. Was the toolkit even invoked?
+docker info | grep -i runtime
+# Should list nvidia among runtimes
+# 3. Can a minimal container see the GPU?
+docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
+# 4. If the minimal works but TF doesn't, it's app-level. Check the image's entrypoint.
+docker run --rm --gpus all -it --entrypoint bash nvcr.io/nvidia/tensorflow:<tag>
+# Inside, manually run the entrypoint to see what fails
+# 5. Common gotchas:
+#    - Missing --ipc=host (silent crash on first NCCL call)
+#    - --memory limit below model size (OOMKilled, ExitCode=137)
+#    - Mounted dataset path doesn't exist inside the container
+</code>
+----
+===== 6. Run:ai (NVIDIA Run:ai) =====
+Run:ai is the Kubernetes-native GPU orchestration platform NVIDIA acquired in 2024. Full Run:ai requires a license, so lab access is harder to come by — but the **conceptual model is highly testable**. Memorize the hierarchy, quota model, and CLI cold; you'll get scenario questions even without ever clicking through the UI.
+Note: the open-source **KAI Scheduler** is the scheduling engine extracted from Run:ai (Apache 2.0). You can install it on your BCM cluster's Kubernetes to practice the scheduling concepts (queues, fair-share, gang scheduling) without a Run:ai license.
+==== 6.1 The conceptual model ====
+The hierarchy, top to bottom:
+  - **Tenant** — top-level isolation (usually one per organization)
+  - **Cluster** — a Kubernetes cluster managed by Run:ai
+  - **Department** — group of projects, gets a slice of cluster quota
+  - **Project** — the multi-tenancy unit; users belong to projects, workloads run inside them
+  - **Workload** — the actual thing running (workspace, training, inference, etc.)
+==== 6.2 Quota model ====
+This comes up constantly on the exam.
+  * **Deserved quota** (a.k.a. //guaranteed//) — GPUs your project is guaranteed when it asks. **Cannot be preempted.**
+  * **Over-quota** — opportunistic use of idle GPUs beyond your deserved quota. **Can be preempted** when another project wants its deserved share back.
+  * **Over-quota priority** — which project gets leftovers first when multiple projects compete for over-quota.
+  * **Over-quota weight** — proportional fair-share for over-quota allocation across projects with equal priority.
+Example: Project A has deserved=4 GPUs, Project B has deserved=4 GPUs, and the cluster has 12 GPUs total. If A is using 10 (4 deserved + 6 over-quota), only 2 GPUs are idle. When B suddenly asks for its full 4, Run:ai hands over the 2 idle GPUs and preempts 2 GPUs' worth of A's over-quota workloads to give B what it's owed.
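+The reclaim arithmetic in an example of this kind can be sketched in a few lines of shell. This is a simplified model, not Run:ai's scheduler: the function name ''gpus_to_reclaim'' and its arguments are hypothetical, and over-quota priority and weight are ignored.

```shell
#!/usr/bin/env bash
# gpus_to_reclaim TOTAL IN_USE REQUEST
# Prints how many over-quota GPUs must be preempted so that a project can
# receive REQUEST GPUs of deserved quota. Simplified model: idle GPUs are
# handed out first, and only the shortfall is reclaimed.
gpus_to_reclaim() {
  local total="$1" in_use="$2" request="$3"
  local idle=$(( total - in_use ))
  local shortfall=$(( request - idle ))
  if (( shortfall < 0 )); then
    shortfall=0
  fi
  echo "$shortfall"
}

gpus_to_reclaim 12 10 4   # 2 idle, 4 requested: prints 2
gpus_to_reclaim 12 6 4    # 6 idle, 4 requested: prints 0
```

+The second call shows the case exam questions like to hide: when idle capacity covers the request, nothing is preempted.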
+==== 6.3 Workload types ====
+^ Type                       ^ Use case                                  ^ Priority    ^ Preemptible ^
+| **Workspace**              | Interactive dev (Jupyter, VSCode)         | Low         | Yes         |
+| **Training**               | Batch training jobs                       | Medium      | Yes         |
+| **Distributed Training**   | Multi-node PyTorch/TF/MPI/XGBoost         | Medium      | Yes (gang)  |
+| **Inference**              | Model serving with autoscaling            | High        | **No**      |
+**Priority order, low → high:** Workspace < Training < Inference. Higher-priority workloads preempt lower ones when GPUs are scarce.
+**Gang scheduling** for distributed training: all worker pods start together or none do. No partial-start states.
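+The preemption rule implied by the table can be written down as a tiny decision function. This is a study aid under the assumptions of this section only: ''priority_rank'' and ''can_preempt'' are hypothetical names, and the ranks simply encode the Priority and Preemptible columns above rather than Run:ai's actual scheduler logic.

```shell
#!/usr/bin/env bash
# Priority ranks as given in the table above (higher number wins).
priority_rank() {
  case "$1" in
    workspace)            echo 1 ;;
    training|distributed) echo 2 ;;
    inference)            echo 3 ;;
    *) echo "unknown workload type: $1" >&2; return 1 ;;
  esac
}

# can_preempt INCOMING RUNNING -> prints "yes" or "no".
# Inference is listed as non-preemptible, so it is never a victim.
can_preempt() {
  local incoming running
  incoming=$(priority_rank "$1") || return 1
  running=$(priority_rank "$2") || return 1
  if [ "$2" != "inference" ] && [ "$incoming" -gt "$running" ]; then
    echo yes
  else
    echo no
  fi
}

can_preempt training workspace    # prints yes
can_preempt workspace training    # prints no
can_preempt inference training    # prints yes
```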
+==== 6.4 The runai CLI ====
+<code bash>
+# Authentication and context
+runai login
+runai config project <project-name>
+runai list projects
+runai list clusters
+# Submitting — know the variants
+runai submit-workspace <name> --image jupyter/base-notebook --gpu 1
+runai submit <name> --image pytorch/pytorch --gpu 2 \
+  --command -- python train.py
+runai submit-dist pytorch <name> --workers 4 --gpu 1 \
+  --image nvcr.io/nvidia/pytorch:24.10-py3
+# Fractional GPU — three different ways
+runai submit <name> --gpu 1                  # whole GPU
+runai submit <name> --gpu-memory 4G          # fraction by memory (4 GB)
+runai submit <name> --gpu-fraction 0.5       # fraction by share (50%)
+# Targeting hardware
+runai submit <name> --node-pools "h100-pool"
+runai submit <name> --node-type "dgx-a100"
+# Inspecting
+runai list jobs
+runai describe job <name>
+runai logs <name>
+runai logs -f <name>
+runai bash <name>                            # exec into the pod
+runai port-forward <name> --port 8888
+# Lifecycle
+runai delete job <name>
+runai suspend job <name>
+runai resume job <name>
+</code>
+The difference between ''%%--gpu 1%%'', ''%%--gpu-memory 4G%%'', and ''%%--gpu-fraction 0.5%%'' is a **classic exam question** — whole GPU vs fraction by memory vs fraction by share.
+==== 6.5 Fractional GPU is NOT MIG ====
+^ Property              ^ MIG                            ^ Run:ai Fractional GPU                 ^
+| Isolation             | Hardware (memory, SMs, L2)     | Software (runtime-enforced)            |
+| Supported GPUs        | A100, A30, H100, H200, B200    | Any NVIDIA GPU                         |
+| Slice sizes           | Fixed profiles (1g.10gb, etc.) | Arbitrary (GB or fractional share)     |
+| Reconfiguration       | Requires draining the GPU      | Per-workload, no GPU reset             |
+| Memory protection     | Hardware-enforced              | Process-level via runtime              |
+If a question asks "tenant needs hardware-isolated GPU partitioning across an A100" → **MIG**. If it asks "team needs to run lots of small Jupyter notebooks on a V100/T4/L4" → **Run:ai Fractional GPU** (those GPUs don't support MIG).
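+The decision rule above fits in a short shell function. This is a sketch for memorization, assuming the MIG-capable list from the table in this section; ''mig_capable'' and ''choose_gpu_sharing'' are hypothetical names, not part of any NVIDIA or Run:ai CLI.

```shell
#!/usr/bin/env bash
# mig_capable MODEL -> exit status 0 if the model appears in the MIG row
# of the comparison table above, 1 otherwise.
mig_capable() {
  case "$1" in
    A100|A30|H100|H200|B200) return 0 ;;
    *) return 1 ;;
  esac
}

# choose_gpu_sharing MODEL NEED_HW_ISOLATION(yes|no)
# Prints the sharing mechanism the exam-style question is pointing at.
choose_gpu_sharing() {
  local model="$1" need_hw_isolation="$2"
  if [ "$need_hw_isolation" = "yes" ]; then
    if mig_capable "$model"; then
      echo "MIG"
    else
      echo "no hardware isolation available: $model does not support MIG"
    fi
  else
    echo "Run:ai fractional GPU"
  fi
}

choose_gpu_sharing A100 yes   # prints MIG
choose_gpu_sharing T4 no      # prints Run:ai fractional GPU
choose_gpu_sharing V100 yes   # prints the "does not support MIG" message
```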
+==== 6.6 Node pools ====
+Node pools segment the cluster by hardware capability.
+<code bash>
+# Typical pools in a heterogeneous cluster:
+#   a100-pool       - A100 nodes, MIG-enabled
+#   h100-pool       - H100 nodes
+#   l40-pool        - inference-grade GPUs
+#   cpu-pool        - CPU-only nodes for data prep
+# Projects are granted access to specific pools, with separate quotas per pool.
+# Workloads target pools at submission time.
+runai submit train-large --node-pools "h100-pool" --gpu 8
+runai submit prep-data --node-pools "cpu-pool" --cpu 16
+</code>
+==== 6.7 Architecture ====
+  * **Run:ai Control Plane** — the management layer. Can be **SaaS** (hosted by NVIDIA) or **self-hosted**. One Control Plane can manage many clusters.
+  * **Run:ai Cluster components** — installed via Helm onto each managed Kubernetes cluster:
+    * The scheduler (now based on KAI Scheduler)
+    * The Run:ai operator
+    * The fractional GPU runtime
+    * Monitoring agents
+    * Workload controllers (one per workload type)
+Integrations to recognize: SSO via SAML/OIDC, LDAP, Kubeflow, MLflow, Argo Workflows, Ray, OpenShift.
+==== 6.8 KAI Scheduler — the open-source path ====
+Since full Run:ai requires a license, the open-source **KAI Scheduler** is the practical way to drill the scheduling concepts on your own cluster.
+<code bash>
+# Install on an existing Kubernetes cluster
+helm repo add nvidia-kai-scheduler https://nvidia.github.io/KAI-Scheduler
+helm install kai-scheduler nvidia-kai-scheduler/kai-scheduler \
+  -n kai-scheduler --create-namespace
+# Verify
+kubectl get pods -n kai-scheduler
+</code>
+What you can practice with KAI alone (no Run:ai license needed):
+  * Queue hierarchies and weighted fair-share
+  * Gang scheduling for distributed pods
+  * Pod priority and preemption
+  * Resource quotas across queues
+  * Even without GPUs — use CPU-only pods to practice the scheduler logic
+==== 6.9 Practice scenario ====
+> A data science team complains their workspaces keep getting killed at 3 AM when training jobs run. Diagnose.
+This is **expected behavior**, not a bug. Workspaces are low-priority and preemptible; training jobs are higher priority. When a training job submitted by another project wants GPUs that the data science project is using only as over-quota, Run:ai preempts the workspace. Options:
+  - Increase the data science project's **deserved quota** so the workspaces are no longer in over-quota territory.
+  - Move long-running data science work into **training-type workloads** (higher priority than workspaces).
+  - Add **node pool** segregation — a dedicated pool for interactive work that batch training jobs can't preempt into.
+  - Schedule batch training jobs to a **different time window** via job priority/scheduling rules.
 ----
@@ Line 691: / Line 1000: @@
   * **Slurm reason codes** — ''Resources'', ''Priority'', ''ReqNodeNotAvail'', ''AssocGrpGRESLimit'', ''QOSMaxGRESPerUser'', ''ReqGresTypeNotAvail''.
   * **NCCL env vars** — ''NCCL_DEBUG'', ''NCCL_IB_HCA'', ''NCCL_SOCKET_IFNAME''.
+  * **NGC auth — username is literally ''%%$oauthtoken%%''**, password is the NGC API key. This trips people up every exam cycle.
+  * **Docker ''%%--gpus%%'' flag variants** — ''all'', ''2'', ''%%"device=0,2"%%'', by UUID, by MIG UUID.
+  * **The "could not select device driver" error** = NVIDIA Container Toolkit missing or no GPU on host.
+  * **Run:ai quota model** — //deserved// is guaranteed, //over-quota// is preemptible. Inference > Training > Workspace in priority.
+  * **Fractional GPU ≠ MIG** — Fractional is software-enforced and works on any NVIDIA GPU; MIG is hardware-enforced and only on A100/H100/etc.
+  * **runai submit flags** — ''%%--gpu 1%%'' (whole), ''%%--gpu-memory 4G%%'' (fraction by memory), ''%%--gpu-fraction 0.5%%'' (fraction by share).
 Good luck — build the labs, break them on purpose, fix them, repeat.
