This shows you the differences between two versions of the page.
| wiki:ai:ncp-aio-lab-training [2026/05/18 14:19] – ddehamer | wiki:ai:ncp-aio-lab-training [2026/05/18 16:32] (current) – ddehamer | ||
|---|---|---|---|
| Line 678: | Line 678: | ||
| | Multi-node training slow | '' | | Multi-node training slow | '' | ||
| | ECC errors / Xid 48/ | | ECC errors / Xid 48/ | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ===== 5. Docker & Container Runtimes ===== | ||
| + | |||
| + | Docker isn't a standalone domain on the NCP-AIO blueprint, but it's woven through Installation/ | ||
| + | |||
| + | ==== 5.1 The install order — memorize it ==== | ||
| + | |||
| + | - **NVIDIA driver** (host kernel module) | ||
| + | - **Docker** (or containerd for k8s nodes) | ||
| + | - **NVIDIA Container Toolkit** (provides the '' | ||
| + | - **Configure the runtime** so Docker or containerd registers the toolkit's runtime, then restart the daemon | ||
| + | |||
| + | <code bash> | ||
| + | # Configure Docker to use the nvidia runtime | ||
| + | sudo nvidia-ctk runtime configure --runtime=docker | ||
| + | sudo systemctl restart docker | ||
| + | |||
| + | # Configure containerd (for Kubernetes nodes) | ||
| + | sudo nvidia-ctk runtime configure --runtime=containerd | ||
| + | sudo systemctl restart containerd | ||
| + | |||
| + | # Verify the runtime is wired in | ||
| + | docker info | grep -i runtime | ||
| + | cat / | ||
| + | </ | ||
| + | |||
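The install order can be sanity-checked with a short preflight. This is a minimal sketch, assuming each layer exposes its usual command-line tool (''nvidia-smi'' for the driver, ''docker'' for the engine, ''nvidia-ctk'' for the Container Toolkit). The function name ''check_gpu_stack'' is made up for this example, and the check only confirms that each tool is on ''PATH'', not that it is configured correctly.

```shell
#!/usr/bin/env bash
# Preflight sketch: walk the stack in install order and report the first gap.
# check_gpu_stack is a hypothetical helper, not part of any NVIDIA tool.
check_gpu_stack() {
  local tool first_gap=""
  for tool in nvidia-smi docker nvidia-ctk; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
      # Remember only the earliest missing layer; fix that one first.
      [ -z "$first_gap" ] && first_gap="$tool"
    fi
  done
  if [ -n "$first_gap" ]; then
    echo "first gap: $first_gap"
    return 1
  fi
  echo "stack complete"
}

check_gpu_stack || true
```

If the last line names a gap, install that layer before touching anything later in the order.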
| + | ==== 5.2 NGC registry authentication ==== | ||
| + | |||
| + | Pulling NVIDIA-curated containers (PyTorch, TensorFlow, CUDA, RAPIDS, NeMo, Triton) from '' | ||
| + | |||
| + | <code bash> | ||
| + | docker login nvcr.io | ||
| + | # Username: $oauthtoken | ||
| + | # Password: <your NGC API key> | ||
| + | |||
| + | # Or non-interactive (for scripts / cmsh image builds) | ||
| + | echo " | ||
| + | |||
| + | # Then pull | ||
| + | docker pull nvcr.io/ | ||
| + | docker pull nvcr.io/ | ||
| + | docker pull nvcr.io/ | ||
| + | </ | ||
| + | |||
| + | The username is literally '' | ||
| + | |||
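The literal-username detail is easy to get wrong in scripts, because an unquoted or double-quoted ''$oauthtoken'' is expanded by the shell to an empty string. The sketch below shows one way to guard against that. It assumes the API key lives in an environment variable called ''NGC_API_KEY'' (a name chosen for this example); ''ngc_login'', ''fake_docker'' and the ''DOCKER'' override are likewise hypothetical and exist only so the command can be dry-run without a daemon.

```shell
#!/usr/bin/env bash
# Non-interactive NGC login sketch.
# Assumptions: the API key is in NGC_API_KEY (hypothetical variable name);
# DOCKER may be overridden with a stand-in command for a dry run.
ngc_login() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  # Single quotes keep $oauthtoken literal; the shell must not expand it.
  printf '%s' "$NGC_API_KEY" |
    "${DOCKER:-docker}" login nvcr.io --username '$oauthtoken' --password-stdin
}

# Stand-in for docker: swallow the key from stdin and print the arguments.
fake_docker() {
  cat >/dev/null
  printf '%s\n' "$*"
}

# Dry run: shows the exact arguments docker would receive.
NGC_API_KEY=example-key DOCKER=fake_docker ngc_login
```

Piping the key through ''--password-stdin'' keeps it out of the process list and shell history, which is why it is the usual choice for automated image builds.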
| + | ==== 5.3 Running GPU containers ==== | ||
| + | |||
| + | <code bash> | ||
| + | # Verify GPU access from a container (the smoke test) | ||
| + | docker run --rm --gpus all nvcr.io/ | ||
| + | |||
| + | # GPU selection flags — know all the forms | ||
| + | docker run --gpus all ... # all GPUs | ||
| + | docker run --gpus 2 ... # any 2 GPUs | ||
| + | docker run --gpus '" | ||
| + | docker run --gpus '" | ||
| + | docker run --gpus '" | ||
| + | |||
| + | # Pass through capabilities and environment | ||
| + | docker run --gpus all \ | ||
| + | --env NVIDIA_VISIBLE_DEVICES=all \ | ||
| + | --env NVIDIA_DRIVER_CAPABILITIES=compute, | ||
| + | -v / | ||
| + | --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \ | ||
| + | nvcr.io/ | ||
| + | </ | ||
| + | |||
| + | The '' | ||
| + | |||
| + | ==== 5.4 Troubleshooting containers ==== | ||
| + | |||
| + | <code bash> | ||
| + | # Inventory | ||
| + | docker ps # running containers | ||
| + | docker ps -a # all containers, including stopped/ | ||
| + | docker images | ||
| + | docker system df # disk usage by images/ | ||
| + | |||
| + | # Inspection | ||
| + | docker logs < | ||
| + | docker logs -f < | ||
| + | docker logs --tail 100 < | ||
| + | docker logs --since 10m < | ||
| + | docker stats # live CPU/ | ||
| + | docker stats --no-stream < | ||
| + | docker inspect < | ||
| + | docker inspect < | ||
| + | docker top < | ||
| + | docker exec -it < | ||
| + | |||
| + | # Daemon-level | ||
| + | docker events | ||
| + | journalctl -u docker -f # daemon logs via systemd | ||
| + | journalctl -u containerd -f # containerd logs (k8s nodes) | ||
| + | </ | ||
| + | |||
| + | ==== 5.5 Error messages to recognize on sight ==== | ||
| + | |||
| + | ^ Error ^ Cause ^ | ||
| + | | '' | ||
| + | | '' | ||
| + | | '' | ||
| + | | '' | ||
| + | | Container starts but '' | ||
| + | | '' | ||
| + | |||
| + | ==== 5.6 containerd specifics for Kubernetes ==== | ||
| + | |||
| + | When a pod can't see a GPU on a node, the Docker-level commands don't apply — you're working with containerd and '' | ||
| + | |||
| + | <code bash> | ||
| + | # Inspect what containerd sees | ||
| + | crictl info | jq ' | ||
| + | crictl ps -a # all containers, including stopped | ||
| + | crictl logs < | ||
| + | crictl inspect < | ||
| + | |||
| + | # Smoke-test the runtime directly without k8s in the middle | ||
| + | ctr image pull nvcr.io/ | ||
| + | ctr run --rm --gpus 0 \ | ||
| + | nvcr.io/ | ||
| + | </ | ||
| + | |||
| + | If '' | ||
| + | |||
| + | ==== 5.7 Practice scenario ==== | ||
| + | |||
| + | > A user reports that a TensorFlow container they pulled from NGC starts but immediately exits with no logs when run with '' | ||
| + | |||
| + | <code bash> | ||
| + | # 1. Was there any output at all? | ||
| + | docker logs < | ||
| + | docker inspect < | ||
| + | # Look at ExitCode and OOMKilled | ||
| + | |||
| + | # 2. Is the nvidia runtime registered with Docker? | ||
| + | docker info | grep -i runtime | ||
| + | # Should list nvidia among runtimes | ||
| + | |||
| + | # 3. Can a minimal container see the GPU? | ||
| + | docker run --rm --gpus all nvcr.io/ | ||
| + | |||
| + | # 4. If the minimal works but TF doesn' | ||
| + | docker run --rm --gpus all -it --entrypoint bash nvcr.io/ | ||
| + | # Inside, manually run the entrypoint to see what fails | ||
| + | |||
| + | # 5. Common gotchas: | ||
| + | # - Missing --ipc=host (silent crash on first NCCL call) | ||
| + | # - --memory limit below model size (OOMKilled, ExitCode=137) | ||
| + | # - Mounted dataset path doesn' | ||
| + | </ | ||
| + | |||
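Step 1 of the scenario hinges on reading ''ExitCode'' and ''OOMKilled'' correctly. As a rough aid, the sketch below maps the common values to a likely cause. ''explain_exit'' is a hypothetical helper; the 125/126/127 meanings follow the ''docker run'' exit-status convention, and codes above 128 follow the shell convention of 128 plus the signal number.

```shell
#!/usr/bin/env bash
# Exit-code triage sketch. explain_exit is a hypothetical helper.
# Usage: explain_exit EXIT_CODE [OOM_KILLED]
# Both values can be read from a stopped container with:
#   docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' <container>
explain_exit() {
  local code="$1" oom="${2:-false}"
  if [ "$oom" = "true" ]; then
    echo "OOMKilled: raise the --memory limit or shrink the job"
    return
  fi
  case "$code" in
    0)   echo "clean exit: the main process finished on its own" ;;
    125) echo "docker run itself failed (bad flag or daemon error)" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command not found" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # 128 + signal number: 137 is SIGKILL, 139 is SIGSEGV.
        echo "terminated by signal $((code - 128))"
      else
        echo "application error: read docker logs"
      fi ;;
  esac
}

explain_exit 137 true
explain_exit 139
```

A 137 with ''OOMKilled'' false usually points at an external ''docker kill'' or a host-level kill rather than the container's own memory limit.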
| + | ---- | ||
| + | |||
| + | ===== 6. Run:ai (NVIDIA Run:ai) ===== | ||
| + | |||
| + | Run:ai is the Kubernetes-native GPU orchestration platform NVIDIA acquired in 2024. Full Run:ai requires a license, so lab access is harder to come by — but the **conceptual model is highly testable**. Memorize the hierarchy, quota model, and CLI cold; you'll get scenario questions even without ever clicking through the UI. | ||
| + | |||
| + | Note: the open-source **KAI Scheduler** is the scheduling engine extracted from Run:ai (Apache 2.0). You can install it on your BCM cluster' | ||
| + | |||
| + | ==== 6.1 The conceptual model ==== | ||
| + | |||
| + | The hierarchy, top to bottom: | ||
| + | |||
| + | - **Tenant** — top-level isolation (usually one per organization) | ||
| + | - **Cluster** — a Kubernetes cluster managed by Run:ai | ||
| + | - **Department** — group of projects, gets a slice of cluster quota | ||
| + | - **Project** — the multi-tenancy unit; users belong to projects, workloads run inside them | ||
| + | - **Workload** — the actual thing running (workspace, training, inference, etc.) | ||
| + | |||
| + | ==== 6.2 Quota model ==== | ||
| + | |||
| + | This comes up constantly on the exam. | ||
| + | |||
| + | * **Deserved quota** (a.k.a. // | ||
| + | * **Over-quota** — opportunistic use of idle GPUs beyond your deserved quota. **Can be preempted** when another project wants its deserved share back. | ||
| + | * **Over-quota priority** — which project gets leftovers first when multiple projects compete for over-quota. | ||
| + | * **Over-quota weight** — proportional fair-share for over-quota allocation across projects with equal priority. | ||
| + | |||
| + | Example: Project A has deserved=4 GPUs, Project B has deserved=4 GPUs, and the cluster has 12 GPUs total. If A is using 10 (4 deserved + 6 over-quota), only 2 GPUs are idle. When B suddenly asks for its full 4, Run:ai preempts 2 GPUs' worth of A's over-quota workloads to give B what it's owed. | ||
| + | |||
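The preemption decision comes down to comparing a request against idle capacity. The sketch below works through that arithmetic under a deliberately simplified model: the requester stays within its deserved quota, and another project holds at least enough over-quota GPUs to cover any shortfall. ''gpus_to_reclaim'' and the numbers are hypothetical; this is not the actual scheduler logic.

```shell
#!/usr/bin/env bash
# Simplified preemption arithmetic; not the real Run:ai scheduler.
# Usage: gpus_to_reclaim TOTAL_GPUS GPUS_IN_USE GPUS_REQUESTED
# Prints how many over-quota GPUs must be preempted to satisfy the request.
gpus_to_reclaim() {
  local total="$1" in_use="$2" request="$3"
  local idle=$(( total - in_use ))
  local shortfall=$(( request - idle ))
  # Idle capacity covers the request: nothing needs to be preempted.
  if [ "$shortfall" -lt 0 ]; then
    shortfall=0
  fi
  echo "$shortfall"
}

gpus_to_reclaim 8 7 3   # 1 idle, 3 requested: prints 2
gpus_to_reclaim 8 4 3   # 4 idle, 3 requested: prints 0
```

The second call is the case to remember: if enough GPUs are idle, a deserved-quota request is satisfied without preempting anyone.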
| + | ==== 6.3 Workload types ==== | ||
| + | |||
| + | ^ Type ^ Use case ^ Priority | ||
| + | | **Workspace** | ||
| + | | **Training** | ||
| + | | **Distributed Training** | ||
| + | | **Inference** | ||
| + | |||
| + | **Priority order, low → high:** Workspace < Training < Inference. Higher-priority workloads preempt lower ones when GPUs are scarce. | ||
| + | |||
| + | **Gang scheduling** for distributed training: all worker pods start together or none do. No partial-start states. | ||
| + | |||
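Gang scheduling is an all-or-nothing check, which a few lines of arithmetic make concrete. This is a minimal sketch that assumes every worker needs the same number of GPUs and that only idle GPUs count; ''gang_decision'' and its arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Gang scheduling sketch: either every worker is placed or none is.
# Usage: gang_decision WORKERS GPUS_PER_WORKER IDLE_GPUS
gang_decision() {
  local workers="$1" gpus_per_worker="$2" idle="$3"
  local needed=$(( workers * gpus_per_worker ))
  if [ "$needed" -le "$idle" ]; then
    echo "start all $workers workers"
  else
    # A partial start would hold GPUs while waiting for the rest.
    echo "start none (need $needed GPUs, $idle idle)"
  fi
}

gang_decision 4 2 8   # prints: start all 4 workers
gang_decision 4 2 6   # prints: start none (need 8 GPUs, 6 idle)
```

Without the all-or-nothing rule, the second job would start three workers and sit on six GPUs while the fourth worker waits.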
| + | ==== 6.4 The runai CLI ==== | ||
| + | |||
| + | <code bash> | ||
| + | # Authentication and context | ||
| + | runai login | ||
| + | runai config project < | ||
| + | runai list projects | ||
| + | runai list clusters | ||
| + | |||
| + | # Submitting — know the variants | ||
| + | runai submit-workspace < | ||
| + | runai submit < | ||
| + | --command -- python train.py | ||
| + | runai submit-dist pytorch < | ||
| + | --image nvcr.io/ | ||
| + | |||
| + | # Fractional GPU — three different ways | ||
| + | runai submit < | ||
| + | runai submit < | ||
| + | runai submit < | ||
| + | |||
| + | # Targeting hardware | ||
| + | runai submit < | ||
| + | runai submit < | ||
| + | |||
| + | # Inspecting | ||
| + | runai list jobs | ||
| + | runai describe job < | ||
| + | runai logs < | ||
| + | runai logs -f < | ||
| + | runai bash < | ||
| + | runai port-forward < | ||
| + | |||
| + | # Lifecycle | ||
| + | runai delete job < | ||
| + | runai suspend job < | ||
| + | runai resume job < | ||
| + | </ | ||
| + | |||
| + | The difference between '' | ||
| + | |||
| + | ==== 6.5 Fractional GPU is NOT MIG ==== | ||
| + | |||
| + | ^ Property | ||
| + | | Isolation | ||
| + | | Supported GPUs | A100, A30, H100, H200, B200 | Any NVIDIA GPU | | ||
| + | | Slice sizes | Fixed profiles (1g.10gb, etc.) | Arbitrary (GB or fractional share) | ||
| + | | Reconfiguration | ||
| + | | Memory protection | ||
| + | |||
| + | If a question asks " | ||
| + | |||
| + | ==== 6.6 Node pools ==== | ||
| + | |||
| + | Node pools segment the cluster by hardware capability. | ||
| + | |||
| + | <code bash> | ||
| + | # Typical pools in a heterogeneous cluster: | ||
| + | # | ||
| + | # | ||
| + | # | ||
| + | # | ||
| + | |||
| + | # Projects are granted access to specific pools, with separate quotas per pool. | ||
| + | # Workloads target pools at submission time. | ||
| + | runai submit train-large --node-pools " | ||
| + | runai submit prep-data --node-pools " | ||
| + | </ | ||
| + | |||
| + | ==== 6.7 Architecture ==== | ||
| + | |||
| + | * **Run:ai Control Plane** — the management layer. Can be **SaaS** (hosted by NVIDIA) or **self-hosted**. One Control Plane can manage many clusters. | ||
| + | * **Run:ai Cluster components** — installed via Helm onto each managed Kubernetes cluster: | ||
| + | * The scheduler (now based on KAI Scheduler) | ||
| + | * The Run:ai operator | ||
| + | * The fractional GPU runtime | ||
| + | * Monitoring agents | ||
| + | * Workload controllers (one per workload type) | ||
| + | |||
| + | Integrations to recognize: SSO via SAML/OIDC, LDAP, Kubeflow, MLflow, Argo Workflows, Ray, OpenShift. | ||
| + | |||
| + | ==== 6.8 KAI Scheduler — the open-source path ==== | ||
| + | |||
| + | Since full Run:ai requires a license, the open-source **KAI Scheduler** is the practical way to drill the scheduling concepts on your own cluster. | ||
| + | |||
| + | <code bash> | ||
| + | # Install on an existing Kubernetes cluster | ||
| + | helm repo add nvidia-kai-scheduler https:// | ||
| + | helm install kai-scheduler nvidia-kai-scheduler/ | ||
| + | -n kai-scheduler --create-namespace | ||
| + | |||
| + | # Verify | ||
| + | kubectl get pods -n kai-scheduler | ||
| + | </ | ||
| + | |||
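Scanning the pod list by eye gets tedious when you reinstall repeatedly. The sketch below filters ''kubectl get pods --no-headers'' output down to the pods that still need attention. ''not_ready_pods'' is a hypothetical helper, and the pod names in the sample input are invented so the example runs without a cluster.

```shell
#!/usr/bin/env bash
# Readiness filter sketch. not_ready_pods is a hypothetical helper.
# Input:  lines in `kubectl get pods --no-headers` format
#         (NAME READY STATUS RESTARTS AGE).
# Output: names of pods that are neither fully ready nor Completed.
not_ready_pods() {
  awk '$3 != "Completed" {
         split($2, ready, "/")
         if (ready[1] != ready[2] || $3 != "Running") print $1
       }'
}

# Sample input with made-up pod names; prints: binder-xyz
printf '%s\n' \
  'scheduler-abc 1/1 Running 0 5m' \
  'binder-xyz 0/1 CrashLoopBackOff 3 5m' | not_ready_pods
```

Against a live cluster the same filter would be fed with ''kubectl get pods -n kai-scheduler --no-headers | not_ready_pods''; empty output means every pod is up.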
| + | What you can practice with KAI alone (no Run:ai license needed): | ||
| + | |||
| + | * Queue hierarchies and weighted fair-share | ||
| + | * Gang scheduling for distributed pods | ||
| + | * Pod priority and preemption | ||
| + | * Resource quotas across queues | ||
| + | * Scheduler logic on CPU-only pods, if you have no GPUs available | ||
| + | |||
| + | ==== 6.9 Practice scenario ==== | ||
| + | |||
| + | > A data science team complains their workspaces keep getting killed at 3 AM when training jobs run. Diagnose. | ||
| + | |||
| + | This is **expected behavior**, not a bug. Workspaces are low-priority and preemptible; | ||
| + | |||
| + | - Increase the data science project' | ||
| + | - Move long-running data science work into **training-type workloads** (higher priority than workspaces). | ||
| + | - Add **node pool** segregation — a dedicated pool for interactive work that batch training jobs can't preempt into. | ||
| + | - Schedule batch training jobs to a **different time window** via job priority/ | ||
| ---- | ---- | ||
| Line 691: | Line 1000: | ||
| * **Slurm reason codes** — '' | * **Slurm reason codes** — '' | ||
| * **NCCL env vars** — '' | * **NCCL env vars** — '' | ||
| + | * **NGC auth — username is literally '' | ||
| + | * **Docker '' | ||
| + | * **The "could not select device driver" | ||
| + | * **Run:ai quota model** — // | ||
| + | * **Fractional GPU ≠ MIG** — Fractional is software-enforced and works on any NVIDIA GPU; MIG is hardware-enforced and only on A100/ | ||
| + | * **runai submit flags** — '' | ||
| Good luck — build the labs, break them on purpose, fix them, repeat. | Good luck — build the labs, break them on purpose, fix them, repeat. | ||