wiki:ai:ncp-aio-lab-training

Differences

This shows you the differences between two versions of the page.

wiki:ai:ncp-aio-lab-training [2026/05/18 16:08] – [1.1 Getting into BCM] ddehamer
wiki:ai:ncp-aio-lab-training [2026/05/18 16:32] (current) ddehamer
Line 19: Line 19:
 [head]% help [head]% help
 [head]% main             # return to the top-level prompt [head]% main             # return to the top-level prompt
-[head->main]% licenseinfo +[head]% status           # overall cluster status (services, license, HA) 
-[head->main]% partition use base+[head]% partition use base
 </code> </code>
  
Line 678: Line 678:
 | Multi-node training slow               | ''NCCL_DEBUG=INFO''                      | IB link state, topo, GPU<->NIC                  | | Multi-node training slow               | ''NCCL_DEBUG=INFO''                      | IB link state, topo, GPU<->NIC                  |
 | ECC errors / Xid 48/63/64              | ''nvidia-smi -q -d ECC''                 | Drain node, page retirement                     | | ECC errors / Xid 48/63/64              | ''nvidia-smi -q -d ECC''                 | Drain node, page retirement                     |
 +
 +----
 +
 +===== 5. Docker & Container Runtimes =====
 +
 +Docker isn't a standalone domain on the NCP-AIO blueprint, but it's woven through Installation/Deployment and Troubleshooting. The exam treats Docker as the **substrate** — what gets tested is "does this container have GPU access?" and "how do I diagnose it when something's wrong?" Know the runtime layer for standalone GPU servers (Docker) and for Kubernetes nodes (containerd).
 +
 +==== 5.1 The install order — memorize it ====
 +
 +  - **NVIDIA driver** (host kernel module)
 +  - **Docker** (or containerd for k8s nodes)
 +  - **NVIDIA Container Toolkit** (provides the ''nvidia'' runtime shim)
 +  - **Configure the runtime** so Docker or containerd registers the ''nvidia'' runtime
 +
 +<code bash>
 +# Configure Docker to use the nvidia runtime
 +sudo nvidia-ctk runtime configure --runtime=docker
 +sudo systemctl restart docker
 +
 +# Configure containerd (for Kubernetes nodes)
 +sudo nvidia-ctk runtime configure --runtime=containerd
 +sudo systemctl restart containerd
 +
 +# Verify the runtime is wired in
 +docker info | grep -i runtime
 +grep -A5 nvidia /etc/containerd/config.toml
 +</code>
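 +
 +The Docker-side check above can be folded into a small helper for scripts. This is a minimal sketch, assuming ''docker info'' prints a ''Runtimes:'' line listing the registered runtimes; the function name ''has_nvidia_runtime'' is hypothetical and not part of any NVIDIA tooling.
 +
 +<code bash>
 +# Reads `docker info` output on stdin.
 +# Succeeds only if the "Runtimes:" line lists "nvidia" as a whole word.
 +has_nvidia_runtime() {
 +  grep -E '^[[:space:]]*Runtimes:' | grep -qw 'nvidia'
 +}
 +
 +# Usage on a real host:
 +#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
 +</code>
 +
 +Matching only the ''Runtimes:'' line avoids a false positive from other lines that merely contain the word "runtime".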
 +
 +==== 5.2 NGC registry authentication ====
 +
 +Pulling NVIDIA-curated containers (PyTorch, TensorFlow, CUDA, RAPIDS, NeMo, Triton) from ''nvcr.io'' requires auth. **This is the classic trip-up question on the exam.**
 +
 +<code bash>
 +docker login nvcr.io
 +# Username: $oauthtoken           <-- LITERAL string, with the dollar sign
 +# Password: <your NGC API key>    <-- from ngc.nvidia.com
 +
 +# Or non-interactive (for scripts / cmsh image builds)
 +echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
 +
 +# Then pull
 +docker pull nvcr.io/nvidia/pytorch:24.10-py3
 +docker pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
 +docker pull nvcr.io/nvidia/tensorrt:24.10-py3
 +</code>
 +
 +The username is literally ''%%$oauthtoken%%'' as a string — not your NVIDIA email, not a shell variable. Memorize this.
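 +
 +The quoting matters because a double-quoted or unquoted ''%%$oauthtoken%%'' is expanded by the shell, normally to an empty string. A minimal sketch of the difference; the helper name ''ngc_username'' is hypothetical and exists only for illustration.
 +
 +<code bash>
 +# Single quotes keep the dollar sign: the shell performs no expansion.
 +ngc_username() {
 +  printf '%s' '$oauthtoken'
 +}
 +
 +# Double quotes would expand a (normally unset) variable named oauthtoken.
 +unset oauthtoken
 +expanded="${oauthtoken:-}"     # empty: this is what docker login would receive
 +
 +echo "single-quoted: $(ngc_username)"
 +echo "double-quoted: '${expanded}'"
 +</code>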
 +
 +==== 5.3 Running GPU containers ====
 +
 +<code bash>
 +# Verify GPU access from a container (the smoke test)
 +docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
 +
 +# GPU selection flags — know all the forms
 +docker run --gpus all ...                            # all GPUs
 +docker run --gpus 2 ...                              # any 2 GPUs
 +docker run --gpus '"device=0,2"' ...                 # specific GPU indices
 +docker run --gpus '"device=GPU-abc123..."' ...       # by UUID
 +docker run --gpus '"device=MIG-d1..."' ...           # by MIG slice UUID
 +
 +# Pass through capabilities and environment
 +docker run --gpus all \
 +  --env NVIDIA_VISIBLE_DEVICES=all \
 +  --env NVIDIA_DRIVER_CAPABILITIES=compute,utility \
 +  -v /data:/workspace/data \
 +  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
 +  nvcr.io/nvidia/pytorch:24.10-py3
 +</code>
 +
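 +The ''--ipc=host'' and ''ulimit'' flags come up because PyTorch data-loader workers and NCCL exchange data through shared memory, and Docker's default ''/dev/shm'' is only 64 MB. RDMA transports also pin memory, which the default ''memlock'' limit would otherwise cap.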
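 +
 +To see whether these flags took effect, the shared-memory size and the locked-memory limit can be read from inside the container. A minimal sketch, assuming a POSIX ''df'' and ''awk'' are available in the image; the helper name ''fs_size_kb'' is hypothetical.
 +
 +<code bash>
 +# Print the size of the filesystem backing a path, in KiB
 +# (second column of POSIX-format df output).
 +fs_size_kb() {
 +  df -Pk "$1" | awk 'NR==2 {print $2}'
 +}
 +
 +# Run inside the container:
 +#   fs_size_kb /dev/shm   # Docker's default shm size is 64 MiB unless
 +#                         # --ipc=host or --shm-size is given
 +#   ulimit -l             # "unlimited" when --ulimit memlock=-1 is in effect
 +</code>
 +
 +Where sharing the host IPC namespace is not acceptable, ''%%--shm-size%%'' raises the limit for a single container instead.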
 +
 +==== 5.4 Troubleshooting containers ====
 +
 +<code bash>
 +# Inventory
 +docker ps                            # running containers
 +docker ps -a                         # all containers, including stopped/exited
 +docker images                        # local image cache
 +docker system df                     # disk usage by images/containers/volumes
 +
 +# Inspection
 +docker logs <container>              # stdout/stderr
 +docker logs -f <container>           # follow
 +docker logs --tail 100 <container>   # last N lines
 +docker logs --since 10m <container>  # time-bounded
 +docker stats                         # live CPU/mem/net/IO per container
 +docker stats --no-stream <container> # one-shot snapshot
 +docker inspect <container>           # full JSON config + state
 +docker inspect <container> | jq '.[0].State'   # exit code, OOMKilled, etc.
 +docker top <container>               # processes inside the container
 +docker exec -it <container> bash     # shell into it
 +
 +# Daemon-level
 +docker events                        # live event stream from the daemon
 +journalctl -u docker -f              # daemon logs via systemd
 +journalctl -u containerd -f          # containerd logs (k8s nodes)
 +</code>
 +
 +==== 5.5 Error messages to recognize on sight ====
 +
 +^ Error                                                                  ^ Cause                                                                  ^
 +| ''%%could not select device driver "" with capabilities: [[gpu]]%%''   | NVIDIA Container Toolkit not installed/configured, or no GPU on host    |
 +| ''Failed to initialize NVML: Driver/library version mismatch''         | Host driver was updated, container has stale ''libnvidia-ml.so'' loaded |
 +| ''nvidia-container-cli: initialization error: nvml error: driver/library version mismatch'' | Loaded kernel module is out of sync with the userspace driver libraries; reboot the host or reload the NVIDIA modules |
 +| ''OCI runtime create failed: ... unknown capability "nvidia"''         | Runtime not registered with Docker — rerun ''nvidia-ctk runtime configure'' |
 +| Container starts but ''nvidia-smi'' says ''No devices were found''     | ''NVIDIA_VISIBLE_DEVICES'' was overridden to ''none'' or ''void''       |
 +| ''docker: Error response from daemon: pull access denied for nvcr.io/...'' | Not logged into NGC, or API key expired                            |
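 +
 +For drilling, the table above can be turned into a lookup. This is a sketch only: the function name ''diagnose_gpu_error'' is hypothetical, and the substrings are taken from the error column as written here, so real messages may differ in detail between Docker and toolkit versions.
 +
 +<code bash>
 +# Map a GPU-container error message to the likely cause from the table.
 +diagnose_gpu_error() {
 +  case "$1" in
 +    *"could not select device driver"*)       echo "toolkit missing or no GPU on host" ;;
 +    *[Dd]"river/library version mismatch"*)   echo "driver updated; reboot or reload modules" ;;
 +    *"No devices were found"*)                echo "NVIDIA_VISIBLE_DEVICES is none or void" ;;
 +    *"pull access denied"*)                   echo "not logged into NGC or API key expired" ;;
 +    *)                                        echo "unrecognized; check docker logs and journalctl" ;;
 +  esac
 +}
 +
 +diagnose_gpu_error 'Failed to initialize NVML: Driver/library version mismatch'
 +</code>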
 +
 +==== 5.6 containerd specifics for Kubernetes ====
 +
 +When a pod can't see a GPU on a node, the Docker-level commands don't apply — you're working with containerd and ''crictl''.
 +
 +<code bash>
 +# Inspect what containerd sees
 +crictl info | jq '.config.containerd.runtimes'
 +crictl ps -a                         # all containers, including stopped
 +crictl logs <container-id>
 +crictl inspect <container-id>
 +
 +# Smoke-test the runtime directly without k8s in the middle
 +ctr image pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
 +ctr run --rm --gpus 0 \
 +  nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 smoke nvidia-smi
 +</code>
 +
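 +If ''ctr run --gpus'' works but a pod doesn't, the driver and toolkit on the node are fine. Look next at the device plugin, the pod's runtime class, or containerd's CRI runtime configuration, since ''ctr'' does not go through the CRI path that Kubernetes uses.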
 +
 +==== 5.7 Practice scenario ====
 +
 +> A user reports that a TensorFlow container they pulled from NGC starts but immediately exits with no logs when run with ''%%docker run --gpus all%%''. Diagnose.
 +
 +<code bash>
 +# 1. Was there any output at all?
 +docker logs <container-id>
 +docker inspect <container-id> | jq '.[0].State'
 +# Look at ExitCode and OOMKilled
 +
 +# 2. Was the toolkit even invoked?
 +docker info | grep -i runtime
 +# Should list nvidia among runtimes
 +
 +# 3. Can a minimal container see the GPU?
 +docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
 +
 +# 4. If the minimal works but TF doesn't, it's app-level. Check the image's entrypoint.
 +docker run --rm --gpus all -it --entrypoint bash nvcr.io/nvidia/tensorflow:<tag>
 +# Inside, manually run the entrypoint to see what fails
 +
 +# 5. Common gotchas:
 +#    - Missing --ipc=host (silent crash on first NCCL call)
 +#    - --memory limit below model size (OOMKilled, ExitCode=137)
 +#    - Mounted dataset path doesn't exist inside the container
 +</code>
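 +
 +Step 1 hinges on reading the exit code correctly. The helper below is a minimal sketch; the name ''explain_exit_code'' is hypothetical, and the mapping covers only the codes most likely to appear in this scenario (codes above 128 are 128 plus the signal number).
 +
 +<code bash>
 +# Translate a container exit code into a first diagnostic hint.
 +explain_exit_code() {
 +  case "$1" in
 +    0)   echo "clean exit: the entrypoint finished and nothing kept the container alive" ;;
 +    125) echo "docker run itself failed (bad flag or daemon error)" ;;
 +    126) echo "command found but not executable" ;;
 +    127) echo "command not found in the image" ;;
 +    137) echo "SIGKILL (128+9): OOM kill or docker kill" ;;
 +    139) echo "SIGSEGV (128+11): crash in native code" ;;
 +    143) echo "SIGTERM (128+15): stopped by docker stop or the orchestrator" ;;
 +    *)   echo "application error: read docker logs" ;;
 +  esac
 +}
 +
 +# Usage on a real host:
 +#   explain_exit_code "$(docker inspect -f '{{.State.ExitCode}}' <container-id>)"
 +</code>
 +
 +An exit code of 0 with no logs usually means the image's default command returned immediately, which points at the entrypoint rather than at the GPU stack.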
 +
 +----
 +
 +===== 6. Run:ai (NVIDIA Run:ai) =====
 +
 +Run:ai is the Kubernetes-native GPU orchestration platform NVIDIA acquired in 2024. Full Run:ai requires a license, so lab access is harder to come by — but the **conceptual model is highly testable**. Memorize the hierarchy, quota model, and CLI cold; you'll get scenario questions even without ever clicking through the UI.
 +
 +Note: the open-source **KAI Scheduler** is the scheduling engine extracted from Run:ai (Apache 2.0). You can install it on your BCM cluster's Kubernetes to practice the scheduling concepts (queues, fair-share, gang scheduling) without a Run:ai license.
 +
 +==== 6.1 The conceptual model ====
 +
 +The hierarchy, top to bottom:
 +
 +  - **Tenant** — top-level isolation (usually one per organization)
 +  - **Cluster** — a Kubernetes cluster managed by Run:ai
 +  - **Department** — group of projects, gets a slice of cluster quota
 +  - **Project** — the multi-tenancy unit; users belong to projects, workloads run inside them
 +  - **Workload** — the actual thing running (workspace, training, inference, etc.)
 +
 +==== 6.2 Quota model ====
 +
 +This comes up constantly on the exam.
 +
 +  * **Deserved quota** (a.k.a. //guaranteed//) — GPUs your project is guaranteed when it asks. **Cannot be preempted.**
 +  * **Over-quota** — opportunistic use of idle GPUs beyond your deserved quota. **Can be preempted** when another project wants its deserved share back.
 +  * **Over-quota priority** — which project gets leftovers first when multiple projects compete for over-quota.
 +  * **Over-quota weight** — proportional fair-share for over-quota allocation across projects with equal priority.
 +
 +Example: Project A has deserved=4 GPUs, Project B has deserved=4 GPUs, and the cluster has 12 GPUs total. If A is using 10 (4 deserved + 6 over-quota), only 2 GPUs are idle. When B asks for its full 4, Run:ai preempts 2 GPUs' worth of A's over-quota workloads to give B what it's owed.
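 +
 +The arithmetic behind a preemption decision can be checked by hand or with a few lines of shell. This is a deliberately simplified sketch, not the scheduler's actual algorithm: it ignores priority, over-quota weight, and node pools, and the function name ''gpus_to_preempt'' is hypothetical. It assumes the request falls inside the requesting project's deserved quota.
 +
 +<code bash>
 +# GPUs that must be reclaimed from over-quota workloads so that a request
 +# inside another project's deserved quota can be satisfied.
 +#   $1 = total GPUs in the cluster
 +#   $2 = GPUs currently in use
 +#   $3 = GPUs requested
 +gpus_to_preempt() {
 +  local idle=$(( $1 - $2 ))
 +  local short=$(( $3 - idle ))
 +  if [ "$short" -gt 0 ]; then echo "$short"; else echo 0; fi
 +}
 +
 +gpus_to_preempt 12 10 4   # 2 idle, 4 requested -> prints 2
 +gpus_to_preempt 16 10 4   # 6 idle, 4 requested -> prints 0
 +</code>
 +
 +Preemption only happens when idle capacity cannot cover the request; with enough idle GPUs nothing is reclaimed.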
 +
 +==== 6.3 Workload types ====
 +
 +^ Type                       ^ Use case                                  ^ Priority    ^ Preemptible ^
 +| **Workspace**              | Interactive dev (Jupyter, VSCode)         | Low         | Yes         |
 +| **Training**               | Batch training jobs                       | Medium      | Yes         |
 +| **Distributed Training**   | Multi-node PyTorch/TF/MPI/XGBoost         | Medium      | Yes (gang)  |
 +| **Inference**              | Model serving with autoscaling            | High        | **No**      |
 +
 +**Priority order, low → high:** Workspace < Training < Inference. Higher-priority workloads preempt lower ones when GPUs are scarce.
 +
 +**Gang scheduling** for distributed training: all worker pods start together or none do. No partial-start states.
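 +
 +The all-or-nothing rule reduces to a single comparison. A minimal sketch, assuming every worker requests the same number of GPUs and ignoring per-node placement, which a real scheduler must also satisfy; the function name ''gang_can_start'' is hypothetical.
 +
 +<code bash>
 +# All-or-nothing admission: succeed only if every worker fits at once.
 +#   $1 = free GPUs, $2 = number of workers, $3 = GPUs per worker
 +gang_can_start() {
 +  local needed=$(( $2 * $3 ))
 +  [ "$1" -ge "$needed" ]
 +}
 +
 +# 4 workers x 2 GPUs need 8 GPUs; with 6 free, no worker starts.
 +if gang_can_start 6 4 2; then
 +  echo "start all 4 workers"
 +else
 +  echo "keep all 4 workers pending"
 +fi
 +</code>
 +
 +Without this rule, three workers could start and hold GPUs while waiting for a fourth that never gets scheduled.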
 +
 +==== 6.4 The runai CLI ====
 +
 +<code bash>
 +# Authentication and context
 +runai login
 +runai config project <project-name>
 +runai list projects
 +runai list clusters
 +
 +# Submitting — know the variants
 +runai submit-workspace <name> --image jupyter/base-notebook --gpu 1
 +runai submit <name> --image pytorch/pytorch --gpu 2 \
 +  --command -- python train.py
 +runai submit-dist pytorch <name> --workers 4 --gpu 1 \
 +  --image nvcr.io/nvidia/pytorch:24.10-py3
 +
 +# Fractional GPU — three different ways
 +runai submit <name> --gpu 1                  # whole GPU
 +runai submit <name> --gpu-memory 4G          # fraction by memory (4 GB)
 +runai submit <name> --gpu-fraction 0.5       # fraction by share (50%)
 +
 +# Targeting hardware
 +runai submit <name> --node-pools "h100-pool"
 +runai submit <name> --node-type "dgx-a100"
 +
 +# Inspecting
 +runai list jobs
 +runai describe job <name>
 +runai logs <name>
 +runai logs -f <name>
 +runai bash <name>                            # exec into the pod
 +runai port-forward <name> --port 8888
 +
 +# Lifecycle
 +runai delete job <name>
 +runai suspend job <name>
 +runai resume job <name>
 +</code>
 +
 +The difference between ''%%--gpu 1%%'', ''%%--gpu-memory 4G%%'', and ''%%--gpu-fraction 0.5%%'' is a **classic exam question** — whole GPU vs fraction by memory vs fraction by share.
 +
 +==== 6.5 Fractional GPU is NOT MIG ====
 +
 +^ Property              ^ MIG                            ^ Run:ai Fractional GPU                 ^
 +| Isolation             | Hardware (memory, SMs, L2)     | Software (runtime-enforced)            |
 +| Supported GPUs        | A100, A30, H100, H200, B200    | Any NVIDIA GPU                         |
 +| Slice sizes           | Fixed profiles (1g.10gb, etc.) | Arbitrary (GB or fractional share)     |
 +| Reconfiguration       | Requires draining the GPU      | Per-workload, no GPU reset             |
 +| Memory protection     | Hardware-enforced              | Process-level via runtime              |
 +
 +If a question asks "tenant needs hardware-isolated GPU partitioning across an A100" → **MIG**. If it asks "team needs to run lots of small Jupyter notebooks on a V100/T4/L4" → **Run:ai Fractional GPU** (those GPUs don't support MIG).
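 +
 +The decision rule can be written down as a small function for self-testing. This is a sketch: the name ''pick_gpu_sharing'' is hypothetical, and the list of MIG-capable models is copied from the table above rather than queried from the hardware.
 +
 +<code bash>
 +# Pick a partitioning approach from the GPU model and the isolation need.
 +#   $1 = GPU model
 +#   $2 = "hw" if hardware isolation is required, anything else otherwise
 +pick_gpu_sharing() {
 +  local mig
 +  case "$1" in
 +    A100|A30|H100|H200|B200) mig=yes ;;
 +    *)                       mig=no  ;;
 +  esac
 +  if [ "$2" = "hw" ] && [ "$mig" = "yes" ]; then
 +    echo "MIG"
 +  elif [ "$2" = "hw" ]; then
 +    echo "none: $1 has no MIG; use whole GPUs"
 +  else
 +    echo "Run:ai fractional GPU"
 +  fi
 +}
 +
 +pick_gpu_sharing A100 hw     # hardware isolation on a MIG-capable GPU
 +pick_gpu_sharing T4 soft     # many small notebooks on a non-MIG GPU
 +</code>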
 +
 +==== 6.6 Node pools ====
 +
 +Node pools segment the cluster by hardware capability.
 +
 +<code bash>
 +# Typical pools in a heterogeneous cluster:
 +#   a100-pool       - A100 nodes, MIG-enabled
 +#   h100-pool       - H100 nodes
 +#   l40-pool        - inference-grade GPUs
 +#   cpu-pool        - CPU-only nodes for data prep
 +
 +# Projects are granted access to specific pools, with separate quotas per pool.
 +# Workloads target pools at submission time.
 +runai submit train-large --node-pools "h100-pool" --gpu 8
 +runai submit prep-data --node-pools "cpu-pool" --cpu 16
 +</code>
 +
 +==== 6.7 Architecture ====
 +
 +  * **Run:ai Control Plane** — the management layer. Can be **SaaS** (hosted by NVIDIA) or **self-hosted**. One Control Plane can manage many clusters.
 +  * **Run:ai Cluster components** — installed via Helm onto each managed Kubernetes cluster:
 +    * The scheduler (now based on KAI Scheduler)
 +    * The Run:ai operator
 +    * The fractional GPU runtime
 +    * Monitoring agents
 +    * Workload controllers (one per workload type)
 +
 +Integrations to recognize: SSO via SAML/OIDC, LDAP, Kubeflow, MLflow, Argo Workflows, Ray, OpenShift.
 +
 +==== 6.8 KAI Scheduler — the open-source path ====
 +
 +Since full Run:ai requires a license, the open-source **KAI Scheduler** is the practical way to drill the scheduling concepts on your own cluster.
 +
 +<code bash>
 +# Install on an existing Kubernetes cluster
 +# (chart location as given here; confirm it against the current KAI Scheduler README)
 +helm repo add nvidia-kai-scheduler https://nvidia.github.io/KAI-Scheduler
 +helm repo update
 +helm install kai-scheduler nvidia-kai-scheduler/kai-scheduler \
 +  -n kai-scheduler --create-namespace
 +
 +# Verify
 +kubectl get pods -n kai-scheduler
 +</code>
 +
 +What you can practice with KAI alone (no Run:ai license needed):
 +
 +  * Queue hierarchies and weighted fair-share
 +  * Gang scheduling for distributed pods
 +  * Pod priority and preemption
 +  * Resource quotas across queues
 +  * All of the above without GPUs, using CPU-only pods to exercise the scheduler logic
 +
 +==== 6.9 Practice scenario ====
 +
 +> A data science team complains their workspaces keep getting killed at 3 AM when training jobs run. Diagnose.
 +
 +This is **expected behavior**, not a bug. Workspaces are low-priority and preemptible; training jobs are higher priority. When a training job submitted by another project wants GPUs that the data science project is using only as over-quota, Run:ai preempts the workspace. Options:
 +
 +  - Increase the data science project's **deserved quota** so the workspaces are no longer in over-quota territory.
 +  - Move long-running data science work into **training-type workloads** (higher priority than workspaces).
 +  - Add **node pool** segregation — a dedicated pool for interactive work that batch training jobs can't preempt into.
 +  - Schedule batch training jobs to a **different time window** via job priority/scheduling rules.
  
 ---- ----
Line 691: Line 1000:
   * **Slurm reason codes** — ''Resources'', ''Priority'', ''ReqNodeNotAvail'', ''AssocGrpGRESLimit'', ''QOSMaxGRESPerUser'', ''ReqGresTypeNotAvail''.   * **Slurm reason codes** — ''Resources'', ''Priority'', ''ReqNodeNotAvail'', ''AssocGrpGRESLimit'', ''QOSMaxGRESPerUser'', ''ReqGresTypeNotAvail''.
   * **NCCL env vars** — ''NCCL_DEBUG'', ''NCCL_IB_HCA'', ''NCCL_SOCKET_IFNAME''.   * **NCCL env vars** — ''NCCL_DEBUG'', ''NCCL_IB_HCA'', ''NCCL_SOCKET_IFNAME''.
 +  * **NGC auth — username is literally ''%%$oauthtoken%%''**, password is the NGC API key. This trips people up every exam cycle.
 +  * **Docker ''%%--gpus%%'' flag variants** — ''all'', ''2'', ''%%"device=0,2"%%'', by UUID, by MIG UUID.
 +  * **The "could not select device driver" error** = NVIDIA Container Toolkit missing or no GPU on host.
 +  * **Run:ai quota model** — //deserved// is guaranteed, //over-quota// is preemptible. Inference > Training > Workspace in priority.
 +  * **Fractional GPU ≠ MIG** — Fractional is software-enforced and works on any NVIDIA GPU; MIG is hardware-enforced and only on A100/H100/etc.
 +  * **runai submit flags** — ''%%--gpu 1%%'' (whole), ''%%--gpu-memory 4G%%'' (fraction by memory), ''%%--gpu-fraction 0.5%%'' (fraction by share).
  
 Good luck — build the labs, break them on purpose, fix them, repeat. Good luck — build the labs, break them on purpose, fix them, repeat.
wiki/ai/ncp-aio-lab-training.1779120496.txt.gz · Last modified: by ddehamer