Study Materials for NVIDIA Certified Associate - AI Infrastructure Operations

Module 1: Fundamentals

  • Drivers of Evolution
    • Increase in the amount of data
    • Computational power growth
    • Algorithm breakthroughs
  • AI use cases across industries
    • Automotive
      • Real-time object detection
      • Autonomous decision making
      • Simulation for design/testing
    • Healthcare
      • Automated medical image analysis
      • Genomics pipelines, anomaly detection
      • Clinical inference with low latency
    • Surveillance
      • Real-time video stream analysis
      • Object detection & tracking
      • Threat detection, Multi-camera stream processing
    • Finance & Banking
      • Real-time fraud detection
      • Transaction scoring at scale
      • Ultra-low latency inference
    • Retail
      • Demand forecasting
      • Supply-chain optimization
      • Inventory management
    • Manufacturing
      • Automated quality control
      • Predictive simulations, supply chain logistics
  • AI, ML, DL, Gen AI
    • AI = Artificial Intelligence (ex. machines that can play chess based on rules)
      • Machines simulating human intelligence and decision making
    • ML = Machine Learning (ex. machines that learn to play chess by analyzing past chess games played by humans)
      • Ability of a machine to learn without being explicitly programmed
    • DL = Deep Learning (ex. machines that learn to play chess by playing against themselves)
      • Ability of machines to process data in a way inspired by the human brain (neural networks)
    • Gen AI = Generative AI (ex. machines that can create a new game based on the rules of chess and given prompts)
      • Creates new content based on prompts
  • Transformer model
    • “Attention” enabled models to scale the understanding of relationships between words
    • Processes entire sequences in parallel, making efficient use of parallel compute (see the attention sketch below)
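
A minimal sketch of the scaled dot-product attention at the heart of the transformer, in NumPy with made-up token and embedding sizes. Every token's query is compared against every token's key in a single matrix multiply, which is exactly the kind of bulk parallel work GPUs excel at.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row (token) attends to every other token in one shot."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted mix of value vectors

# Hypothetical sizes: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```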

Module 2: Inside an AI-Centric Data Center

  • Compute
    • Moore’s law is no longer holding: the number of transistors cannot double every 2 years due to scaling difficulty and physical limitations
    • Processing power
      • CPU (Central Processing Unit)
        • Few, complex, powerful
        • Flexible, general purpose tasks
        • Serial processing
        • Low latency, quick response
        • Best for small, complex, varied tasks
        • OS, general computing, apps
      • GPU (Graphic Processing Unit)
        • Many, hundreds-thousands
        • Optimized for parallel tasks
        • High throughput, bulk processing
        • Best for large, repetitive, parallel tasks (see the CuPy sketch after this list)
        • Graphics rendering, AI, simulations, mining
      • DPU (Data Processing Unit)
        • CPUs and GPUs do the computing; the DPU offloads the data-centric infrastructure work that makes it possible
        • Networking: packet processing, load balancing, overlay/underlay networking, RDMA
        • Storage: compression, encryption
        • Security: firewalls, packet inspection, IPsec, etc.
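
To make the CPU/GPU division of labor above concrete, here is a hedged sketch using CuPy (assumed to be installed on a CUDA-capable machine; the matrix size is made up): the Python program runs serially on the CPU and orchestrates, while the large, repetitive math is pushed to the GPU's many cores.

```python
import numpy as np
import cupy as cp  # assumes CuPy and an NVIDIA GPU are available

# CPU side: ordinary serial Python prepares the data
a_cpu = np.random.rand(4096, 4096).astype(np.float32)

a_gpu = cp.asarray(a_cpu)   # copy the bulk data into GPU memory
b_gpu = a_gpu @ a_gpu       # one big matmul runs across thousands of GPU cores
result = cp.asnumpy(b_gpu)  # bring the result back for the CPU-side program
print(result.shape)         # (4096, 4096)
```
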
  • Network
    • Communication between data center components
    • Compute network
      • GPU to GPU communication
      • InfiniBand, NVLink, High bandwidth
      • Extremely high throughput
    • Storage network
      • Connects compute nodes to storage appliances
      • Supports file systems, checkpoints, I/O traffic
      • InfiniBand or Ethernet
      • Multi-GB/s throughput per node
    • In-Band Mgmt. Network
      • Handles control-plane traffic and cluster management like SSH, DNS, job scheduling
      • Provides access to code repo, external repos
    • Out-of-Band Mgmt. Network
      • Provides remote management function (power control, serial console) even if servers are off
      • Used for recovery
      • Separate physical ports & low speed switches
      • Must always be available and redundant
      • Strong access control and security are essential
      • BMC – Baseboard Management Controller
  • Ethernet vs. InfiniBand
    • Not one or the other; they can complement each other
    • Ethernet
      • Like a highway, general purpose, good but bottlenecks can occur
      • General purpose networking, LANs, WANs, internet
      • Higher latency
      • 1 Gb/s – 400 Gb/s
      • Uses TCP/IP
      • Cheaper
      • Universal
    • InfiniBand
      • Like a bullet train: high speed, specific routes
      • Niche but essential for HPC and AI clusters
      • 10 Gb/s – 400 Gb/s
      • Extremely low latency
      • Uses RDMA (Remote Direct Memory Access)
      • More expensive, specialized
      • Specialized drivers and hardware
    • Converged Ethernet
      • LAN, SAN, HPC in one fabric
      • Higher bandwidth
      • Lower power usage
      • Can use RDMA
      • More cost efficient
  • Storage
    • AI Workloads demand high throughput, low latency, and scalability
    • NVMe SSD (Local storage)
    • Parallel File systems (Clustered storage)
      • Shared, high speed, access across many nodes in the cluster
    • Network File Systems (Network Storage)
      • Distributing small datasets, configuration, and scripts across nodes
    • Object storage
      • Long-term storage for massive raw data sets, etc.
  • Cloud vs. On-Prem
    • Low cost of entry (cloud) vs. greater data security & sovereignty (on-prem)
    • Pay-as-you-go pricing (cloud) vs. high upfront cost (on-prem)
    • Elastic scalability (cloud) vs. hardware capacity limits (on-prem)
    • Compliance considerations apply to both
  • Support Infrastructure
    • Power
      • A far larger share of power goes to compute (roughly 90%, vs. ~50% in a traditional data center)
    • Cooling
    • Security etc.
    • PUE (Power Usage Effectiveness)
      • Metric that compares the total energy consumed by a data center to the energy consumed by IT equipment
      • PUE = Total Facility Energy / IT Equipment Energy
      • Helps measure data center energy efficiency
      • Guides optimization in cooling, power distribution, and facility design
      • Lower = better: ~1.2 is highly efficient, ~2.0 is poor (worked example below)
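
A worked example of the PUE formula with hypothetical numbers, to make the "lower is better" scale concrete:

```python
# Hypothetical facility: 1.5 MW total draw, of which the IT equipment uses 1.25 MW.
total_facility_kw = 1500.0  # total facility energy draw (assumed)
it_equipment_kw = 1250.0    # IT equipment energy draw (assumed)

pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")   # 1.20 -> highly efficient; a value near 2.0 is poor
```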

Module 3: NVIDIA Technology Stack

  • Layer 1: Physical Layer
    • NVIDIA RTX
      • Gaming and workstation GPUs
    • DGX Platform
      • Data center AI server
    • GPU Cores
      • CUDA Core
        • “Regular Teacher” versatile but not specialized
      • Tensor Core
        • “Math Teacher” math and AI tasks
      • Ray Tracing Core
        • “Art Teacher” graphic rendering and ray tracing
    • DGX A100 vs. DGX H100/H200
      • Each has 8 Tensor Core GPUs (totals checked in the sketch after the Layer 1 list)
      • A100: 80 GB/GPU = 640 GB total GPU memory; dual AMD CPUs, 1 TB system RAM
      • H100: 80 GB/GPU = 640 GB total GPU memory; dual Intel CPUs, 2 TB system RAM
      • H200: 141 GB/GPU = 1128 GB total GPU memory; dual Intel CPUs, 2 TB system RAM
    • DGX SuperPOD
      • AI Supercomputer
    • ConnectX InfiniBand HCAs / NICs
      • Networking Interface
    • Bluefield / SuperNICs
      • DPUs
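
A quick arithmetic check of the per-system GPU memory totals listed above (8 GPUs in each DGX system):

```python
gpus_per_dgx = 8
per_gpu_memory_gb = {"A100": 80, "H100": 80, "H200": 141}

for model, gb in per_gpu_memory_gb.items():
    print(f"DGX {model}: {gpus_per_dgx} x {gb} GB = {gpus_per_dgx * gb} GB total")
# A100 and H100 -> 640 GB; H200 -> 1128 GB
```
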
  • Layer 2: Data Management & I/O Acceleration
    • NVLink
      • GPU Interconnect
    • RDMA, Storage
    • GPU Direct
    • InfiniBand, OpenSM
      • HPC Fabric
  • Layer 3: OS, Driver & Virtualization
    • DGX OS
      • Operating system (Ubuntu-based)
    • GPU Drivers
    • vGPU / MIG
      • GPU Virtualization
  • Layer 4: Core Libraries
    • CUDA
      • GPU Programming
    • NCCL
      • GPU Communication
  • Layer 5: Monitoring & Management
    • nvidia-smi
    • DCGM
    • Base Command Manager
  • Layer 6: Applications & Vertical Solutions
    • Clara
      • Healthcare & Hospitals
    • Merlin
      • Recommendation Systems
    • NVIDIA NIM
      • Inference Microservices
  • Integrations
    • Containerization
      • Docker
      • Kubernetes
    • ML Frameworks
      • TensorFlow
      • PyTorch
    • Workload Management
      • SLURM
    • Monitoring
      • Prometheus
      • Grafana
  • Vendors/Partners
  • NVIDIA Tools
    • nvidia-smi (see the query sketch after this list)
      • Check status on single system
      • Quick troubleshooting
      • No setup required
      • Immediate results
    • DCGM
      • Monitoring 10+ GPU nodes
      • Historical metrics
      • Alerting/diagnostics
      • Kubernetes GPU management
    • Base Command Manager
      • Managing entire AI data center
      • Job scheduling + monitoring
      • Multi-team/multi-user env
      • Enterprise-scale operation
    • OpenSM
      • Enables InfiniBand Subnet Management
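
As a small illustration of the quick, no-setup checks nvidia-smi is good for, here is a hedged sketch that drives it from Python via subprocess. The query fields are standard nvidia-smi options, but this assumes the NVIDIA driver (and thus nvidia-smi) is installed on the host, and the exact output depends on your GPUs.

```python
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    print(line)  # e.g. "NVIDIA A100-SXM4-80GB, 32 %, 12345 MiB, 81920 MiB"
```
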
  • NVIDIA Solutions
    • CPU
      • Grace
    • GPU
      • Hopper
      • Blackwell
    • “Chips”
      • Grace Hopper
      • Grace Blackwell
    • NVIDIA AI Enterprise
      • OS for enterprise AI
      • Suite of software that gives companies all the tools they need for full stack NVIDIA AI solutions
      • Drivers, frameworks, prebuilt models, services
    • NVIDIA AI Factory
      • AI-focused data center
      • Build, train, deploy AI models at scale
      • Takes in data, processes it, and produces models or inference results
      • Entire AI lifecycle

Module 4: AI Workflows

  • Data processing
    • Procuring, augmenting, cleaning, transforming data
    • NVIDIA RAPIDS
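
A minimal sketch of GPU-accelerated data processing with RAPIDS cuDF (assumes cuDF is installed on a CUDA-capable machine; the toy DataFrame is made up). cuDF mirrors the pandas API, so cleaning and transforming steps look familiar but execute on the GPU.

```python
import cudf  # RAPIDS GPU DataFrame library, assumed installed

df = cudf.DataFrame({"price": [10.0, None, 42.5], "qty": [1, 2, 3]})
df["price"] = df["price"].fillna(df["price"].mean())  # cleaning: impute missing value
df["total"] = df["price"] * df["qty"]                 # transforming: derive a feature
print(df)
```
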
  • Model Training
    • Teaching a model using processed data so it learns patterns and behaviors
    • PyTorch
    • TensorFlow
    • PyTorch and TensorFlow are machine learning frameworks: sets of tools, libraries, and prewritten code that help you build, train, and test machine learning models more easily
    • They provide building blocks, hardware acceleration, and utilities for loading data, saving models, etc.
  • Model Optimization
    • Optimizing a trained model with techniques like quantization and pruning for better inference performance
    • NVIDIA TensorRT
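
TensorRT builds optimized engines for NVIDIA GPUs, so as a stand-in that runs anywhere, here is a hedged sketch of one optimization technique named above: post-training dynamic quantization in PyTorch (PyTorch assumed installed; the layer sizes are made up).

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
# Replace Linear layers with int8-weight versions to shrink and speed up inference
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface as the original model: (1, 10)
```
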
  • Inferencing/Deployment
    • Run the optimized AI model in production to make accurate predictions on new inputs
    • NVIDIA Triton
    • Inference server
  • NVIDIA Differentiator
    • Python
    • Framework engine (PyTorch/TensorFlow)
    • cuDNN (Optimization layer)
    • CUDA (framework talks to GPU)
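
A quick hedged check of that stack from Python (assumes a PyTorch build with CUDA support): the framework call at the top dispatches down through cuDNN/CUDA to the GPU.

```python
import torch

print(torch.cuda.is_available())            # True only if the CUDA driver/runtime is visible
print(torch.backends.cudnn.is_available())  # True if cuDNN kernels can be used

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # tensor allocated in GPU memory
    y = x @ x                                   # matmul dispatched to CUDA kernels on the GPU
    print(y.device)                             # cuda:0
```
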
  • Model Training vs. Model Inference
    • Model Training
      • Initial teaching of model
      • Uses large dataset and parameters
      • Multiple iterations
      • High compute power, multiple GPUs (often)
      • Forward pass + backward pass + weight updates = high compute
      • More memory for model weights, optimizer states, gradients
      • Larger batch sizes increase memory demand
      • Scales horizontally across GPUs/nodes; a more expensive and time-consuming process
    • Model Inference
      • Running the trained model on new, unseen data to produce predictions
      • Low latency and high throughput are priorities
      • Compute lighter since only forward pass is needed
      • Focus on response time and efficiency
      • Less memory, as model is often optimized with compression and quantization
      • Scales elastically based on demand
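
A hedged PyTorch sketch (toy model and data) contrasting the two: a training step needs forward pass + backward pass + weight update, while an inference step is a forward pass only, with no gradients kept.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)  # toy model standing in for a real network
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

# Training step: forward + backward + weight update (heavy compute and memory)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()  # gradients stored for every parameter
opt.step()

# Inference step: forward pass only, no gradients kept (lighter compute/memory)
model.eval()
with torch.no_grad():
    preds = model(x).argmax(dim=1)
print(preds)
```
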
  • Job Scheduling vs. Container Orchestration
    • Job Scheduling
      • Aligned with training
      • Uses SLURM
      • Like an air traffic controller (see the toy priority-queue sketch after this list)
    • Container Orchestration
      • Aligned with inference
      • System control/monitoring
      • Like a smart city traffic system
      • Load balancing, autoscaling
      • Uses Kubernetes
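
A toy illustration of the job-scheduling idea (not SLURM itself; the job names and priorities are made up): queued jobs wait in a priority queue and are dispatched in priority order, like an air traffic controller clearing flights.

```python
import heapq

queue = []
heapq.heappush(queue, (1, "train-llm-8gpu"))      # priority 1 = most urgent
heapq.heappush(queue, (3, "preprocess-dataset"))
heapq.heappush(queue, (2, "finetune-resnet"))

while queue:
    priority, job = heapq.heappop(queue)
    print(f"dispatching {job} (priority {priority})")
```
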
  • SLURM vs. Kubernetes
    • SLURM
      • Job scheduling
      • Resource allocation and batch job management
      • HPC, AI training, data processing
      • Static jobs, queued execution
      • Priority queue, resource quota
      • CUDA-aware, multi-GPU aware
      • Command line
      • Researchers, data scientists, HPC admins
      • Health and performance
      • RDMA for direct GPU to GPU memory transfers, reducing latency
    • Kubernetes
      • Container Orchestration
      • Lifecycle management
      • AI inference, microservices, data pipelines
      • Dynamic pods, continuous service
      • Always-on or auto-scaled services
      • Load balancing, replica scaling
      • Scales containerized workloads across clusters
      • API-driven
      • DevOps, MLOps, AI platform engineers
      • MIG-aware scheduling
      • Uses GPU operator to install GPU drivers, CUDA, and DCGM automatically
  • Machine Learning Operations (MLOps)
    • Tools, processes, and best practices for end-to-end machine learning system development and operations in production
    • Model documentation and versioning
    • Data tracking
    • Standardization
    • Monitoring
    • Consistency of results
  • NVIDIA Tools for MLOps
    • Data Prep
      • RAPIDS, NVTabular, NeMo Data Curator
    • Model Training
      • NVIDIA AI Enterprise
      • Base Command Platform
      • DGX Cloud
      • PyTorch/TensorFlow (CUDA)
    • Model Optimization
      • TensorRT
      • TAO Toolkit
    • Deployment and Inference
      • NVIDIA Triton Inference Server
      • NVIDIA NIM Microservices
      • Fleet Command
    • Monitoring and Management
      • NVIDIA Base Command Manager
      • Fleet Command
      • NGC Registry
    • Continuous Learning/Updates
      • NGC Workflows
      • TAO Toolkit
      • NeMo Framework