Study Materials for NVIDIA Certified Associate - AI Infrastructure Operations

Module 1: Fundamentals

  • Drivers of Evolution
    • Increase in the amount of data
    • Computational power growth
    • Algorithm breakthroughs
  • AI use cases across industries
    • Automotive
      • Real-time object detection
      • Autonomous decision making
      • Simulation for design/testing
    • Healthcare
      • Automated medical image analysis
      • Genomics pipelines, anomaly detection
      • Clinical inference with low latency
    • Surveillance
      • Real-time video stream analysis
      • Object detection & tracking
      • Threat detection, Multi-camera stream processing
    • Finance & Banking
      • Real-time fraud detection
      • Transaction scoring at scale
      • Ultra-low latency inference
    • Retail
      • Demand forecasting
      • Supply-chain optimization
      • Inventory management
    • Manufacturing
      • Automated quality control
      • Predictive simulations, supply chain logistics
  • AI, ML, DL, Gen AI
    • AI = Artificial Intelligence (ex. machines that can play chess based on rules)
      • Machines simulating human intelligence and decision making
    • ML = Machine Learning (ex. machines that learn to play chess by analyzing past chess games played by humans)
      • Ability of a machine to learn without being explicitly programmed
    • DL = Deep Learning (ex. machines that learn to play chess by playing against themselves)
      • Ability of machines to process data in a way inspired by the human brain (neural networks)
    • Gen AI = Generative AI (ex. machines that can create a new game based on the rules of chess and given prompts)
      • Creates new content based on prompts
  • Transformer model
    • “Attention” enabled models to scale the understanding of relationships between words
    • Processes entire sequences in parallel, making efficient use of parallel compute (see the attention sketch below)
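
A minimal sketch of the scaled dot-product attention at the heart of the transformer, in NumPy with made-up token and embedding sizes. Every token's query is compared against every token's key in a single matrix multiply, which is exactly the kind of bulk parallel work GPUs excel at.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row (token) attends to every other token in one shot."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted mix of value vectors

# Hypothetical sizes: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```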

Module 2: Inside an AI-Centric Data Center

  • Compute
    • Moore’s law is no longer holding: the number of transistors cannot double every 2 years due to scaling difficulty and physical limitations
    • Processing power
      • CPU (Central Processing Unit)
        • Few, complex, powerful
        • Flexible, general purpose tasks
        • Serial processing
        • Low latency, quick response
        • Best for small, complex, varied tasks
        • OS, general computing, apps
      • GPU (Graphic Processing Unit)
        • Many, hundreds-thousands
        • Optimized for parallel tasks
        • High throughput, bulk processing
        • Best for large, repetitive, parallel tasks (see the CuPy sketch after this list)
        • Graphics rendering, AI, simulations, mining
      • DPU (Data Processing Unit)
        • CPUs and GPUs do the computing; the DPU offloads the data-centric infrastructure work that makes it possible
        • Networking: packet processing, load balancing, overlay/underlay networking, RDMA
        • Storage: compression, encryption
        • Security: firewalls, packet inspection, IPsec, etc.
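
To make the CPU/GPU division of labor above concrete, here is a hedged sketch using CuPy (assumed to be installed on a CUDA-capable machine; the matrix size is made up): the Python program runs serially on the CPU and orchestrates, while the large, repetitive math is pushed to the GPU's many cores.

```python
import numpy as np
import cupy as cp  # assumes CuPy and an NVIDIA GPU are available

# CPU side: ordinary serial Python prepares the data
a_cpu = np.random.rand(4096, 4096).astype(np.float32)

a_gpu = cp.asarray(a_cpu)   # copy the bulk data into GPU memory
b_gpu = a_gpu @ a_gpu       # one big matmul runs across thousands of GPU cores
result = cp.asnumpy(b_gpu)  # bring the result back for the CPU-side program
print(result.shape)         # (4096, 4096)
```
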
  • Network
    • Communication between data center components
    • Compute network
      • GPU to GPU communication
      • InfiniBand, NVLink, High bandwidth
      • Extremely high throughput
    • Storage network
      • Connects compute nodes to storage appliances
      • Supports file systems, checkpoints, I/O traffic
      • InfiniBand or Ethernet
      • Multi-GB/s throughput per node
    • In-Band Mgmt. Network
      • Handles control-plane traffic and cluster management like SSH, DNS, job scheduling
      • Provides access to code repo, external repos
    • Out-of-Band Mgmt. Network
      • Provides remote management function (power control, serial console) even if servers are off
      • Used for recovery
      • Separate physical ports & low speed switches
      • Must always be available and redundant
      • Strong access control and security are essential
      • BMC – Baseboard Management Controller
  • Ethernet vs. InfiniBand
    • Not one or the other; they can complement each other
    • Ethernet
      • Like a highway, general purpose, good but bottlenecks can occur
      • General purpose networking, LANs, WANs, internet
      • Higher latency
      • 1 Gb/s – 400 Gb/s
      • Uses TCP/IP
      • Cheaper
      • Universal
    • InfiniBand
      • Like a bullet train: high speed, specific routes
      • Niche but essential for HPC and AI clusters
      • 10 Gb/s – 400 Gb/s
      • Extremely low latency
      • Uses RDMA (Remote Direct Memory Access)
      • More expensive, specialized
      • Specialized drivers and hardware
    • Converged Ethernet
      • LAN, SAN, HPC in one fabric
      • Higher bandwidth
      • Lower power usage
      • Can use RDMA
      • More cost efficient
  • Storage
    • AI Workloads demand high throughput, low latency, and scalability
    • NVMe SSD (Local storage)
    • Parallel File systems (Clustered storage)
      • Shared, high speed, access across many nodes in the cluster
    • Network File Systems (Network Storage)
      • Distributing small datasets, configuration, and scripts across nodes
    • Object storage
      • Long-term storage for massive raw data sets, etc.
  • Cloud vs. On-Prem
    • Low cost of entry (cloud) vs. greater data security & sovereignty (on-prem)
    • Pay-as-you-go pricing (cloud) vs. high upfront cost (on-prem)
    • Elastic scalability (cloud) vs. hardware capacity limits (on-prem)
    • Compliance considerations apply to both
  • Support Infrastructure
    • Power
      • A far larger share of power goes to compute (roughly 90%, vs. ~50% in a traditional data center)
    • Cooling
    • Security etc.
    • PUE (Power Usage Effectiveness)
      • Metric that compares the total energy consumed by a data center to the energy consumed by IT equipment
      • PUE = Total Facility Energy / IT Equipment Energy
      • Helps measure data center energy efficiency
      • Guides optimization in cooling, power distribution, and facility design
      • Lower = better: ~1.2 is highly efficient, ~2.0 is poor (worked example below)
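
A worked example of the PUE formula with hypothetical numbers, to make the "lower is better" scale concrete:

```python
# Hypothetical facility: 1.5 MW total draw, of which the IT equipment uses 1.25 MW.
total_facility_kw = 1500.0  # total facility energy draw (assumed)
it_equipment_kw = 1250.0    # IT equipment energy draw (assumed)

pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")   # 1.20 -> highly efficient; a value near 2.0 is poor
```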

Module 3: NVIDIA Technology Stack

  • Layer 1: Physical Layer
    • NVIDIA RTX
      • Gaming and workstation GPUs
    • DGX Platform
      • Data center AI server
    • GPU Cores
      • CUDA Core
        • “Regular Teacher” versatile but not specialized
      • Tensor Core
        • “Math Teacher” math and AI tasks
      • Ray Tracing Core
        • “Art Teacher” graphic rendering and ray tracing
    • DGX A100 vs. DGX H100/H200
      • Each has 8 Tensor Core GPUs (totals checked in the sketch after the Layer 1 list)
      • A100: 80 GB/GPU = 640 GB total GPU memory; dual AMD CPUs, 1 TB system RAM
      • H100: 80 GB/GPU = 640 GB total GPU memory; dual Intel CPUs, 2 TB system RAM
      • H200: 141 GB/GPU = 1128 GB total GPU memory; dual Intel CPUs, 2 TB system RAM
    • DGX SuperPOD
      • AI Supercomputer
    • ConnectX InfiniBand HCAs / NICs
      • Networking Interface
    • Bluefield / SuperNICs
      • DPUs
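
A quick arithmetic check of the per-system GPU memory totals listed above (8 GPUs in each DGX system):

```python
gpus_per_dgx = 8
per_gpu_memory_gb = {"A100": 80, "H100": 80, "H200": 141}

for model, gb in per_gpu_memory_gb.items():
    print(f"DGX {model}: {gpus_per_dgx} x {gb} GB = {gpus_per_dgx * gb} GB total")
# A100 and H100 -> 640 GB; H200 -> 1128 GB
```
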
  • Layer 2: Data Management & I/O Acceleration
    • NVLink
      • GPU Interconnect
    • RDMA, Storage
    • GPU Direct
    • InfiniBand, OpenSM
      • HPC Fabric
  • Layer 3: OS, Driver & Virtualization
    • DGX OS
      • Operating system (Ubuntu-based)
    • GPU Drivers
    • vGPU / MIG
      • GPU Virtualization
  • Layer 4: Core Libraries
    • CUDA
      • GPU Programming
    • NCCL
      • GPU Communication
  • Layer 5: Monitoring & Management
    • nvidia-smi
    • DCGM
    • Base Command Manager
  • Layer 6: Applications & Vertical Solutions
    • Clara
      • Healthcare & Hospitals
    • Merlin
      • Recommendation Systems
    • NVIDIA NIM
      • Inference Microservices
  • Integrations
    • Containerization
      • Docker
      • Kubernetes
    • ML Frameworks
      • TensorFlow
      • PyTorch
    • Workload Management
      • SLURM
    • Monitoring
      • Prometheus
      • Grafana
  • Vendors/Partners
  • NVIDIA Tools
    • nvidia-smi (see the query sketch after this list)
      • Check status on single system
      • Quick troubleshooting
      • No setup required
      • Immediate results
    • DCGM
      • Monitoring 10+ GPU nodes
      • Historical metrics
      • Alerting/diagnostics
      • Kubernetes GPU management
    • Base Command Manager
      • Managing entire AI data center
      • Job scheduling + monitoring
      • Multi-team/multi-user env
      • Enterprise-scale operation
    • OpenSM
      • Enables InfiniBand Subnet Management
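
As a small illustration of the quick, no-setup checks nvidia-smi is good for, here is a hedged sketch that drives it from Python via subprocess. The query fields are standard nvidia-smi options, but this assumes the NVIDIA driver (and thus nvidia-smi) is installed on the host, and the exact output depends on your GPUs.

```python
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    print(line)  # e.g. "NVIDIA A100-SXM4-80GB, 32 %, 12345 MiB, 81920 MiB"
```
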
  • NVIDIA Solutions
    • CPU
      • Grace
    • GPU
      • Hopper
      • Blackwell
    • “Chips”
      • Grace Hopper
      • Grace Blackwell
    • NVIDIA AI Enterprise
      • OS for enterprise AI
      • Suite of software that gives companies all the tools they need for full stack NVIDIA AI solutions
      • Drivers, frameworks, prebuilt models, services
    • NVIDIA AI Factory
      • AI-focused data center
      • Build, train, deploy AI models at scale
      • Takes in data, processes it, and produces models or inference results
      • Entire AI lifecycle

Module 4: AI Workflows

  • Data processing
    • Procuring, augmenting, cleaning, transforming data
    • NVIDIA RAPIDS
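
A minimal sketch of GPU-accelerated data processing with RAPIDS cuDF (assumes cuDF is installed on a CUDA-capable machine; the toy DataFrame is made up). cuDF mirrors the pandas API, so cleaning and transforming steps look familiar but execute on the GPU.

```python
import cudf  # RAPIDS GPU DataFrame library, assumed installed

df = cudf.DataFrame({"price": [10.0, None, 42.5], "qty": [1, 2, 3]})
df["price"] = df["price"].fillna(df["price"].mean())  # cleaning: impute missing value
df["total"] = df["price"] * df["qty"]                 # transforming: derive a feature
print(df)
```
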
  • Model Training
    • Teaching a model using processed data so it learns patterns and behaviors
    • PyTorch
    • TensorFlow
    • PyTorch and TensorFlow are machine learning frameworks: sets of tools, libraries, and prewritten code that help you build, train, and test machine learning models more easily
    • They provide building blocks, hardware acceleration, and utilities for loading data, saving models, etc.
  • Model Optimization
    • Optimizing a trained model with techniques like quantization and pruning for better inference performance
    • NVIDIA TensorRT
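
TensorRT builds optimized engines for NVIDIA GPUs, so as a stand-in that runs anywhere, here is a hedged sketch of one optimization technique named above: post-training dynamic quantization in PyTorch (PyTorch assumed installed; the layer sizes are made up).

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
# Replace Linear layers with int8-weight versions to shrink and speed up inference
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface as the original model: (1, 10)
```
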
  • Inferencing/Deployment
    • Run the optimized AI model in production to make accurate predictions on new inputs
    • NVIDIA Triton
    • Inference server
  • NVIDIA Differentiator
    • Python
    • Framework engine (PyTorch/TensorFlow)
    • cuDNN (Optimization layer)
    • CUDA (framework talks to GPU)
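
A quick hedged check of that stack from Python (assumes a PyTorch build with CUDA support): the framework call at the top dispatches down through cuDNN/CUDA to the GPU.

```python
import torch

print(torch.cuda.is_available())            # True only if the CUDA driver/runtime is visible
print(torch.backends.cudnn.is_available())  # True if cuDNN kernels can be used

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # tensor allocated in GPU memory
    y = x @ x                                   # matmul dispatched to CUDA kernels on the GPU
    print(y.device)                             # cuda:0
```
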
  • Model Training vs. Model Inference
    • Model Training
      • Initial teaching of model
      • Uses large dataset and parameters
      • Multiple iterations
      • High compute power, multiple GPUs (often)
      • Forward pass + backward pass + weight updates = high compute
      • More memory for model weights, optimizer states, gradients
      • Larger batch sizes increase memory demand
      • Scales horizontally across GPUs/nodes; a more expensive and time-consuming process
    • Model Inference
      • Running the trained model on new, unseen data to produce predictions
      • Low latency and high throughput are priorities
      • Compute lighter since only forward pass is needed
      • Focus on response time and efficiency
      • Less memory, as model is often optimized with compression and quantization
      • Scales elastically based on demand
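
A hedged PyTorch sketch (toy model and data) contrasting the two: a training step needs forward pass + backward pass + weight update, while an inference step is a forward pass only, with no gradients kept.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)  # toy model standing in for a real network
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

# Training step: forward + backward + weight update (heavy compute and memory)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()  # gradients stored for every parameter
opt.step()

# Inference step: forward pass only, no gradients kept (lighter compute/memory)
model.eval()
with torch.no_grad():
    preds = model(x).argmax(dim=1)
print(preds)
```
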
  • Job Scheduling vs. Container Orchestration
    • Job Scheduling
      • Aligned with training
      • Uses SLURM
      • Like an air traffic controller (see the toy priority-queue sketch after this list)
    • Container Orchestration
      • Aligned with inference
      • System control/monitoring
      • Like a smart city traffic system
      • Load balancing, autoscaling
      • Uses Kubernetes
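
A toy illustration of the job-scheduling idea (not SLURM itself; the job names and priorities are made up): queued jobs wait in a priority queue and are dispatched in priority order, like an air traffic controller clearing flights.

```python
import heapq

queue = []
heapq.heappush(queue, (1, "train-llm-8gpu"))      # priority 1 = most urgent
heapq.heappush(queue, (3, "preprocess-dataset"))
heapq.heappush(queue, (2, "finetune-resnet"))

while queue:
    priority, job = heapq.heappop(queue)
    print(f"dispatching {job} (priority {priority})")
```
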
  • SLURM vs. Kubernetes
    • SLURM
      • Job scheduling
      • Resource allocation and batch job management
      • HPC, AI training, data processing
      • Static jobs, queued execution
      • Priority queue, resource quota
      • CUDA-aware, multi-GPU aware
      • Command line
      • Researchers, data scientists, HPC admins
      • Health and performance
      • RDMA for direct GPU to GPU memory transfers, reducing latency
    • Kubernetes
      • Container Orchestration
      • Lifecycle management
      • AI inference, microservices, data pipelines
      • Dynamic pods, continuous service
      • Always-on or auto-scaled services
      • Load balancing, replica scaling
      • Scales containerized workloads across clusters
      • API-driven
      • DevOps, MLOps, AI platform engineers
      • MIG-aware scheduling
      • Uses GPU operator to install GPU drivers, CUDA, and DCGM automatically
  • Machine Learning Operations (MLOps)
    • Tools, processes, and best practices for end-to-end machine learning system development and operations in production
    • Model documentation and versioning
    • Data tracking
    • Standardization
    • Monitoring
    • Consistency of results
  • NVIDIA Tools for MLOps
    • Data Prep
      • RAPIDS, NVTabular, NeMo Data Curator
    • Model Training
      • NVIDIA AI Enterprise
      • Base Command Platform
      • DGX Cloud
      • PyTorch/TensorFlow (CUDA)
    • Model Optimization
      • TensorRT
      • TAO Toolkit
    • Deployment and Inference
      • NVIDIA Triton Inference Server
      • NVIDIA NIM Microservices
      • Fleet Command
    • Monitoring and Management
      • NVIDIA Base Command Manager
      • Fleet Command
      • NGC Registry
    • Continuous Learning/Updates
      • NGC Workflows
      • TAO Toolkit
      • NeMo Framework