Security: firewalls, packet inspection, IPsec, etc.
Network
Communication between data center components
Compute network
GPU to GPU communication
InfiniBand, NVLink, High bandwidth
Extremely high throughput
Storage network
Connects compute nodes to storage appliances
Supports file systems, checkpoints, I/O traffic
InfiniBand or Ethernet
Multi-GB/s throughput per node
In-Band Mgmt. Network
Handles control-plane traffic and cluster management services like SSH, DNS, job scheduling
Provides access to code repo, external repos
Out-of-Band Mgmt. Network
Provides remote management functions (power control, serial console) even when servers are powered off
Used for recovery
Separate physical ports & low speed switches
Must always be available and redundant
Strong access control and security are essential
BMC – Baseboard Management Controller
Ethernet vs. InfiniBand
Not one or the other; the two can complement each other
Ethernet
Like a highway, general purpose, good but bottlenecks can occur
General purpose networking, LANs, WANs, internet
Higher latency
1 Gb/s – 400 Gb/s
Uses TCP/IP
Cheaper
Universal
InfiniBand
Like a bullet train: high speed, dedicated routes
Niche but essential for HPC and AI clusters
10 Gb/s – 400 Gb/s
Extremely low latency
Uses RDMA (Remote Direct Memory Access)
More expensive, specialized
Specialized drivers and hardware
Converged Ethernet
LAN, SAN, HPC in one fabric
Higher bandwidth
Lower power usage
Can use RDMA
More cost efficient
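To get a rough feel for what these link speeds mean in practice, here is a back-of-the-envelope transfer-time calculation; the 640 GB checkpoint size is just an illustrative figure (a full DGX A100's aggregate GPU memory), and real transfers have protocol overhead:

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal (zero-overhead) time to move size_gb gigabytes over a link_gbps link."""
    return size_gb * 8 / link_gbps  # bytes -> bits, then divide by line rate

# A 640 GB checkpoint over two link speeds:
print(transfer_seconds(640, 10))   # 10 Gb/s Ethernet  -> 512.0 s
print(transfer_seconds(640, 400))  # 400 Gb/s fabric   -> 12.8 s
```

The 40x gap is why checkpoint and gradient traffic gets its own high-bandwidth fabric instead of riding the general-purpose network.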
Storage
AI Workloads demand high throughput, low latency, and scalability
NVMe SSD (Local storage)
Parallel File systems (Clustered storage)
Shared, high-speed access across many nodes in the cluster
Network File Systems (Network Storage)
Distributing small datasets, configuration, and scripts across nodes
Object storage
Long term storage for massive raw data sets, etc
Cloud vs. On-Prem
Cloud offers a low cost of entry vs. on-prem's greater data security & sovereignty
Pay-as-you-go vs. high upfront cost
Elastic scalability vs. fixed hardware limits
Compliance considerations
Support Infrastructure
Power
Much higher power demand (IT load is ~90% of facility power vs. ~50% in a traditional data center)
Cooling
Security etc.
PUE (Power Usage Effectiveness)
Metric that compares the total energy consumed by a data center to the energy consumed by IT equipment
PUE = Total Facility Energy / IT Equipment Energy
Helps measure data center energy efficiency
Guides optimization in cooling, power distribution, and facility design
Lower is better: ~1.2 is highly efficient, ~2.0 is inefficient
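The PUE formula above is simple enough to sketch directly; the kW figures below are made up for illustration:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = total facility energy / IT equipment energy (lower is better)."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

print(pue(1200, 1000))  # 1.2 -> highly efficient facility
print(pue(2000, 1000))  # 2.0 -> one watt of overhead for every watt of IT load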
Module 3: NVIDIA Technology Stack
Layer 1: Physical Layer
NVIDIA RTX
Gaming and workstation GPUs
DGX Platform
Data center AI server
GPU Cores
CUDA Core
“Regular Teacher” versatile but not specialized
Tensor Core
“Math Teacher” math and AI tasks
Ray Tracing Core
“Art Teacher” graphic rendering and ray tracing
DGX A100 vs. DGX H100/H200
Each system has 8 Tensor Core GPUs
A100: 80 GB/GPU = 640 GB total; dual AMD CPUs, 1 TB system memory
H100: 80 GB/GPU = 640 GB total; dual Intel CPUs, 2 TB system memory
H200: 141 GB/GPU = 1128 GB total; dual Intel CPUs, 2 TB system memory
DGX SuperPOD
AI Supercomputer
ConnectX InfiniBand HCAs / NICs
Networking Interface
Bluefield / SuperNICs
DPUs
Layer 2: Data Management & I/O Acceleration
NVLink
GPU Interconnect
GPUDirect
RDMA, Storage
InfiniBand, OpenSM
HPC Fabric
Layer 3: OS, Driver & Virtualization
DGX OS
Operating System - Ubuntu
GPU Drivers
vGPU / MIG
GPU Virtualization
Layer 4: Core Libraries
CUDA
GPU Programming
NCCL
GPU Communication
Layer 5: Monitoring & Management
nvidia-smi
DCGM
Base Command Manager
Layer 6: Applications & Vertical Solutions
Clara
Healthcare & Hospitals
Merlin
Recommendation Systems
NVIDIA NIMs
Inference Microservices
Integrations
Containerization
Docker
Kubernetes
ML Frameworks
TensorFlow
PyTorch
Workload Management
SLURM
Monitoring
Prometheus
Grafana
Vendors/Partners
NVIDIA Tools
nvidia-smi
Check status on single system
Quick troubleshooting
No setup required
Immediate results
DCGM
Monitoring 10+ GPU nodes
Historical metrics
Alerting/diagnostics
Kubernetes GPU management
Base Command Manager
Managing entire AI data center
Job scheduling + monitoring
Multi-team/multi-user env
Enterprise-scale operation
OpenSM
Enables InfiniBand Subnet Management
NVIDIA Solutions
CPU
Grace
GPU
Hopper
Blackwell
“Superchips”
Grace Hopper
Grace Blackwell
NVIDIA AI Enterprise
OS for enterprise AI
Suite of software that gives companies all the tools they need for full stack NVIDIA AI solutions
Drivers, frameworks, prebuilt models, services
NVIDIA AI Factory
AI-focused data center
Build, train, deploy AI models at scale
Takes in data, processes it, and produces models or inference results
Entire AI lifecycle
Module 4 – AI Workflows
Data processing
Procuring, augmenting, cleaning, transforming data
NVIDIA RAPIDS
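A minimal sketch of what "cleaning and transforming" means, in plain Python with made-up rows; in practice NVIDIA RAPIDS provides a pandas-like API that runs these same operations on the GPU at scale:

```python
# Made-up raw records, as they might arrive from ingestion.
raw = [
    {"age": "34", "income": "72000"},
    {"age": None, "income": "51000"},   # missing value -> dropped
    {"age": "29", "income": "bad"},     # malformed value -> dropped
]

def clean(rows):
    """Parse string fields to ints, dropping rows that cannot be parsed."""
    out = []
    for r in rows:
        try:
            out.append({"age": int(r["age"]), "income": int(r["income"])})
        except (TypeError, ValueError):
            continue
    return out

print(clean(raw))  # only the fully valid row survives
```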
Model Training
Teaching a model using processed data so it learns patterns and behaviors
PyTorch
TensorFlow
PyTorch and TensorFlow are machine learning frameworks: sets of tools, libraries, and prewritten code that help you build, train, and test machine learning models more easily
Provide building blocks, hardware acceleration, and utilities for loading data, saving models, etc.
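To make "learning patterns from data" concrete, here is a toy gradient-descent loop in plain Python; frameworks like PyTorch and TensorFlow automate exactly this gradient math and scale it to millions of parameters on GPUs. The data, learning rate, and iteration count are made up:

```python
# Toy dataset following y = 2x; the model is a single weight w.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0    # model: y_hat = w * x
lr = 0.05  # learning rate
for _ in range(200):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # step against the gradient

print(round(w, 3))  # converges toward 2.0, the slope hidden in the data
```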
Model Optimization
Fine-tuning using quantization and pruning for better performance
NVIDIA TensorRT
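A toy sketch of one optimization technique, post-training int8 quantization, with made-up weights; tools like TensorRT handle this (plus calibration, pruning, and kernel fusion) automatically:

```python
# Symmetric int8 quantization: store float32 weights as small ints (~4x smaller).
weights = [0.82, -1.50, 0.03, 1.27]  # made-up float32 weights

scale = max(abs(w) for w in weights) / 127       # map the largest weight to 127
quantized = [round(w / scale) for w in weights]  # what gets stored (int8 range)
dequantized = [q * scale for q in quantized]     # what inference effectively uses

print(quantized)
print(max(abs(w - d) for w, d in zip(weights, dequantized)))  # rounding error
```

The per-weight error is bounded by half the scale step, which is why inference accuracy usually survives the 4x size reduction.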
Inferencing/Deployment
Run the optimized AI model in production to make accurate predictions on new inputs