This shows you the differences between two versions of the page.
| wiki:ai:gpu_firmware_upgrade [2026/05/08 17:25] – smadanagopal | wiki:ai:gpu_firmware_upgrade [2026/05/12 10:00] (current) – smadanagopal | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== NVIDIA GPU Firmware Upgrade Runbook ====== | + | ====== NVIDIA GPU Firmware Upgrade Runbook |
| ===== Process at a Glance ===== | ===== Process at a Glance ===== | ||
| Line 5: | Line 5: | ||
| - **Prepare** — record current versions, verify GPU health, stop all workloads | - **Prepare** — record current versions, verify GPU health, stop all workloads | ||
| - **Validate Package** — checksum the firmware file, preview what will change | - **Validate Package** — checksum the firmware file, preview what will change | ||
| - | - **Flash GPU Tray** — update VBIOS, NVSwitch, EROT, FPGA via nvfwupd | + | - **Phase 2 (in this exact order):** |
| - | - **Reboot** — wait for background copy to complete, then cold reboot | + | - BMC firmware |
| - | - **Flash Motherboard Tray** — update BMC, BIOS, PSUs; reset BMC | + | - BMC reset (mc reset cold) |
| + | - Motherboard tray update (SBIOS, CPLD, PSU, etc.) | ||
| + | - GPU tray update (VBIOS, NVSwitch, EROT, FPGA) | ||
| + | - Poll BackgroundCopyStatus until **Completed** | ||
| + | - Chassis power cycle (graceful, via Redfish/ | ||
| + | - Re-check versions on staged components | ||
| + | - CommitImage on **HostBMC_0** and **HostBIOS_0** | ||
| + | - **AC power cycle** (full PDU/cord removal, 5–6 min) | ||
| + | - NIC firmware via '' | ||
| - **Validate** — confirm new versions, GPU health, NVLink, smoke test | - **Validate** — confirm new versions, GPU health, NVLink, smoke test | ||
| - | **NOTE:** __Flashing | + | <note warning> |
| + | Flashing | ||
| + | |||
| + | **Critical ordering:** BMC must be updated and reset *before* the motherboard and GPU trays so the BMC speaks the new Redfish schema used by the rest of the bundle. Skipping the AC power cycle leaves EROT/CPLD images staged but not activated. | ||
| + | </ | ||
| ---- | ---- | ||
| Line 16: | Line 28: | ||
| ===== 1. Overview ===== | ===== 1. Overview ===== | ||
| - | This runbook covers the end-to-end firmware upgrade process for NVIDIA | + | This runbook covers the end-to-end firmware upgrade process for NVIDIA |
| + | |||
| + | **Scope — what this runbook covers:** | ||
| + | * **Hopper:** DGX H100/H200, HGX H100/H200 | ||
| + | * **Blackwell: | ||
| + | * **Standalone PCIe GPUs** via NVFlash | ||
| + | * Components: VBIOS, NVSwitch, EROT, FPGA, BMC, SBIOS, CPLD, PSU, NIC, TPM, NVMe, and (Blackwell only) Grace CPU FW, NVLink Switch tray FW, ConnectX-7/ | ||
| + | |||
| + | **Blackwell-specific notes (GB200 / B200):** | ||
| + | * GB200 NVL72 racks are driven from the **NVIDIA Mission Control** rack-firmware tooling. Refer to the Mission Control rack-firmware document shipped with the bundle | ||
| + | * Bundles arrive as **two NVIDIA-released packages** plus the NVLink Switch tray package: '' | ||
| + | * **Ordering, simplified: | ||
| + | * **Activation: | ||
| + | * The final mandatory **rack-level AC cycle** for EROT/CPLD still applies — for GB200 NVL72 this means power-cycling the full rack PDU feeds (coordinate with DC ops). | ||
| + | * **nvfwupd floor:** 2.0.4 for Hopper; **2.0.5 for Blackwell** (HMC + Grace + NVLink Switch tray Redfish paths), or **2.0.9** if you intend to run parallel multi-package updates across the rack. | ||
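The version floors in the bullet above can be checked mechanically before a run. This is a minimal sketch, assuming GNU `sort -V` is available on the management host. `version_ge` and `nvfwupd_floor` are hypothetical helper names, not part of nvfwupd, and the installed version is supplied by hand because the output format of `nvfwupd --version` is not shown in this runbook.

```bash
#!/usr/bin/env bash
# version_ge A B: succeed when dotted version A >= B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# nvfwupd_floor PLATFORM [parallel]: print the minimum nvfwupd version
# stated in this runbook for the given platform.
nvfwupd_floor() {
    case "$1" in
        hopper)
            echo "2.0.4" ;;
        blackwell)
            if [ "${2:-}" = "parallel" ]; then echo "2.0.9"; else echo "2.0.5"; fi ;;
        *)
            echo "unknown platform: $1" >&2
            return 1 ;;
    esac
}

installed="2.0.5"   # placeholder: replace with the version on your management host
if version_ge "$installed" "$(nvfwupd_floor blackwell)"; then
    echo "nvfwupd $installed meets the Blackwell floor"
else
    echo "nvfwupd $installed is below the Blackwell floor" >&2
fi
```

Version-aware sorting matters here: a plain string comparison would rank 2.0.10 below 2.0.9.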
| + | |||
| + | **Scope — what this runbook does NOT cover:** | ||
| + | * DGX A100 (use the Firmware Update Container — separate runbook) | ||
| + | * Standalone Mellanox/ | ||
| + | * Quantum/ | ||
| + | * Storage controller / RAID HBA firmware | ||
| ---- | ---- | ||
| Line 22: | Line 54: | ||
| ===== 2. Background ===== | ===== 2. Background ===== | ||
| - | ==== What Is GPU Firmware? | + | ==== Components Updated |
| - | Firmware is low-level | + | ^ Component |
| + | | BMC | Out-of-band server management | ||
| + | | SBIOS / UEFI | Motherboard initialisation | ||
| + | | CPLD | Board-level | ||
| + | | VBIOS | GPU boot, clocks, power limits, PCIe config | ||
| + | | NVSwitch FW | GPU-to-GPU interconnect (NVLink fabric) | ||
| + | | EROT | Hardware root of trust / secure boot | **AC power cycle (mandatory)** | ||
| + | | FPGA | Power management | ||
| + | | PSU | Power supply unit firmware | ||
| + | | NIC / ConnectX | Onboard NIC firmware (incl. ConnectX-7/ | ||
| + | | TPM | Trusted Platform Module | ||
| + | | NVMe | Onboard storage firmware | ||
| - | **Components updated in a full DGX/HGX firmware upgrade:** | + | <note important> |
| + | EROT and CPLD images are staged into a secondary flash region by the update, but only activate on a full AC power removal — a warm reboot or chassis power cycle is not sufficient. | ||
| + | </note> | ||
| - | ^ Component | + | ==== Driver Compatibility ==== |
| - | | VBIOS | GPU boot, clocks, power limits, PCIe config | + | |
| - | | NVSwitch | + | |
| - | | EROT | Hardware root of trust / secure boot | | + | |
| - | | FPGA | Power management and control plane | | + | |
| - | | BMC | Out-of-band server management | + | |
| - | | BIOS / UEFI | Motherboard initialisation | + | |
| - | | PSU | Power supply unit firmware | + | |
| - | ==== Why Upgrade? ==== | + | The minimum driver depends on the **firmware package version**, not the GPU architecture alone. Refer to the **per-package release notes** shipped with the .fwpkg you intend to install. Do not rely on generic architecture → driver tables; they go stale quickly as new packages add features and ABI requirements. |
| - | * Fix GPU stability bugs or uncorrectable memory errors | + | ==== NVSwitch Chip vs NVLink |
| - | * Apply security patches to the hardware trust chain | + | |
| - | * Meet minimum firmware requirements for new driver versions | + | |
| - | * Enable features such as Confidential Computing or NVLink | + | |
| - | ==== Minimum Driver Requirements ==== | + | * **NVSwitch chip** — silicon on the GPU baseboard. Inventory path: ''/ |
| + | * **NVLink Switch tray** — separate rack-level switch chassis in GB200 NVL72 racks. Inventory path: ''/ | ||
| - | ^ | + | Selecting the wrong target will return a Redfish 404 and silently skip the component. |
| - | | Ampere | + | |
| - | | Hopper | + | ==== Time Budget (typical per-node) ==== |
| - | | Ada Lovelace | + | |
| + | ^ | ||
| + | | Pre-upgrade preparation | ||
| + | | BMC update + reset + verify (2.1–2.2) | 10 min | | ||
| + | | Motherboard tray update (2.3) | 15 min | | ||
| + | | GPU tray update (2.4) | 25 min | | ||
| + | | BackgroundCopyStatus poll (2.5) | 10–20 min | | ||
| + | | Chassis power cycle + recheck (2.6–2.7) | 10 min | | ||
| + | | CommitImage (2.8) | 2 min | | ||
| + | | AC power cycle (2.9, includes 6m drain) | 20 min | | ||
| + | | NIC / TPM / NVMe + chassis cycle (2.10) | 15 min | | ||
| + | | Validation (Phase 3) | 15 min | | ||
| + | | **Total — Hopper DGX/ | ||
| + | | GB200 NVL72 rack (add HMC, NVLink Switch tray, rack PDU cycle 10m+15m boot) | **add 1h 15m** | | ||
| + | |||
| + | Schedule a **3-hour window for Hopper**, **4-hour window for Blackwell B200**, **5-hour window for GB200 NVL72**. | ||
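As a sanity check on the Hopper window, the per-step figures from the table can be summed. This is a rough sketch only: the pre-upgrade preparation figure is not legible in the table as rendered here, so it is deliberately left out and the result shows how much of the 3-hour window remains for preparation and contingency.

```bash
#!/usr/bin/env bash
# Per-step minutes from the time-budget table, steps 2.1 through Phase 3.
# Order: BMC, motherboard tray, GPU tray, chassis cycle, CommitImage,
#        AC cycle, NIC/TPM/NVMe, validation.
# Pre-upgrade preparation is intentionally excluded (value not available here).
fixed_steps=(10 15 25 10 2 20 15 15)
poll_min=10   # BackgroundCopyStatus poll, lower bound
poll_max=20   # BackgroundCopyStatus poll, upper bound

sum=0
for m in "${fixed_steps[@]}"; do
    sum=$((sum + m))
done

total_min=$((sum + poll_min))
total_max=$((sum + poll_max))
window=180    # 3-hour Hopper window, in minutes

echo "Steps after preparation: ${total_min}-${total_max} min"
echo "Left for preparation and contingency: $((window - total_max))-$((window - total_min)) min"
```

With the table values this gives 122 to 132 minutes of execution, leaving 48 to 58 minutes of the Hopper window for preparation and slack.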
| ---- | ---- | ||
| Line 57: | Line 108: | ||
| ==== Tool Overview ==== | ==== Tool Overview ==== | ||
| - | ^ Tool ^ Purpose | + | ^ Tool ^ Purpose |
| - | | **nvfwupd** | + | | **nvfwupd** |
| - | | **NVFlash** | + | | **NVFlash** |
| - | | **nvidia-smi** | Query GPU health, firmware version, running processes | + | | **nvidia-smi** | Query GPU health, firmware version, running processes |
| - | | **nvsm** | + | | **nvsm** |
| - | | **ipmitool** | + | | **ipmitool** |
| + | | **curl** | ||
| - | ==== Tool Selection by Platform | + | ==== nvfwupd Version Requirement |
| - | ^ Platform | + | * **Hopper (H100/H200) minimum: |
| - | | DGX H100 / H200 | + | * **Blackwell (B200 / GB200 / GB300) minimum: |
| - | | HGX H100 / H200 | nvfwupd | + | |
| - | | DGX A100 | Firmware Update Container | | + | |
| - | | Standalone PCIe GPU (A100, RTX etc) | NVFlash | + | |
| ==== Key Commands Reference ==== | ==== Key Commands Reference ==== | ||
| Line 76: | Line 127: | ||
| <code bash> | <code bash> | ||
| # --- nvfwupd --- | # --- nvfwupd --- | ||
| - | # Show installed vs package versions | + | nvfwupd --version |
| + | |||
| + | nvfwupd -t ip=< | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | # Update all firmware components | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | |||
| - | # Update specific components only (with selector JSON) | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | # Force downgrade controls | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | # --- NVFlash | + | # Blackwell / Mission Control activation (replaces ipmitool mc reset cold) |
| - | sudo nvflash --list | + | nvfwupd |
| - | sudo nvflash --version | + | activate_fw |
| - | sudo nvflash --index=0 --save backup_$(date +%Y%m%d).rom | + | |
| - | sudo nvflash --index=0 new_firmware.rom | + | # --- mlxfwmanager |
| + | sudo mlxfwmanager --query | ||
| + | sudo mlxfwmanager -u -i fw-ConnectX7-rel-< | ||
| + | |||
| + | # --- NVFlash | ||
| + | sudo nvflash --list | ||
| + | sudo nvflash --version | ||
| + | sudo nvflash --index=0 --save backup_$(date +%Y%m%d).rom | ||
| + | sudo nvflash --index=0 new_firmware.rom | ||
| # --- nvidia-smi --- | # --- nvidia-smi --- | ||
| - | nvidia-smi -L # List all GPUs | + | nvidia-smi -L |
| - | nvidia-smi -q # Full GPU info | + | nvidia-smi -q |
| - | nvidia-smi -q | grep "VBIOS Version" | + | nvidia-smi -q | grep "VBIOS Version" |
| - | nvidia-smi --query-compute-apps=pid, | + | nvidia-smi --query-compute-apps=pid, |
| - | nvidia-smi nvlink -s -i 0 # NVLink status for GPU 0 | + | nvidia-smi nvlink -s -i 0 |
| # --- ipmitool --- | # --- ipmitool --- | ||
| - | ipmitool -I lanplus -H < | + | ipmitool -I lanplus -H < |
| - | ipmitool -I lanplus -H < | + | ipmitool -I lanplus -H < |
| - | ipmitool -I lanplus -H < | + | ipmitool -I lanplus -H < |
| - | ipmitool -I lanplus -H < | + | ipmitool -I lanplus -H < |
| # --- nvsm --- | # --- nvsm --- | ||
| - | nvsm show health | + | nvsm show health |
| </ | </ | ||
| Line 117: | Line 174: | ||
| ===== 4. Prerequisites ===== | ===== 4. Prerequisites ===== | ||
| - | __Complete | + | <note important> |
| + | Complete | ||
| + | </ | ||
| ==== Access & Credentials ==== | ==== Access & Credentials ==== | ||
| - | |||
| * [ ] BMC IP address, username, and password (administrator privileges) | * [ ] BMC IP address, username, and password (administrator privileges) | ||
| * [ ] SSH access to the target server | * [ ] SSH access to the target server | ||
| - | * [ ] Access to NVIDIA Enterprise Support Portal | + | |
| + | | ||
| + | |||
| + | ==== Tooling ==== | ||
| + | * [ ] **nvfwupd ≥ 2.0.4** installed on management host | ||
| + | * [ ] ipmitool, curl, jq available on management host | ||
| ==== Firmware Package ==== | ==== Firmware Package ==== | ||
| + | * [ ] Correct .fwpkg for your specific platform (GPU tray + motherboard tray packages) | ||
| + | * [ ] SHA256 checksum verified against release notes | ||
| + | * [ ] Per-package release notes reviewed for driver minimums and known issues | ||
| + | * [ ] **Previous .fwpkg saved for rollback** | ||
| - | * [ ] Correct | + | ==== BMC Version Hop (pre-1.1.3 systems) ==== |
| - | * [ ] SHA256 checksum verified against NVIDIA release notes | + | * [ ] If current BMC firmware is **v1.0.0 or v1.1.1**, you **must** stage to **v1.1.3** first, reset BMC, and verify, before attempting **v24.09.1** (or any later bundle). The full hop is **v1.0.0 / v1.1.1 → v1.1.3 → v24.09.1**. Skipping the intermediate hop requires a **factory reset** of the BMC to recover. |
| - | * [ ] Package stored on the machine that will run nvfwupd | + | * [ ] Confirm current BMC version (pin to the BMC Redfish path, not a substring match): |
| + | <code bash> | ||
| + | | ||
| + | | grep -E ' | ||
| + | </ | ||
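The hop rule above can be encoded so that an operator does not have to remember it mid-window. This is a sketch under stated assumptions: `bmc_hop_plan` is a hypothetical helper, it knows only the three starting versions named in this section, and it refuses anything else rather than guessing.

```bash
#!/usr/bin/env bash
# bmc_hop_plan CURRENT_VERSION
# Prints the BMC versions to install, in order, to reach v24.09.1.
# Only the versions named in this runbook are handled; anything else is an error.
bmc_hop_plan() {
    case "$1" in
        v1.0.0|v1.1.1) echo "v1.1.3 v24.09.1" ;;   # intermediate hop required
        v1.1.3)        echo "v24.09.1" ;;          # direct update is allowed
        *)
            echo "unlisted BMC version $1: check the package release notes" >&2
            return 1 ;;
    esac
}

# Example: a system currently on v1.1.1 needs two passes.
for target in $(bmc_hop_plan v1.1.1); do
    echo "stage BMC ${target}, reset the BMC, verify the reported version"
done
```

Failing on unlisted versions is a deliberate choice: a wrong hop is only recoverable by a BMC factory reset, so the helper should stop rather than assume.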
| ==== Environment ==== | ==== Environment ==== | ||
| - | + | | |
| - | | + | * [ ] '' |
| - | * [ ] '' | + | * [ ] Maintenance window: **3h Hopper / 4h B200 / 5h GB200 NVL72** |
| - | * [ ] Maintenance window | + | * [ ] VBIOS backed up via NVFlash (standalone PCIe only) |
| - | * [ ] VBIOS backed up using NVFlash (standalone PCIe GPUs only) | + | * [ ] **MIG disabled** on all GPUs: '' |
| + | * [ ] **nvidia-fabricmanager stopped**: '' | ||
| + | * [ ] **Confidential Computing (CC) mode disabled** if previously enabled — some .fwpkg versions refuse VBIOS flash while CC is on. Check the package release notes. | ||
| + | * [ ] **PSU redundancy healthy**: '' | ||
| + | * [ ] **Driver minimum verified against this package' | ||
| + | * [ ] **Pre-upgrade SEL captured**: '' | ||
| ==== System State ==== | ==== System State ==== | ||
| - | |||
| * [ ] Management host can ping the BMC IP | * [ ] Management host can ping the BMC IP | ||
| - | * [ ] Server | + | * [ ] Server on stable power |
| - | * [ ] No other firmware operations | + | * [ ] No other firmware operations running on the same server |
| - | * [ ] Server reports healthy — confirmed via '' | + | * [ ] '' |
| ==== Rollback Readiness ==== | ==== Rollback Readiness ==== | ||
| - | + | | |
| - | | + | * [ ] Previous firmware .fwpkg saved |
| - | * [ ] Previous firmware .fwpkg | + | |
| * [ ] Rollback procedure reviewed (Section 9) | * [ ] Rollback procedure reviewed (Section 9) | ||
| Line 155: | Line 229: | ||
| ===== 5. Best Practices ===== | ===== 5. Best Practices ===== | ||
| - | ^ # ^ Practice | + | ^ # ^ Practice |
| - | | 1 | Always run '' | + | | 1 |
| - | | 2 | Back up VBIOS before flashing | + | | 2 |
| - | | 3 | Stop all GPU processes including '' | + | | 3 |
| - | | 4 | Never interrupt mid-flash | + | | 4 |
| - | | 5 | Verify the package checksum | + | | 5 |
| - | | 6 | Wait for BackgroundCopyStatus: | + | | 6 |
| - | | 7 | Use a wired management network | + | | 7 | **Always AC-cycle after CommitImage** |
| - | | 8 | + | | 8 | Use a wired management network |
| - | | 9 | Validate before restoring workloads | + | | 9 | Disable ForceUpdate immediately after downgrade |
| - | | 10 | Document pre and post versions | + | | 10 | Update BMC first, everything else after | Subsequent components depend on the new BMC Redfish schema |
| + | | 11 | Validate before restoring workloads | ||
| + | | 12 | Document pre and post versions | ||
| ---- | ---- | ||
| Line 174: | Line 250: | ||
| <code bash> | <code bash> | ||
| - | # DGX / HGX systems | + | # Pin the full pre-upgrade output to a file for the change ticket |
| - | nvfwupd -t ip=< | + | nvfwupd -t ip=< |
| + | | tee preupgrade_versions_$(date +%Y%m%d_%H%M).txt | ||
| # Standalone PCIe GPU | # Standalone PCIe GPU | ||
| - | nvidia-smi -q | grep "VBIOS Version" | + | nvidia-smi -q | grep "VBIOS Version" |
| </ | </ | ||
| - | Copy the full output to your change ticket. Fill in the table: | + | Copy full output to your change ticket |
| - | ^ Component | + | ^ Component |
| - | | VBIOS | | + | | BMC | |
| - | | NVSwitch FW | | + | | SBIOS | |
| - | | EROT FW | | + | | CPLD |
| - | | FPGA FW | | + | | VBIOS | |
| - | | BMC | + | | NVSwitch FW | |
| - | | System BIOS | + | | EROT FW | |
| + | | FPGA FW | | ||
| + | | PSU | ||
| + | | NIC | ||
| + | | TPM | | ||
| + | | NVMe | ||
| ==== Step 1.2 — Verify GPU Health ==== | ==== Step 1.2 — Verify GPU Health ==== | ||
| <code bash> | <code bash> | ||
| - | nvidia-smi -L # All expected GPUs must appear | + | nvidia-smi -L |
| - | nvsm show health | + | nvsm show health |
| </ | </ | ||
| - | |||
| - | __If any GPU is missing or any component is degraded — **stop**. Investigate before proceeding. __ | ||
| ==== Step 1.3 — Stop All GPU Workloads ==== | ==== Step 1.3 — Stop All GPU Workloads ==== | ||
| - | |||
| - | Stop all of the following before continuing: | ||
| - | |||
| - | * AI/ML training jobs, inference servers, GPU-accelerated applications | ||
| - | * Any monitoring script or agent that runs '' | ||
| - | * Kubernetes GPU pods — drain the node first: | ||
| <code bash> | <code bash> | ||
| kubectl drain < | kubectl drain < | ||
| + | nvidia-smi --query-compute-apps=pid, | ||
| + | # Output must be empty | ||
| </ | </ | ||
| - | Confirm no processes remain: | + | ==== Step 1.3b — Disable MIG and Stop Fabric Manager ==== |
| <code bash> | <code bash> | ||
| - | nvidia-smi --query-compute-apps=pid, | + | # Disable MIG on all GPUs (no-op if already disabled) |
| - | # Output must be completely empty before continuing | + | sudo nvidia-smi -mig 0 |
| - | </ | + | |
| - | ==== Step 1.4 — Backup VBIOS (Standalone PCIe GPUs Only) ==== | + | # Stop fabric manager |
| + | # when the chassis is power-cycled in Phase 2. | ||
| + | sudo systemctl stop nvidia-fabricmanager | ||
| + | systemctl is-active nvidia-fabricmanager | ||
| - | __Skip this step for DGX/HGX systems managed by nvfwupd.__ | + | # Confidential Computing — disable if currently enabled and the package |
| + | # release notes require it. (Skip if your environment never uses CC mode.) | ||
| + | # nvidia-smi conf-compute -srs 0 | ||
| + | </code> | ||
| + | |||
| + | ==== Step 1.4 — Backup VBIOS (Standalone PCIe Only) ==== | ||
| <code bash> | <code bash> | ||
| sudo nvflash --index=0 --save gpu0_backup_$(date +%Y%m%d).rom | sudo nvflash --index=0 --save gpu0_backup_$(date +%Y%m%d).rom | ||
| </ | </ | ||
| - | |||
| - | Store the .rom file on a separate host. This is your only rollback option if the flash fails. | ||
| ==== Step 1.5 — Verify Package Integrity ==== | ==== Step 1.5 — Verify Package Integrity ==== | ||
| Line 234: | Line 315: | ||
| sha256sum nvfw_DGX-HGX-H100-H200x8_< | sha256sum nvfw_DGX-HGX-H100-H200x8_< | ||
| </ | </ | ||
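Comparing hashes by eye is error-prone, so the comparison can be scripted. A minimal sketch, assuming GNU coreutils `sha256sum`; `verify_sha256` is a hypothetical helper name, and the expected value is whatever the release notes for your package list.

```bash
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when the SHA-256 of FILE equals EXPECTED_HEX.
# Returns 2 if the file cannot be read, 1 on mismatch.
verify_sha256() {
    local file="$1" expected actual
    if [ ! -r "$file" ]; then
        echo "cannot read: $file" >&2
        return 2
    fi
    # Normalise to lower case; release notes may print upper-case hex.
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (computed $actual)" >&2
        return 1
    fi
}
```

Call it with the package path and the checksum copied from the release notes, and stop the procedure on any non-zero exit status.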
| - | |||
| - | Compare the output against the checksum in the NVIDIA release notes. If they do not match — re-download the package before proceeding. | ||
| ==== Step 1.6 — Preview What Will Change ==== | ==== Step 1.6 — Preview What Will Change ==== | ||
| Line 242: | Line 321: | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| </ | </ | ||
| - | |||
| - | Every component with a version mismatch will be updated. Confirm this matches your intent before proceeding. | ||
| ---- | ---- | ||
| Line 249: | Line 326: | ||
| ===== 7. Phase 2 — Firmware Upgrade Execution ===== | ===== 7. Phase 2 — Firmware Upgrade Execution ===== | ||
| - | __ **Do not** close your terminal, disconnect from the BMC, or allow the server to lose power once flashing begins. Any interruption mid-flash can permanently damage hardware. __ | + | <note warning> |
| + | **Strict ordering — do not deviate.** BMC first, then motherboard tray, then GPU tray, then poll, then chassis | ||
| + | </ | ||
| ==== 7.1 DGX / HGX — nvfwupd ==== | ==== 7.1 DGX / HGX — nvfwupd ==== | ||
| - | === Step 2.1 — Flash the GPU Tray === | + | === Step 2.1 — Update BMC === |
| - | Updates VBIOS, NVSwitch, EROT, and FPGA firmware for all GPUs in the system. | + | Use the documented one-shot BMC selector. Do not hand-enumerate child paths — let nvfwupd resolve them from the package. |
| <code bash> | <code bash> | ||
| + | cat > selector_bmc.json <<' | ||
| + | { | ||
| + | " | ||
| + | "/ | ||
| + | ] | ||
| + | } | ||
| + | EOF | ||
| + | |||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | update_fw -p nvfw_DGX-HGX-H100-H200x8_< | + | update_fw -p nvfw_DGX_< |
| </ | </ | ||
| - | To update specific components only, create a selector JSON and pass it with '' | + | === Step 2.2 — Reset the BMC === |
| - | < | + | < |
| - | { "Components": [" | + | ipmitool -I lanplus -H < |
| + | sleep 180 | ||
| + | ipmitool -I lanplus -H < | ||
| + | |||
| + | # Pin the recheck to the BMC Redfish path (not a substring match on "bmc", | ||
| + | # which also matches HGX_FW_BMC_* aggregate entries on some packages). | ||
| + | nvfwupd -t ip=< | ||
| + | | grep -E ' | ||
| </ | </ | ||
| + | |||
| + | Confirm the BMC reports the new version before continuing. | ||
| + | |||
| + | === Step 2.3 — Update Motherboard Tray (SBIOS, CPLD, PSU) === | ||
| + | |||
| + | The motherboard tray uses the **empty selector** ('' | ||
| <code bash> | <code bash> | ||
| + | cat > selector_mb.json <<' | ||
| + | {} | ||
| + | EOF | ||
| + | |||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | update_fw -p <pkg.fwpkg> -y -s selector.json | + | update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_mb.json |
| </ | </ | ||
| - | === Step 2.2 — Wait for Background Copy to Complete === | + | Cross-check beforehand with '' |
| - | The firmware is written to the GPU's flash storage in the background. | + | === Step 2.4 — Update GPU Tray (VBIOS, NVSwitch, EROT, FPGA) === |
| + | |||
| + | The GPU tray uses the **HGX_0 one-shot selector**, which targets the entire GPU baseboard inventory in a single call. The BMC fans out internally to all GPUs, NVSwitch chips, EROT, and FPGA — do not enumerate per-GPU paths. | ||
| <code bash> | <code bash> | ||
| - | curl -k -u <USER>:<PASS> \ | + | cat > selector_gpu.json |
| - | | + | { |
| + | | ||
| + | "/ | ||
| + | ] | ||
| + | } | ||
| + | EOF | ||
| + | |||
| + | nvfwupd -t ip=< | ||
| + | update_fw -p nvfw_DGX-HGX-H100-H200x8_< | ||
| </ | </ | ||
| - | Poll every 60 seconds. Only proceed when you see: | + | Verify the exact child paths your platform exposes with '' |
| - | <code -> | + | === Step 2.5 — Wait for Background Copy to Complete === |
| - | " | + | |
| + | < | ||
| + | curl -k -u <USER>:< | ||
| + | https://< | ||
| </ | </ | ||
| - | If you see '' | + | Poll every 60 seconds until '' |
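The 60-second poll can be wrapped in a bounded loop so that it cannot run unattended forever. This is a sketch, not the documented procedure: `wait_for_completed` is a hypothetical helper that runs whatever status command you pass it (for example the curl call from this step) and looks for the string `Completed` in its output. The exact JSON layout of the response is not shown here, so the match is a plain substring test.

```bash
#!/usr/bin/env bash
# wait_for_completed INTERVAL_SEC MAX_TRIES CMD [ARGS...]
# Runs CMD up to MAX_TRIES times, INTERVAL_SEC apart, until its output
# contains "Completed". Returns 0 on success, 1 if the limit is reached.
wait_for_completed() {
    local interval="$1" max_tries="$2" try out
    shift 2
    for ((try = 1; try <= max_tries; try++)); do
        # A failed status command counts as "not completed yet".
        out="$("$@" 2>/dev/null)" || out=""
        if printf '%s' "$out" | grep -q 'Completed'; then
            echo "background copy completed after ${try} poll(s)"
            return 0
        fi
        if [ "$try" -lt "$max_tries" ]; then
            sleep "$interval"
        fi
    done
    echo "gave up after ${max_tries} polls" >&2
    return 1
}
```

For example, `wait_for_completed 60 30 <status command>` polls for up to 30 minutes. If the response reports several components at once, tighten the match so that one completed component does not mask another that is still copying.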
| - | === Step 2.3 — Cold Reboot the Server | + | === Step 2.6 — Chassis Power Cycle === |
| <code bash> | <code bash> | ||
| Line 298: | Line 415: | ||
| Wait 5–10 minutes for the server to fully boot. | Wait 5–10 minutes for the server to fully boot. | ||
| - | === Step 2.4 — Flash the Motherboard Tray === | + | === Step 2.7 — Recheck Versions |
| - | After the server is back online, update | + | <code bash> |
| + | nvfwupd -t ip=< | ||
| + | </ | ||
| + | |||
| + | Confirm staged versions are visible on the BMC, SBIOS, and GPU tray components. | ||
| + | |||
| + | === Step 2.8 — CommitImage (BMC and SBIOS) === | ||
| + | |||
| + | CommitImage promotes the staged image to the active slot. **Required for BMC and SBIOS.** | ||
| <code bash> | <code bash> | ||
| + | # Commit BMC | ||
| + | curl -k -u < | ||
| + | -H " | ||
| + | -d ' | ||
| + | https://< | ||
| + | |||
| + | # Commit SBIOS | ||
| + | curl -k -u < | ||
| + | -H " | ||
| + | -d ' | ||
| + | https://< | ||
| + | </ | ||
| + | |||
| + | === Step 2.9 — AC Power Cycle (MANDATORY) === | ||
| + | |||
| + | <note warning> | ||
| + | This step is **non-negotiable** for EROT and CPLD activation. A chassis power cycle is **not** sufficient. | ||
| + | </ | ||
| + | |||
| + | - Power off the host via OS shutdown or '' | ||
| + | - **Physically remove all AC power cords** (or open the PDU breakers feeding the chassis). | ||
| + | - **Wait 5–6 minutes** to allow standby rails to fully discharge. | ||
| + | - Reconnect AC power and power the server back on. | ||
| + | - Wait 5–10 minutes for full boot. | ||
| + | |||
| + | === Step 2.10 — Update NIC (mlxfwmanager), | ||
| + | |||
| + | These components are independent of the GPU/EROT activation chain and are updated last. | ||
| + | |||
| + | **Hopper (DGX/HGX H100/H200) — NICs via host-OS mlxfwmanager, | ||
| + | |||
| + | <code bash> | ||
| + | # Inventory and current versions on the host OS | ||
| + | sudo mlxfwmanager --query | ||
| + | |||
| + | # Apply the bundled MFT firmware image (matches your ConnectX SKU) | ||
| + | sudo mlxfwmanager -u -i fw-ConnectX7-rel-< | ||
| + | |||
| + | # Re-query to confirm the new FW Version is present and " | ||
| + | sudo mlxfwmanager --query | ||
| + | </ | ||
| + | |||
| + | A subsequent chassis power cycle (or PCIe FLR via '' | ||
| + | |||
| + | TPM and NVMe remain BMC-driven on Hopper. Confirm the exact inventory paths on your chassis with '' | ||
| + | |||
| + | <code bash> | ||
| + | # After confirming the actual inventory names on your chassis: | ||
| + | nvfwupd -t ip=< | ||
| + | show_version | grep -Ei ' | ||
| + | |||
| + | # Then build a selector_tpm_nvme.json with the paths returned above and: | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | update_fw -p nvfw_DGX_< | + | update_fw -p nvfw_DGX_< |
| </ | </ | ||
| - | __The motherboard tray uses a different .fwpkg file than the GPU tray. Check the NVIDIA release notes for the correct package name for your system. __ | + | **Blackwell (B200 / GB200) — also update onboard ConnectX-7/ |
| - | === Step 2.5 — Reset the BMC === | + | <code bash> |
| + | cat > selector_periph_bw.json <<' | ||
| + | { | ||
| + | " | ||
| + | "/ | ||
| + | "/ | ||
| + | "/ | ||
| + | "/ | ||
| + | "/ | ||
| + | "/ | ||
| + | ] | ||
| + | } | ||
| + | EOF | ||
| + | nvfwupd -t ip=< | ||
| + | update_fw -p nvfw_GB200_< | ||
| + | </ | ||
| + | |||
| + | Then reboot the host: | ||
| <code bash> | <code bash> | ||
| - | ipmitool -I lanplus -H < | + | ipmitool -I lanplus -H < |
| </ | </ | ||
| - | Wait 2–3 minutes, then confirm | + | ==== 7.1B Blackwell — GB200 / B200 (Mission Control flow) ==== |
| + | |||
| + | <note important> | ||
| + | This subsection summarises the **NVIDIA Mission Control rack-firmware document** that ships with the Blackwell bundle. Cross-check every selector path and package P-number against the Mission Control doc for your bundle version before executing. The component path names below are the documented Mission Control names — they are not the same as the Hopper '' | ||
| + | </ | ||
| + | |||
| + | Blackwell delivers two top-level packages plus the NVLink Switch tray package: | ||
| + | |||
| + | ^ Package name ^ Selector JSON | ||
| + | | '' | ||
| + | | '' | ||
| + | | '' | ||
| + | |||
| + | **Ordering (per Mission Control): | ||
| + | |||
| + | - **B1.** Push '' | ||
| + | - **B2.** '' | ||
| + | - **B3.** **AC power cycle between BMC and HMC updates** (rack PDU cycle for NVL72; chassis cord pull for standalone B200). | ||
| + | - **B4.** Push '' | ||
| + | - **B5.** '' | ||
| + | - **B6.** Push the NVLink Switch tray package from the rack-level management host against the NVLink Switch tray BMC (not the compute tray BMC). | ||
| + | - **B7.** Final **rack-level AC cycle** (Step 2.9 / 2.9b) to activate EROT and CPLD on every tray. | ||
| + | |||
| + | === Step B1 — Push BMC_Full to the chassis | ||
| + | |||
| + | The Mission Control selector JSON is a one-shot; do not enumerate child components. | ||
| <code bash> | <code bash> | ||
| - | ipmitool -I lanplus | + | cat > BMC_Full.json <<' |
| + | { | ||
| + | " | ||
| + | "/ | ||
| + | ] | ||
| + | } | ||
| + | EOF | ||
| + | |||
| + | nvfwupd | ||
| + | update_fw -p nvfw_BMC_P4978.fwpkg -y -s BMC_Full.json | ||
| </ | </ | ||
| - | ==== 7.2 Standalone PCIe GPU — NVFlash ==== | + | === Step B2 — Activate the chassis BMC payload |
| - | === Step 2.6 — Unload | + | Blackwell uses the documented Mission Control activation verb, not '' |
| <code bash> | <code bash> | ||
| - | sudo rmmod nvidia_uvm | + | nvfwupd -t ip=< |
| - | sudo rmmod nvidia_drm | + | |
| - | sudo rmmod nvidia_modeset | + | </ |
| - | sudo rmmod nvidia | + | |
| - | # Confirm | + | === Step B3 — AC cycle between BMC and HMC updates === |
| + | |||
| + | This is **not** the final EROT/CPLD AC cycle — it sits between the two halves of the bundle and is mandatory. | ||
| + | |||
| + | - Standalone B200: chassis cord pull, 5 min drain. | ||
| + | - GB200 NVL72: rack-level PDU cycle (both A-side and B-side feeds), 10 min drain, 15 min boot. | ||
| + | - Confirm | ||
| + | |||
| + | === Step B4 — Push Compute_Full === | ||
| + | |||
| + | '' | ||
| + | |||
| + | <code bash> | ||
| + | cat > Compute_Full.json <<' | ||
| + | { | ||
| + | " | ||
| + | "/ | ||
| + | ] | ||
| + | } | ||
| + | EOF | ||
| + | |||
| + | nvfwupd -t ip=< | ||
| + | update_fw -p nvfw_Compute_P4978.fwpkg -y -s Compute_Full.json | ||
| + | </ | ||
| + | |||
| + | === Step B5 — Activate Compute_Full === | ||
| + | |||
| + | <code bash> | ||
| + | nvfwupd -t ip=< | ||
| + | activate_fw -c RF_AUX_PWR_CYCLE | ||
| + | </ | ||
| + | |||
| + | === Step B6 — NVLink Switch tray (GB200 NVL72) === | ||
| + | |||
| + | Run from the rack-level management host against the NVLink Switch tray BMC. Confirm the exact tray-side selector path names from the Mission Control NVLink-Switch addendum for your bundle — they have changed between P-numbers. | ||
| + | |||
| + | <code bash> | ||
| + | nvfwupd -t ip=< | ||
| + | update_fw -p nvfw_NVLSwitch_P4978.fwpkg -y -s NVLSwitch_Full.json | ||
| + | |||
| + | nvfwupd -t ip=< | ||
| + | activate_fw -c RF_AUX_PWR_CYCLE | ||
| + | </ | ||
| + | |||
| + | === Step B7 — Final rack-level AC cycle (GB200 NVL72) === | ||
| + | |||
| + | For GB200 NVL72 the final AC step is a **rack-level PDU cycle**, not a single chassis cord pull. Coordinate with DC operations: | ||
| + | - Schedule a full rack power-down window. | ||
| + | - Open both A-side and B-side PDU feeds (the rack is fed redundantly — pulling only one side leaves standby rails energised). | ||
| + | - Wait **10 minutes** | ||
| + | - Restore power; allow 15 minutes for the full rack to come up before validation. | ||
| + | |||
| + | ==== 7.2 Standalone PCIe GPU — NVFlash ==== | ||
| + | |||
| + | === Step 2.11 — Unload the NVIDIA Driver === | ||
| + | |||
| + | <code bash> | ||
| + | sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia | ||
| lsmod | grep nvidia | lsmod | grep nvidia | ||
| </ | </ | ||
| - | === Step 2.7 — Flash the VBIOS === | + | === Step 2.12 — Flash the VBIOS === |
| <code bash> | <code bash> | ||
| - | # Flash GPU at index 0 — increment --index for additional GPUs (1, 2, ...) | ||
| sudo nvflash --index=0 new_vbios_firmware.rom | sudo nvflash --index=0 new_vbios_firmware.rom | ||
| - | # NVFlash will prompt — review current vs new version shown, then type " | ||
| </ | </ | ||
| - | === Step 2.8 — Reboot === | + | === Step 2.13 — Reboot === |
| <code bash> | <code bash> | ||
| Line 356: | Line 638: | ||
| <code bash> | <code bash> | ||
| - | # DGX / HGX | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | |||
| - | # Standalone PCIe | ||
| nvidia-smi -q | grep "VBIOS Version" | nvidia-smi -q | grep "VBIOS Version" | ||
| </ | </ | ||
| - | Every updated | + | Every component must report |
| ==== Step 3.2 — Verify GPU Detection ==== | ==== Step 3.2 — Verify GPU Detection ==== | ||
| Line 370: | Line 649: | ||
| nvidia-smi -L | nvidia-smi -L | ||
| </ | </ | ||
| - | |||
| - | All expected GPUs must appear with correct model names. No '' | ||
| ==== Step 3.3 — Run Health Check ==== | ==== Step 3.3 — Run Health Check ==== | ||
| <code bash> | <code bash> | ||
| - | nvidia-smi -q # Check per-GPU: VBIOS version, ECC mode, temperature, | + | nvidia-smi -q |
| - | nvsm show health | + | nvsm show health |
| </ | </ | ||
| - | ^ Field ^ Expected | + | ==== Step 3.4 — Check NVLink ==== |
| - | | VBIOS Version | + | |
| - | | ECC Mode | Enabled | + | |
| - | | Temperature | + | |
| - | | Power Draw | Within TDP spec | | + | |
| - | + | ||
| - | ==== Step 3.4 — Check NVLink | + | |
| <code bash> | <code bash> | ||
| nvidia-smi nvlink -s -i 0 | nvidia-smi nvlink -s -i 0 | ||
| </ | </ | ||
| - | |||
| - | All NVLink ports must show **Active**. Any inactive port must be investigated before restoring workloads. | ||
| ==== Step 3.5 — Smoke Test ==== | ==== Step 3.5 — Smoke Test ==== | ||
| Line 399: | Line 668: | ||
| python3 -c " | python3 -c " | ||
| </ | </ | ||
| - | |||
| - | Expected: '' | ||
| ==== Step 3.6 — Record Final State ==== | ==== Step 3.6 — Record Final State ==== | ||
| - | ^ Component | + | ^ Component |
| - | | VBIOS | + | | BMC | |
| - | | NVSwitch FW | | + | | SBIOS | |
| - | | EROT FW | | + | | CPLD |
| - | | FPGA FW | | + | | VBIOS | |
| - | | BMC | + | | NVSwitch FW | |
| - | | System BIOS | + | | EROT FW | |
| + | | FPGA FW | | ||
| + | | PSU | | ||
| + | | NIC | | ||
| + | | TPM | | ||
| + | | NVMe | ||
| - | ==== Step 3.7 — Restore Workloads | + | ==== Step 3.7 — Capture Post-Upgrade SEL and Restart Services |
| - | + | ||
| - | Only proceed after all checks above pass: | + | |
| <code bash> | <code bash> | ||
| - | kubectl uncordon | + | # Diff against the pre-upgrade SEL captured in Section 4 |
| - | </code> | + | ipmitool -I lanplus -H <BMC_IP> |
| + | > postupgrade_sel_$(date +%Y%m%d).log | ||
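| + | # Each glob must expand to exactly one file; otherwise pass the two file names explicitly | ||
| + | diff preupgrade_sel_*.log postupgrade_sel_*.log | ||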
| - | * Restart | + | # Restart |
| - | * Re-enable GPU monitoring scripts | + | sudo systemctl start nvidia-fabricmanager |
| + | sudo systemctl status nvidia-fabricmanager | ||
| - | Monitor the system for 30–60 minutes after restoring workloads. | + | # Re-enable MIG only if your workload requires it |
| + | # sudo nvidia-smi -mig 1 | ||
| + | </ | ||
| - | ---- | + | Investigate any new SEL entries (assertion events, thermal trips, fan failures) before restoring workloads. |
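A plain `diff` of two SEL dumps also flags entries whose record ID changed, which is noisy if the SEL was cleared in between. The sketch below lists only entries present in the post-upgrade log and absent from the pre-upgrade log. It assumes `ipmitool sel list` output with `|`-separated fields where the first field is the record ID; `new_sel_entries` is a hypothetical helper name.

```bash
# new_sel_entries (hypothetical helper): prints SEL entries that appear in the
# post-upgrade log ($2) but not in the pre-upgrade log ($1).
# ASSUMPTION: each line is "<id> | <date> | <time> | <sensor> | <event> | <state>".
new_sel_entries() {
    local pre_sorted
    pre_sorted=$(mktemp)
    # Drop the record ID (field 1) so renumbering does not create false differences
    cut -d'|' -f2- "$1" | sort -u > "$pre_sorted"
    # comm -13 keeps lines unique to the second input (the post-upgrade log)
    cut -d'|' -f2- "$2" | sort -u | comm -13 "$pre_sorted" -
    rm -f "$pre_sorted"
}
```

Possible usage: `new_sel_entries preupgrade_sel_<date>.log postupgrade_sel_<date>.log`. Output is sorted lexically, not strictly by time, so treat it as a list of events to look up rather than a timeline.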
| - | ===== 9. Rollback / Downgrade ===== | + | ==== Step 3.8 — Restore Workloads |
| - | + | ||
| - | __Only downgrade if the new firmware causes a confirmed regression. Raise an NVIDIA Support case before attempting. __ | + | |
| - | + | ||
| - | ==== 9.1 Standalone PCIe GPUs (NVFlash) | + | |
| <code bash> | <code bash> | ||
| - | # Unload driver | + | kubectl uncordon <node-name> |
| - | sudo rmmod nvidia_uvm | + | |
| - | sudo rmmod nvidia_drm | + | |
| - | sudo rmmod nvidia_modeset | + | |
| - | sudo rmmod nvidia | + | |
| - | + | ||
| - | # Flash the saved backup | + | |
| - | sudo nvflash | + | |
| - | + | ||
| - | # Reboot | + | |
| - | sudo reboot | + | |
| </ | </ | ||
| - | Then re-run all Phase 3 validation steps. | + | Monitor the system for 30–60 minutes after restoring workloads. | ||
| - | ==== 9.2 DGX / HGX Systems (nvfwupd) ==== | + | ---- |
| - | <code bash> | + | ===== 9. Rollback / Downgrade ===== |
| - | # Step 1 — Enable ForceUpdate flag (required to allow downgrade) | + | |
| - | nvfwupd -t ip=< | + | |
| - | # Step 2 — Confirm flag is active | + | Same flow as the forward upgrade, but with the previous .fwpkg and '' |
| - | nvfwupd -t ip=< | + | |
| - | # Step 3 — Flash the previous firmware version | + | ---- |
| - | nvfwupd | + | |
| - | update_fw | + | |
| - | # Step 4 — Wait for BackgroundCopyStatus: | + | ===== 10. Troubleshooting ===== |
| - | # Step 5 — MANDATORY: Disable | + | ^ Symptom |
| - | nvfwupd | + | | '' |
| + | | '' | ||
| + | | VBIOS update blocked | ||
| + | | BackgroundCopyStatus: Failed | ||
| + | | EROT/CPLD version unchanged after reboot | ||
| + | | Server does not boot after AC cycle | Failed flash | BMC serial console + '' | ||
| + | | GPU missing after upgrade | ||
| + | | NVLink ports Inactive | ||
| + | | Firmware unchanged after reboot | ||
| + | | '' | ||
| + | | BMC unreachable | ||
| + | | PSU update refuses to start | PSU redundancy degraded | ||
| + | | Fabric Manager fails to start post-upgrade | NVSwitch FW / driver mismatch | ||
| + | | New SEL entries after upgrade | ||
| - | # Step 6 — Confirm it is disabled | + | ==== BMC Recovery (Unreachable BMC) ==== |
| - | nvfwupd -t ip=< | + | |
| - | </ | + | |
| - | Then re-run all Phase 3 validation steps. | + | If the BMC is unreachable after an update (no Redfish, no IPMI, no SSH), **do not attempt field recovery from a USB image**. The BMC recovery image, layout, and procedure are NVIDIA-internal and SKU-specific; |
| - | ---- | + | Instead: |
| - | + | ||
| - | ===== 10. Troubleshooting ===== | + | |
| - | ^ Symptom | + | - **Open an NVIDIA Enterprise Support case** with severity matching your maintenance window. |
| - | | " | + | - **Capture |
| - | | VBIOS update blocked — GPU activity | + | |
| - | | BackgroundCopyStatus: | + | |
| - | | Server does not boot after reboot | + | |
| - | | GPU missing after upgrade | + | |
| - | | NVLink ports show Inactive | + | |
| - | | Firmware version unchanged after reboot | Rebooted before background copy finished | Re-run the update. Wait for BackgroundCopyStatus: | + | |
| - | | " | + | |
| ---- | ---- | ||
| Line 491: | Line 751: | ||
| ===== 11. Risk Register ===== | ===== 11. Risk Register ===== | ||
| - | ^ Risk ^ Likelihood | + | ^ Risk |
| - | | Wrong firmware file flashed | + | | Wrong firmware file flashed |
| - | | Power loss during flash | Low | Critical | + | | Power loss during flash |
| - | | Workload active during upgrade | + | | Workload active during upgrade |
| - | | Driver / firmware | + | | Driver / firmware mismatch |
| - | | Network loss to BMC mid-update | + | | Network loss to BMC mid-update |
| - | | Previous | + | | Previous |
| - | | ForceUpdate | + | | ForceUpdate left enabled |
| - | | Reboot before background copy completes | Medium | + | | AC cycle skipped — EROT/CPLD not active |
| + | | BMC version-hop skipped (v1.0.0/ | ||
| + | | Reboot before background copy completes | ||
| ---- | ---- | ||
| Line 505: | Line 767: | ||
| ===== 12. References ===== | ===== 12. References ===== | ||
| - | * [[https:// | + | * [[https:// |
| - | * [[https:// | + | * Per-package |
| - | * [[https:// | + | * [[https:// |
| ---- | ---- | ||
| - | |||