Differences

This shows you the differences between two versions of the page.

wiki:ai:gpu_firmware_upgrade [2026/05/08 17:25] smadanagopal wiki:ai:gpu_firmware_upgrade [2026/05/12 10:00] (current) smadanagopal
Line 1: Line 1:
-====== NVIDIA GPU Firmware Upgrade Runbook ======+====== NVIDIA GPU Firmware Upgrade Runbook (Revised) ======
  
 ===== Process at a Glance ===== ===== Process at a Glance =====
Line 5: Line 5:
   - **Prepare** — record current versions, verify GPU health, stop all workloads   - **Prepare** — record current versions, verify GPU health, stop all workloads
   - **Validate Package** — checksum the firmware file, preview what will change   - **Validate Package** — checksum the firmware file, preview what will change
-  - **Flash GPU Tray** — update VBIOS, NVSwitch, EROT, FPGA via nvfwupd +  - **Phase 2 (in this exact order):** 
-  - **Reboot** — wait for background copy to complete, then cold reboot +      - BMC firmware update 
-  - **Flash Motherboard Tray** — update BMC, BIOS, PSUs, reset BMC +      - BMC reset (mc reset cold) 
 +      - Motherboard tray update (SBIOS, CPLD, PSU, etc.) 
 +      - GPU tray update (VBIOS, NVSwitch, EROT, FPGA) 
 +      - Poll BackgroundCopyStatus until **Completed** 
 +      - Chassis power cycle (graceful, via Redfish/ipmitool) 
 +      - Re-check versions on staged components 
 +      - CommitImage on **HostBMC_0** and **HostBIOS_0** 
 +      - **AC power cycle** (full PDU/cord removal, 5–6 min) 
 +      - NIC firmware via ''mlxfwmanager'' on the host OS; TPM, NVMe via BMC (last — independent of the EROT/CPLD chain)
   - **Validate** — confirm new versions, GPU health, NVLink, smoke test   - **Validate** — confirm new versions, GPU health, NVLink, smoke test
  
-**NOTE:** __Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing.__+<note warning> 
 +Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing. 
 + 
 +**Critical ordering:** BMC must be updated and reset *before* the motherboard and GPU trays so the BMC speaks the new Redfish schema used by the rest of the bundle. Skipping the AC power cycle leaves EROT/CPLD images staged but not activated. 
 +</note>
  
 ---- ----
Line 16: Line 28:
 ===== 1. Overview ===== ===== 1. Overview =====
  
-This runbook covers the end-to-end firmware upgrade process for NVIDIA data center GPUs. It is self-contained — do not consult external documentation mid-upgrade.+This runbook covers the end-to-end firmware upgrade process for NVIDIA DGX/HGX H100/H200 systems, GB200/B200 (Blackwell) systems, and standalone PCIe GPUs. 
 + 
 +**Scope — what this runbook covers:** 
 +  * **Hopper:** DGX H100/H200, HGX H100/H200 
 +  * **Blackwell:** DGX B200, HGX B200, GB200 NVL72 compute tray (per-tray BMC), GB300 
 +  * **Standalone PCIe GPUs** via NVFlash 
 +  * Components: VBIOS, NVSwitch, EROT, FPGA, BMC, SBIOS, CPLD, PSU, NIC, TPM, NVMe, and (Blackwell only) Grace CPU FW, NVLink Switch tray FW, ConnectX-7/8 onboard NICs 
 + 
 +**Blackwell-specific notes (GB200 / B200):** 
 +  * GB200 NVL72 racks are driven from the **NVIDIA Mission Control** rack-firmware tooling. Refer to the Mission Control rack-firmware document shipped with the bundle — section 7.1B below summarises the flow but does **not** replace it. 
 +  * Bundles arrive as **two NVIDIA-released packages** plus the NVLink Switch tray package: ''nvfw_BMC_<P-number>.fwpkg'' and ''nvfw_Compute_<P-number>.fwpkg'' (Mission Control names them by NVIDIA "P-number" — e.g., **P4972** = early Blackwell baseline, **P4975** = first GA refresh, **P4978** = current — and selector JSONs are named ''BMC_Full.json'' and ''Compute_Full.json''). 
 +  * **Ordering, simplified:** chassis BMC (BMC_Full) → **AC cycle between BMC and HMC updates** → HMC + GPU/Grace complex (Compute_Full) → NVLink Switch tray → final rack-level AC cycle for EROT/CPLD. The AC cycle in the middle is required so the new chassis BMC re-enumerates the HMC under the updated Redfish schema before Compute_Full is pushed. 
 +  * **Activation:** Mission Control uses the documented activation method ''activate_fw -c RF_AUX_PWR_CYCLE'' (auxiliary-rail power cycle via Redfish), which replaces the ad-hoc "''chassis power cycle''" step that Hopper uses. 
 +  * The final mandatory **rack-level AC cycle** for EROT/CPLD still applies — for GB200 NVL72 this means power-cycling the full rack PDU feeds (coordinate with DC ops). 
 +  * **nvfwupd floor:** 2.0.4 for Hopper; **2.0.5 for Blackwell** (HMC + Grace + NVLink Switch tray Redfish paths), or **2.0.9** if you intend to run parallel multi-package updates across the rack. 
 + 
 +**Scope — what this runbook does NOT cover:** 
 +  * DGX A100 (use the Firmware Update Container — separate runbook) 
 +  * Standalone Mellanox/ConnectX adapter firmware in non-DGX hosts (use ''mlxfwmanager'' separately) 
 +  * Quantum/Spectrum InfiniBand or Ethernet switch firmware 
 +  * Storage controller / RAID HBA firmware
  
 ---- ----
Line 22: Line 54:
 ===== 2. Background ===== ===== 2. Background =====
  
-==== What Is GPU Firmware? ====+==== Components Updated ====
  
-Firmware is low-level software embedded in the GPU hardware itself. It controls boot behaviour, power management, and hardware communication — below the OS driver layer. +^  Component     ^  What It Controls                            ^  Activation Trigger             ^ 
 +| BMC          | Out-of-band server management                | BMC reset (mc reset cold)       | 
 +| SBIOS / UEFI | Motherboard initialisation                   | Chassis power cycle + CommitImage | 
 +| CPLD         | Board-level glue logic, power sequencing     | **AC power cycle (mandatory)**  | 
 +| VBIOS        | GPU boot, clocks, power limits, PCIe config  | Chassis power cycle             | 
 +| NVSwitch FW  | GPU-to-GPU interconnect (NVLink fabric)      | Chassis power cycle             | 
 +| EROT         | Hardware root of trust / secure boot         | **AC power cycle (mandatory)**  | 
 +| FPGA         | Power management and control plane           | Chassis power cycle             | 
 +| PSU          | Power supply unit firmware                   | Live, one at a time — redundant PSU carries load. Refuses to flash if redundancy is unhealthy. | 
 +| NIC / ConnectX | Onboard NIC firmware (incl. ConnectX-7/8 on Blackwell) | Chassis power cycle (run AFTER the AC cycle in this runbook) | 
 +| TPM          | Trusted Platform Module                      | Chassis power cycle (run AFTER the AC cycle in this runbook) | 
 +| NVMe         | Onboard storage firmware                     | Chassis power cycle (run AFTER the AC cycle in this runbook) |
  
-**Components updated in a full DGX/HGX firmware upgrade:**+<note important> 
 +EROT and CPLD images are staged into a secondary flash region by the update, but only activate on a full AC power removal — a warm reboot or chassis power cycle is not sufficient. 
 +</note>
  
-^  Component  ^  What It Controls  ^ +==== Driver Compatibility ====
-| VBIOS  | GPU boot, clocks, power limits, PCIe config  | +
-| NVSwitch  | GPU-to-GPU interconnect (NVLink fabric)  | +
-| EROT  | Hardware root of trust / secure boot  | +
-| FPGA  | Power management and control plane  | +
-| BMC  | Out-of-band server management  | +
-| BIOS / UEFI  | Motherboard initialisation  | +
-| PSU  | Power supply unit firmware  |+
  
-==== Why Upgrade? ====+The minimum driver depends on the **firmware package version**, not the GPU architecture alone. Refer to the **per-package release notes** shipped with the .fwpkg you intend to install. Do not rely on generic architecture → driver tables; they go stale quickly as new packages add features and ABI requirements.
  
-  * Fix GPU stability bugs or uncorrectable memory errors +==== NVSwitch Chip vs NVLink Switch Tray — Do Not Confuse ====
-  * Apply security patches to the hardware trust chain +
-  * Meet minimum firmware requirements for new driver versions +
-  * Enable features such as Confidential Computing or NVLink enhancements+
  
-==== Minimum Driver Requirements ====+  * **NVSwitch chip** — silicon on the GPU baseboard. Inventory path: ''/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVSwitch_<n>''. Updated via the compute tray BMC as part of the GPU tray package. 
 +  * **NVLink Switch tray** — separate rack-level switch chassis in GB200 NVL72 racks. Inventory path: ''/redfish/v1/UpdateService/FirmwareInventory/NVSwitch_Tray_*''. Updated via the **NVLink Switch tray BMC**, not the compute tray BMC, with its own .fwpkg.
  
-^  GPU Architecture  ^  Minimum Driver  ^ +Selecting the wrong target will return a Redfish 404 and silently skip the component. 
-| Ampere (A100, A30)  | R450  | + 
-| Hopper (H100, H200)  | R525  | +==== Time Budget (typical per-node) ==== 
-| Ada Lovelace (L40, L40S)  | R525  | + 
 +^  Phase                                ^  Estimated Time  ^ 
 +| Pre-upgrade preparation (Phase 1)     | 20 min           | 
 +BMC update + reset + verify (2.1–2.2) | 10 min           | 
 +| Motherboard tray update (2.3)         | 15 min           | 
 +| GPU tray update (2.4)                 | 25 min           | 
 +| BackgroundCopyStatus poll (2.5)       | 10–20 min        | 
 +| Chassis power cycle + recheck (2.6–2.7) | 10 min         | 
 +| CommitImage (2.8)                     | 2 min            | 
 +| AC power cycle (2.9, includes 6m drain) | 20 min         | 
 +| NIC / TPM / NVMe + chassis cycle (2.10) | 15 min         | 
 +| Validation (Phase 3)                  | 15 min           | 
 +| **Total — Hopper DGX/HGX**            | **~2h 30m**      | 
 +| GB200 NVL72 rack (add HMC, NVLink Switch tray, rack PDU cycle 10m + 15m boot) | **add 1h 15m** | 
 + 
 +Schedule a **3-hour window for Hopper**, **4-hour window for Blackwell B200**, **5-hour window for GB200 NVL72**.
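The per-phase estimates in the Time Budget table can be summed mechanically as a sanity check on the window sizes. The sketch below is illustrative only: the helper names ''sum_budget'' and ''fmt_hm'' are made up for this example, and the figures are copied from the table above (a range such as 10–20 min is written as ''10-20'').

```shell
#!/bin/sh
# sum_budget: add up per-phase estimates given in minutes.
# A plain number counts for both bounds; "lo-hi" contributes lo and hi.
sum_budget() {
  _lo=0; _hi=0
  for _p in "$@"; do
    _lo=$((_lo + ${_p%-*}))   # part before the dash (or the whole value)
    _hi=$((_hi + ${_p#*-}))   # part after the dash (or the whole value)
  done
  echo "$_lo $_hi"
}

# fmt_hm: format a number of minutes as "Xh YYm".
fmt_hm() { printf '%dh %02dm\n' "$(($1 / 60))" "$(($1 % 60))"; }

# Hopper per-node phases, in table order (Phase 1 through Phase 3).
totals=$(sum_budget 20 10 15 25 10-20 10 2 20 15 15)
lo=${totals% *}
hi=${totals#* }

echo "Hopper per-node: $(fmt_hm "$lo") to $(fmt_hm "$hi")"
# GB200 NVL72 adds 1h 15m (75 min) on top of the per-node figure.
echo "GB200 NVL72:     $(fmt_hm $((lo + 75))) to $(fmt_hm $((hi + 75)))"
```

With the table's figures this prints 2h 22m to 2h 32m for Hopper, which agrees with the ~2h 30m total and leaves roughly half an hour of margin inside the 3-hour window.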
  
 ---- ----
Line 57: Line 108:
 ==== Tool Overview ==== ==== Tool Overview ====
  
-^  Tool  ^  Purpose  ^  Where It Runs  ^ +^  Tool          ^  Purpose                                                        ^  Where It Runs           ^ 
-| **nvfwupd**  | Full firmware bundle update for DGX/HGX via BMC Redfish API  | Management host (remote) | +| **nvfwupd**    | Full firmware bundle update for DGX/HGX via BMC Redfish API     | Management host (remote) | 
-| **NVFlash**  | VBIOS-only flash for standalone PCIe GPUs  | GPU host (root required) | +| **NVFlash**    | VBIOS-only flash for standalone PCIe GPUs                       | GPU host (root required) | 
-| **nvidia-smi** | Query GPU health, firmware version, running processes  | GPU host  | +| **nvidia-smi** | Query GPU health, firmware version, running processes           | GPU host                 | 
-| **nvsm**  | DGX system-wide health check  | GPU host  | +| **nvsm**       | DGX system-wide health check                                    | GPU host                 | 
-| **ipmitool**  | Remote power cycle and BMC reset  | Management host  |+| **ipmitool**   | Remote power cycle and BMC reset                                | Management host          | 
 +| **curl**       | Redfish polling (BackgroundCopyStatus, CommitImage)             | Management host          |
  
-==== Tool Selection by Platform ====+==== nvfwupd Version Requirement ====
  
-^  Platform  ^  Tool  ^ +  * **Hopper (H100/H200) minimum: nvfwupd 2.0.4** — earlier versions do not support CommitImage or the inventory paths used below. 
-| DGX H100 / H200  | nvfwupd  | +  * **Blackwell (B200 / GB200 / GB300) minimum: nvfwupd 2.0.5** — required for HMC, Grace CPU, and NVLink Switch tray Redfish paths. Use **2.0.9** if you plan to push multiple packages in parallel from the rack management host (BMC_Full + Compute_Full). 
-| HGX H100 / H200  | nvfwupd  | +  * Confirm with: ''nvfwupd --version'' 
-| DGX A100  | Firmware Update Container | +  * Download from the NVIDIA Enterprise Support Portal.
-| Standalone PCIe GPU (A100, RTX etc) | NVFlash  |+
  
 ==== Key Commands Reference ==== ==== Key Commands Reference ====
Line 76: Line 127:
 <code bash> <code bash>
 # --- nvfwupd --- # --- nvfwupd ---
-# Show installed vs package versions +nvfwupd --version                                          # MUST be >= 2.0.4 (Hopper) or >= 2.0.5 (Blackwell) 
 + 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg> nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg>
  
-# Update all firmware components 
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y
- 
-# Update specific components only (with selector JSON) 
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y -s selector.json nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y -s selector.json
  
-# Force downgrade controls 
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable
  
-# --- NVFlash (standalone PCIe GPUs only) --- +# Blackwell / Mission Control activation (replaces the chassis power cycle step used on Hopper) 
-sudo nvflash --list                                        # List all GPUs and indices +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
-sudo nvflash --version                                     # Show VBIOS version (GPU 0) +  activate_fw -c RF_AUX_PWR_CYCLE 
-sudo nvflash --index=0 --save backup_$(date +%Y%m%d).rom  # Backup current VBIOS + 
-sudo nvflash --index=0 new_firmware.rom                    # Flash new VBIOS to GPU 0 +# --- mlxfwmanager (host OS, Hopper NIC updates) --- 
 +sudo mlxfwmanager --query 
 +sudo mlxfwmanager -u -i fw-ConnectX7-rel-<version>.bin --yes 
 + 
 +# --- NVFlash --- 
 +sudo nvflash --list 
 +sudo nvflash --version 
 +sudo nvflash --index=0 --save backup_$(date +%Y%m%d).rom 
 +sudo nvflash --index=0 new_firmware.rom
  
 # --- nvidia-smi --- # --- nvidia-smi ---
-nvidia-smi -L                                              # List all GPUs +nvidia-smi -L 
-nvidia-smi -q                                              # Full GPU info +nvidia-smi -q 
-nvidia-smi -q | grep "VBIOS Version"                      # VBIOS version only +nvidia-smi -q | grep "VBIOS Version" 
-nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv  # Active GPU processes +nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv 
-nvidia-smi nvlink -s -i 0                                  # NVLink status for GPU 0+nvidia-smi nvlink -s -i 0
  
 # --- ipmitool --- # --- ipmitool ---
-ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle  # Cold reboot +ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle 
-ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold        # BMC-only reset +ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold 
-ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info              # Confirm BMC online +ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info 
-ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> sel list             # BMC event log+ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> sel list
  
 # --- nvsm --- # --- nvsm ---
-nvsm show health                                           # System-wide health (DGX only)+nvsm show health
 </code> </code>
  
Line 117: Line 174:
 ===== 4. Prerequisites ===== ===== 4. Prerequisites =====
  
-__Complete every item below before starting. Do not proceed if any item cannot be confirmed.__+<note important> 
 +Complete every item below before starting. Do not proceed if any item cannot be confirmed. 
 +</note>
  
 ==== Access & Credentials ==== ==== Access & Credentials ====
- 
   * [ ] BMC IP address, username, and password (administrator privileges)   * [ ] BMC IP address, username, and password (administrator privileges)
   * [ ] SSH access to the target server   * [ ] SSH access to the target server
-  * [ ] Access to NVIDIA Enterprise Support Portal to download firmware packages+  * [ ] **Physical or remote-hands access for the AC power cycle step** 
 +  * [ ] Access to NVIDIA Enterprise Support Portal 
 + 
 +==== Tooling ==== 
 +  * [ ] **nvfwupd ≥ 2.0.4** installed on management host 
 +  * [ ] ipmitool, curl, jq available on management host
  
 ==== Firmware Package ==== ==== Firmware Package ====
 +  * [ ] Correct .fwpkg for your specific platform (GPU tray + motherboard tray packages)
 +  * [ ] SHA256 checksum verified against release notes
 +  * [ ] Per-package release notes reviewed for driver minimums and known issues
 +  * [ ] **Previous .fwpkg saved for rollback**
  
-  * [ ] Correct .fwpkg file downloaded for your specific GPU model +==== BMC Version Hop (pre-1.1.3 systems) ==== 
-  * [ ] SHA256 checksum verified against NVIDIA release notes +  * [ ] If current BMC firmware is **v1.0.0 or v1.1.1**, you **must** stage to **v1.1.3** first, reset BMC, and verify, before attempting **v24.09.1** (or any later bundle). The full hop is **v1.0.0 / v1.1.1 → v1.1.3 → v24.09.1**. Skipping the intermediate hop requires a **factory reset** of the BMC to recover. 
-  * [ ] Package stored on the machine that will run nvfwupd+  * [ ] Confirm current BMC version (pin to the BMC Redfish path, not a substring match): 
 +    <code bash> 
 +    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version \ 
 +      | grep -E '(^|\s)HostBMC_0\b' 
 +    </code>
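The hop rule above can be written down as a small lookup so the plan is recorded in the change ticket before any flashing. This is an illustrative sketch only: ''bmc_hop_plan'' is a hypothetical helper, and it encodes nothing beyond the versions named in this checklist.

```shell
#!/bin/sh
# bmc_hop_plan CURRENT_VERSION
# Prints the staged BMC update path per the hop rule in this runbook.
# Accepts the version with or without a leading "v".
bmc_hop_plan() {
  _v=${1#v}
  case "$_v" in
    1.0.0|1.1.1) echo "v$_v -> v1.1.3 -> v24.09.1" ;;   # intermediate hop required
    1.1.3)       echo "v$_v -> v24.09.1" ;;             # already on the staging version
    *)           echo "v$_v: not covered by the hop rule; check the package release notes" ;;
  esac
}

bmc_hop_plan v1.1.1
```

Versions the checklist does not mention fall through to the last branch on purpose, so the script never guesses a path.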
  
 ==== Environment ==== ==== Environment ====
- +  * [ ] All GPU workloads stopped 
-  * [ ] All GPU workloads stopped — training jobs, inference servers, Kubernetes GPU pods +  * [ ] ''nvidia-smi'' is not running (passive queries block VBIOS updates) 
-  * [ ] ''nvidia-smi'' is not running — even passive queries block VBIOS updates +  * [ ] Maintenance window: **3h Hopper / 4h B200 / 5h GB200 NVL72** (see Time Budget in Section 2) 
-  * [ ] Maintenance window scheduled and all stakeholders notified (minimum hours+  * [ ] VBIOS backed up via NVFlash (standalone PCIe only) 
-  * [ ] VBIOS backed up using NVFlash (standalone PCIe GPUs only)+  * [ ] **MIG disabled** on all GPUs: ''sudo nvidia-smi -mig 0'' (some packages refuse to flash with MIG instances active) 
 +  * [ ] **nvidia-fabricmanager stopped**: ''sudo systemctl stop nvidia-fabricmanager'' (DGX/HGX only; prevents dirty shutdown during chassis cycle) 
 +  * [ ] **Confidential Computing (CC) mode disabled** if previously enabled — some .fwpkg versions refuse VBIOS flash while CC is on. Check the package release notes. 
 +  * [ ] **PSU redundancy healthy**: ''nvsm show health | grep -i psu'' (all PSUs OK; PSU update will refuse if redundancy is degraded) 
 +  * [ ] **Driver minimum verified against this package's release notes**: ''nvidia-smi --query-gpu=driver_version --format=csv,noheader'' ≥ the minimum listed in the .fwpkg release notes 
 +  * [ ] **Pre-upgrade SEL captured**: ''ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> sel list > preupgrade_sel_$(date +%Y%m%d).log''
  
 ==== System State ==== ==== System State ====
- 
   * [ ] Management host can ping the BMC IP   * [ ] Management host can ping the BMC IP
-  * [ ] Server is on stable power — no UPS-only or unreliable supply +  * [ ] Server on stable power 
-  * [ ] No other firmware operations are running on the same server +  * [ ] No other firmware operations running on the same server 
-  * [ ] Server reports healthy — confirmed via ''nvsm show health'' or BMC dashboard+  * [ ] ''nvsm show health'' clean
  
 ==== Rollback Readiness ==== ==== Rollback Readiness ====
- +  * [ ] Current firmware versions recorded in change ticket 
-  * [ ] Current firmware versions recorded in your change ticket +  * [ ] Previous firmware .fwpkg saved
-  * [ ] Previous firmware .fwpkg file saved or accessible+
   * [ ] Rollback procedure reviewed (Section 9)   * [ ] Rollback procedure reviewed (Section 9)
  
Line 155: Line 229:
 ===== 5. Best Practices ===== ===== 5. Best Practices =====
  
-^  #  ^  Practice  ^  Why It Matters  ^ +^  #  ^  Practice                                       ^  Why It Matters                                               ^ 
-| 1  | Always run ''show_version'' first  | Confirms exactly what will change before any flashing begins  | +| 1   | Always run ''show_version'' first                | Confirms exactly what will change                             | 
-| 2  | Back up VBIOS before flashing  | No backup = no rollback path for standalone GPU upgrades  | +| 2   | Back up VBIOS before flashing                    | No backup = no rollback path                                  | 
-| 3  | Stop all GPU processes including ''nvidia-smi''  | Active processes block or corrupt the VBIOS update  | +| 3   | Stop all GPU processes including ''nvidia-smi''  | Active processes block or corrupt VBIOS updates               | 
-| 4  | Never interrupt mid-flash  | Closing terminal or losing power mid-flash bricks the GPU  | +| 4   | Never interrupt mid-flash                        | Closing terminal or losing power bricks the GPU               | 
-| 5  | Verify the package checksum  | A corrupted .fwpkg silently fails or causes a partial flash  | +| 5   | Verify the package checksum                      | A corrupted .fwpkg silently fails                             | 
-| 6  | Wait for BackgroundCopyStatus: Completed  | Rebooting early means the firmware never actually activates  | +| 6   | Wait for BackgroundCopyStatus: Completed         | Rebooting early means firmware never activates                | 
-| 7  | Use a wired management network  | VPN or WiFi drops mid-update leave the flash incomplete  | +| 7   | **Always AC-cycle after CommitImage**            | EROT and CPLD only activate on full AC removal                | 
-| 8  | Disable ForceUpdate immediately after downgrade  | Leaving it enabled is a security risk  | +| 8   | Use a wired management network                   | VPN or WiFi drops leave flash incomplete                      | 
-| 9  | Validate before restoring workloads  | Confirms the upgrade succeeded before returning node to use  | +| 9   | Disable ForceUpdate immediately after downgrade  | Leaving it enabled is a security risk                         | 
-| 10  | Document pre and post versions in change ticket  | Required for change management and future rollback reference  | +| 10  | Update BMC first, everything else after          | Subsequent components depend on the new BMC Redfish schema    | 
 +| 11  | Validate before restoring workloads              | Confirms upgrade succeeded                                    | 
 +| 12  | Document pre and post versions                   | Required for change management                                |
  
 ---- ----
Line 174: Line 250:
  
 <code bash> <code bash>
-# DGX / HGX systems +# Pin the full pre-upgrade output to a file for the change ticket 
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version \ 
 +  | tee preupgrade_versions_$(date +%Y%m%d_%H%M).txt
  
 # Standalone PCIe GPU # Standalone PCIe GPU
-nvidia-smi -q | grep "VBIOS Version"+nvidia-smi -q | grep "VBIOS Version" | tee preupgrade_vbios_$(date +%Y%m%d).txt
 </code> </code>
  
-Copy the full output to your change ticket. Fill in the table:+Copy full output to your change ticket and fill in:
  
-^  Component  ^  Pre-Upgrade Version  ^  Target Version  ^ +^  Component   ^  Pre-Upgrade Version  ^  Target Version  ^ 
-| VBIOS  |  |  | +| BMC          |                       |                  | 
-| NVSwitch FW  |  |  | +| SBIOS        |                       |                  | 
-| EROT FW  |  |  | +| CPLD         |                       |                  | 
-| FPGA FW  |  |  | +| VBIOS        |                       |                  | 
-| BMC  |  |  | +| NVSwitch FW  |                       |                  | 
-| System BIOS  |  |  | +| EROT FW      |                       |                  | 
 +| FPGA FW      |                       |                  | 
 +| PSU          |                       |                  | 
 +| NIC          |                       |                  | 
 +| TPM          |                       |                  | 
 +| NVMe         |                       |                  |
  
 ==== Step 1.2 — Verify GPU Health ==== ==== Step 1.2 — Verify GPU Health ====
  
 <code bash> <code bash>
-nvidia-smi -L       # All expected GPUs must appear +nvidia-smi -L 
-nvsm show health    # All components must show OK  (DGX only)+nvsm show health
 </code> </code>
- 
-__If any GPU is missing or any component is degraded — **stop**. Investigate before proceeding. __ 
  
 ==== Step 1.3 — Stop All GPU Workloads ==== ==== Step 1.3 — Stop All GPU Workloads ====
- 
-Stop all of the following before continuing: 
- 
-  * AI/ML training jobs, inference servers, GPU-accelerated applications 
-  * Any monitoring script or agent that runs ''nvidia-smi'' on a schedule 
-  * Kubernetes GPU pods — drain the node first: 
  
 <code bash> <code bash>
 kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
 +nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
 +# Output must list no processes (with --format=csv only the header row remains)
 </code> </code>
  
-Confirm no processes remain:+==== Step 1.3b — Disable MIG and Stop Fabric Manager ====
  
 <code bash> <code bash>
-nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv +# Disable MIG on all GPUs (no-op if already disabled) 
-# Output must be completely empty before continuing +sudo nvidia-smi -mig 0
-</code>+
  
-==== Step 1.4 — Backup VBIOS (Standalone PCIe GPUs Only) ==== +# Stop fabric manager (DGX/HGX/Blackwell) — prevents a dirty shutdown 
 +# when the chassis is power-cycled in Phase 2. 
 +sudo systemctl stop nvidia-fabricmanager 
 +systemctl is-active nvidia-fabricmanager   # must report "inactive"
  
-__Skip this step for DGX/HGX systems managed by nvfwupd.__+# Confidential Computing — disable if currently enabled and the package 
 +# release notes require it. (Skip if your environment never uses CC mode.) 
 +# nvidia-smi conf-compute -srs 0 
 +</code> 
 + 
 +==== Step 1.4 — Backup VBIOS (Standalone PCIe Only) ====
  
 <code bash> <code bash>
 sudo nvflash --index=0 --save gpu0_backup_$(date +%Y%m%d).rom sudo nvflash --index=0 --save gpu0_backup_$(date +%Y%m%d).rom
 </code> </code>
- 
-Store the .rom file on a separate host. This is your only rollback option if the flash fails. 
  
 ==== Step 1.5 — Verify Package Integrity ==== ==== Step 1.5 — Verify Package Integrity ====
Line 234: Line 315:
 sha256sum nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg sha256sum nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg
 </code> </code>
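Comparing the digest by eye is error-prone, so the comparison can be scripted. A minimal sketch, assuming the expected SHA256 value has been copied by hand from the release notes; ''verify_sha256'' is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when the file's SHA256 digest equals EXPECTED_HEX.
# The expected value is lower-cased first so pasted upper-case hex still matches.
verify_sha256() {
  _actual=$(sha256sum "$1" | awk '{print $1}')
  _expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  if [ "$_actual" = "$_expected" ]; then
    echo "OK $1"
  else
    echo "MISMATCH $1: got $_actual" >&2
    return 1
  fi
}

# Example wiring (not executed here); <checksum-from-release-notes> is a placeholder:
#   verify_sha256 nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg <checksum-from-release-notes> || exit 1
```

Returning a non-zero status on mismatch lets a wrapper script stop before any ''update_fw'' call is reached.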
- 
-Compare the output against the checksum in the NVIDIA release notes. If they do not match — re-download the package before proceeding. 
  
 ==== Step 1.6 — Preview What Will Change ==== ==== Step 1.6 — Preview What Will Change ====
Line 242: Line 321:
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg> nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg>
 </code> </code>
- 
-Every component with a version mismatch will be updated. Confirm this matches your intent before proceeding. 
  
 ---- ----
Line 249: Line 326:
 ===== 7. Phase 2 — Firmware Upgrade Execution ===== ===== 7. Phase 2 — Firmware Upgrade Execution =====
  
-__ **Do not** close your terminal, disconnect from the BMC, or allow the server to lose power once flashing begins. Any interruption mid-flash can permanently damage hardware.__ +<note warning> 
 +**Strict ordering — do not deviate.** BMC first, then motherboard tray, then GPU tray, then poll, then chassis power cycle, then CommitImage, then AC cycle, then NIC/TPM/NVMe. 
 +</note>
  
 ==== 7.1 DGX / HGX — nvfwupd ==== ==== 7.1 DGX / HGX — nvfwupd ====
  
-=== Step 2.1 — Flash the GPU Tray ===+=== Step 2.1 — Update BMC ===
  
-Updates VBIOS, NVSwitch, EROT, and FPGA firmware for all GPUs in the system.+Use the documented one-shot BMC selector. Do not hand-enumerate child paths — let nvfwupd resolve them from the package.
  
 <code bash> <code bash>
 +cat > selector_bmc.json <<'EOF'
 +{
 +  "Targets": [
 +    "/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"
 +  ]
 +}
 +EOF
 +
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \ nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
-  update_fw -p nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg -y+  update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_bmc.json
 </code> </code>
  
-To update specific components only, create a selector JSON and pass it with ''-s'': +=== Step 2.2 — Reset the BMC ===
  
-<code json> +<code bash> 
-{ "Components": ["VBIOS"] } +ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold 
 +sleep 180 
 +ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info 
 + 
 +# Pin the recheck to the BMC Redfish path (not a substring match on "bmc"
 +# which also matches HGX_FW_BMC_* aggregate entries on some packages). 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version \ 
 +  | grep -E '(^|\s)HostBMC_0\b'
 </code> </code>
 +
 +Confirm the BMC reports the new version before continuing.
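A fixed ''sleep 180'' is simple, but a BMC can take more or less time than that to come back. One possible alternative, sketched under assumptions: after an initial wait, retry ''mc info'' until it answers. The ''retry'' helper and the variable names ''BMC_IP'', ''BMC_USER'', and ''BMC_PASS'' are hypothetical.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  _attempts=$1; _delay=$2; shift 2
  _n=1
  while [ "$_n" -le "$_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $_n attempt(s)"
      return 0
    fi
    _n=$((_n + 1))
    sleep "$_delay"
  done
  echo "not ready after $_attempts attempts" >&2
  return 1
}

# Example wiring (not executed here): keep a short initial wait so the probe
# does not hit the BMC before the reset has actually started.
#   sleep 60
#   retry 20 15 ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" mc info
```

The version recheck with ''show_version'' still has to follow; a responding BMC only shows that it is reachable, not that it runs the new firmware.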
 +
 +=== Step 2.3 — Update Motherboard Tray (SBIOS, CPLD, PSU) ===
 +
 +The motherboard tray uses the **empty selector** (''{}''), which tells nvfwupd to apply every motherboard-side component the package declares. Do not hand-enumerate ''HostBIOS_0'' / ''CPLDMB_0'' / ''PSU_*'' — the path names vary by SKU and package, and a typo silently skips the component.
  
 <code bash> <code bash>
 +cat > selector_mb.json <<'EOF'
 +{}
 +EOF
 +
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \ nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
-  update_fw -p <pkg.fwpkg> -y -s selector.json +  update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_mb.json
 </code> </code>
  
-=== Step 2.2 — Wait for Background Copy to Complete ===+Cross-check beforehand with ''show_version -p <pkg.fwpkg>'' to confirm which components the package will touch.
  
-The firmware is written to the GPU's flash storage in the background. **You must wait for this before rebooting.**+=== Step 2.4 — Update GPU Tray (VBIOS, NVSwitch, EROT, FPGA) === 
 + 
 +The GPU tray uses the **HGX_0 one-shot selector**, which targets the entire GPU baseboard inventory in a single call. The BMC fans out internally to all GPUs, NVSwitch chips, EROT, and FPGA — do not enumerate per-GPU paths.
  
 <code bash> <code bash>
-curl -k -u <USER>:<PASS> \ +cat > selector_gpu.json <<'EOF' 
-  https://<BMC_IP>/redfish/v1/Systems/HGX_Baseboard_0/Oem/Nvidia/BackgroundCopyStatus +{ 
 +  "Targets": [ 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_0" 
 +  ] 
 +} 
 +EOF 
 + 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \ 
 +  update_fw -p nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg -y -s selector_gpu.json
 </code> </code>
  
-Poll every 60 seconds. Only proceed when you see: +Verify the exact child paths your platform exposes with ''nvfwupd ... show_version'' on a real DGX H100/H200 before relying on inventory names elsewhere in this runbook — they can differ between H100 and H200 chassis SKUs.
  
-<code -> +=== Step 2.5 — Wait for Background Copy to Complete === 
-"BackgroundCopyStatus": "Completed" + 
 +<code bash> 
 +curl -k -u <USER>:<PASS> \ 
 +  https://<BMC_IP>/redfish/v1/Chassis/HGX_ERoT_BMC_0
 </code> </code>
  
-If you see ''Failed'' — do **not** reboot. See the Troubleshooting section. +Poll every 60 seconds until ''BackgroundCopyStatus'' reports ''Completed''. If ''Failed'' — do **not** proceed; see Troubleshooting.
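The 60-second poll can be scripted so that ''Completed'' and ''Failed'' are handled differently and the loop cannot run forever. The sketch below makes several assumptions: ''extract_bg_status'', ''poll_bg_copy'', and ''fetch_erot'' are hypothetical names, the property is assumed to appear in the response body as a plain ''"BackgroundCopyStatus": "<value>"'' pair, and any value other than ''Completed'' or ''Failed'' is treated as still in progress.

```shell
#!/bin/sh
# extract_bg_status: read a Redfish JSON body on stdin and print the first
# BackgroundCopyStatus value found (grep/sed only, so jq is not needed).
extract_bg_status() {
  grep -o '"BackgroundCopyStatus"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*:[[:space:]]*"\([^"]*\)"/\1/'
}

# poll_bg_copy FETCH_CMD [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# FETCH_CMD is a command or function that prints the JSON body.
# Exit status: 0 = Completed, 1 = Failed, 2 = gave up after MAX_ATTEMPTS.
poll_bg_copy() {
  _fetch=$1; _interval=${2:-60}; _max=${3:-30}
  _i=1
  while [ "$_i" -le "$_max" ]; do
    _status=$("$_fetch" | extract_bg_status)
    case "$_status" in
      Completed) echo "Completed"; return 0 ;;
      Failed)    echo "Failed";    return 1 ;;
    esac
    _i=$((_i + 1))
    sleep "$_interval"
  done
  echo "Timeout"
  return 2
}

# Example fetcher using the endpoint from this step (defined, not invoked).
# BMC_IP, BMC_USER, and BMC_PASS are placeholder variable names.
fetch_erot() {
  curl -sk -u "$BMC_USER:$BMC_PASS" \
    "https://$BMC_IP/redfish/v1/Chassis/HGX_ERoT_BMC_0"
}
# poll_bg_copy fetch_erot 60 30 || exit 1
```

Passing the fetcher in as a command keeps the loop testable offline and lets the same loop poll a different Redfish path if the platform exposes the status elsewhere.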
  
-=== Step 2.3 — Cold Reboot the Server === +=== Step 2.6 — Chassis Power Cycle ===
  
 <code bash> <code bash>
Line 298: Line 415:
 Wait 5–10 minutes for the server to fully boot. Wait 5–10 minutes for the server to fully boot.
  
-=== Step 2.4 — Flash the Motherboard Tray ===+=== Step 2.7 — Recheck Versions ===
  
-After the server is back online, update BMC, BIOS, and PSU firmware:+<code bash> 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version 
 +</code> 
 + 
 +Confirm staged versions are visible on BMC, SBIOS, and GPU tray components. 
 + 
 +=== Step 2.8 — CommitImage (BMC and SBIOS) === 
 + 
 +CommitImage promotes the staged image to the active slot. **Required for BMC and SBIOS.**
  
 <code bash> <code bash>
 +# Commit BMC
 +curl -k -u <USER>:<PASS> -X POST \
 +  -H "Content-Type: application/json" \
 +  -d '{"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]}' \
 +  https://<BMC_IP>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage
 +
 +# Commit SBIOS
 +curl -k -u <USER>:<PASS> -X POST \
 +  -H "Content-Type: application/json" \
 +  -d '{"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HostBIOS_0"]}' \
 +  https://<BMC_IP>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage
 +</code>
 +
 +=== Step 2.9 — AC Power Cycle (MANDATORY) ===
 +
 +<note warning>
 +This step is **non-negotiable** for EROT and CPLD activation. A chassis power cycle is **not** sufficient.
 +</note>
 +
 +  - Power off the host via OS shutdown or ''ipmitool chassis power off''.
 +  - **Physically remove all AC power cords** (or open the PDU breakers feeding the chassis).
 +  - **Wait 5–6 minutes** to allow standby rails to fully discharge.
 +  - Reconnect AC power and power the server back on.
 +  - Wait 5–10 minutes for full boot.
 +
 +=== Step 2.10 — Update NIC (mlxfwmanager), TPM, NVMe ===
 +
 +These components are independent of the GPU/EROT activation chain and are updated last.
 +
 +**Hopper (DGX/HGX H100/H200) — NICs via host-OS mlxfwmanager, not Redfish.** On Hopper the ConnectX adapters are not exposed under the HGX BMC's firmware inventory; the BMC will return 404 / "no such target" for ''HGX_FW_NIC_*'' paths. Update them from the host OS instead:
 +
 +<code bash>
 +# Inventory and current versions on the host OS
 +sudo mlxfwmanager --query
 +
 +# Apply the bundled MFT firmware image (matches your ConnectX SKU)
 +sudo mlxfwmanager -u -i fw-ConnectX7-rel-<version>.bin --yes
 +
 +# Re-query to confirm the new FW Version is present and "Status: Done"
 +sudo mlxfwmanager --query
 +</code>
 +
 +A subsequent chassis power cycle (or PCIe FLR via ''mlxfwreset'') is required to activate the new NIC firmware — do this **after** the AC cycle in Step 2.9.
 +
 +TPM and NVMe remain BMC-driven on Hopper. Confirm the exact inventory paths on your chassis with ''nvfwupd ... show_version'' before flashing — names like ''TPM_0'' / ''NVMe_0'' (without the ''HGX_FW_'' prefix) appear on some packages.
 +
 +<code bash>
 +# After confirming the actual inventory names on your chassis:
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
 +  show_version | grep -Ei '(tpm|nvme)'
 +
 +# Then build a selector_tpm_nvme.json with the paths returned above and:
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \ nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
-  update_fw -p nvfw_DGX_<version>.fwpkg -y -s update_bmc.json+  update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_tpm_nvme.json
 </code> </code>
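
Building the selector file by hand is error-prone, so the step above can be sketched with a small generator. This is a minimal sketch: ''make_selector'' is a hypothetical helper, and the inventory names passed to it (''TPM_0'', ''NVMe_0'' in the usage line) are only the example names mentioned above and must be replaced with whatever ''show_version'' reports on your chassis.

```shell
#!/usr/bin/env bash
# Sketch only: make_selector is a hypothetical helper, not part of nvfwupd.

# Print a selector JSON document whose "Targets" list holds one
# FirmwareInventory path per inventory name given as an argument.
make_selector() {
  local prefix="/redfish/v1/UpdateService/FirmwareInventory"
  local sep="" name
  printf '{\n  "Targets": ['
  for name in "$@"; do
    printf '%s\n    "%s/%s"' "$sep" "$prefix" "$name"
    sep=","   # every entry after the first is preceded by a comma
  done
  printf '\n  ]\n}\n'
}
```

Example use, assuming those two names were confirmed first: ''make_selector TPM_0 NVMe_0 > selector_tpm_nvme.json''. Generating the file from confirmed names avoids hand-typed path errors; the helper does not validate that a name exists in the BMC inventory.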
  
-__The motherboard tray uses a different .fwpkg file than the GPU tray. Check the NVIDIA release notes for the correct package name for your system. __+**Blackwell (B200 / GB200) — also update onboard ConnectX-7/8 NICs:**
  
-=== Step 2.5 — Reset the BMC ===+<code bash> 
 +cat > selector_periph_bw.json <<'EOF' 
 +{ 
 +  "Targets": [ 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_ConnectX_0", 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_ConnectX_1", 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_ConnectX_2", 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_ConnectX_3", 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_TPM_0", 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVMe_0" 
 +  ] 
 +} 
 +EOF
  
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \
 +  update_fw -p nvfw_GB200_<version>.fwpkg -y -s selector_periph_bw.json
 +</code>
 +
 +Then power-cycle the host:
 <code bash> <code bash>
-ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold+ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle
 </code> </code>
  
-Wait 2–3 minutes, then confirm the BMC is back online:+==== 7.1B Blackwell — GB200 / B200 (Mission Control flow) ==== 
 + 
 +<note important> 
 +This subsection summarises the **NVIDIA Mission Control rack-firmware document** that ships with the Blackwell bundle. Cross-check every selector path and package P-number against the Mission Control doc for your bundle version before executing. The component path names below are the documented Mission Control names — they are not the same as the Hopper ''HGX_FW_*'' paths and must not be cross-pollinated. 
 +</note> 
 + 
 +Blackwell delivers two top-level packages plus the NVLink Switch tray package: 
 + 
 +^  Package name                    ^  Selector JSON       ^  P-number examples         ^ 
 +| ''nvfw_BMC_<P-number>.fwpkg''    | ''BMC_Full.json''    | P4972 (baseline), P4975 (GA refresh), P4978 (current) | 
 +| ''nvfw_Compute_<P-number>.fwpkg''| ''Compute_Full.json''| matched P-number to BMC package                       | 
 +| ''nvfw_NVLSwitch_<P-number>.fwpkg'' | ''NVLSwitch_Full.json'' | (rack-level NVLink Switch tray)                | 
 + 
 +**Ordering (per Mission Control):** 
 + 
 +  - **B1.** Push ''BMC_Full'' to the **chassis BMC**. 
 +  - **B2.** ''activate_fw -c RF_AUX_PWR_CYCLE'' on the chassis BMC. 
 +  - **B3.** **AC power cycle between BMC and HMC updates** (rack PDU cycle for NVL72; chassis cord pull for standalone B200). Wait 5 min for standby drain. This is required so the new chassis BMC re-enumerates the HMC under the updated Redfish schema before Compute_Full is pushed. 
 +  - **B4.** Push ''Compute_Full'' (HMC + Grace CPU FW + GPU/NVSwitch/EROT/FPGA on the compute tray). 
 +  - **B5.** ''activate_fw -c RF_AUX_PWR_CYCLE'' for the Compute_Full payload. 
 +  - **B6.** Push the NVLink Switch tray package from the rack-level management host against the NVLink Switch tray BMC (not the compute tray BMC). 
 +  - **B7.** Final **rack-level AC cycle** (Step 2.9 / 2.9b) to activate EROT and CPLD on every tray. 
 + 
 +=== Step B1 — Push BMC_Full to the chassis BMC === 
 + 
 +The Mission Control selector JSON is a one-shot; do not enumerate child components.
  
 <code bash> <code bash>
-ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info+cat > BMC_Full.json <<'EOF' 
 +{ 
 +  "Targets": [ 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0" 
 +  ] 
 +} 
 +EOF 
 + 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
 +  update_fw -p nvfw_BMC_P4978.fwpkg -y -s BMC_Full.json
 </code> </code>
  
-==== 7.2 Standalone PCIe GPU — NVFlash ====+=== Step B2 — Activate the chassis BMC payload ===
  
-=== Step 2.6 — Unload the NVIDIA Driver ===+Blackwell uses the documented Mission Control activation verb, not ''ipmitool mc reset cold'':
  
 <code bash> <code bash>
-sudo rmmod nvidia_uvm +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
-sudo rmmod nvidia_drm +  activate_fw -c RF_AUX_PWR_CYCLE 
-sudo rmmod nvidia_modeset +</code>
-sudo rmmod nvidia+
  
-Confirm all modules are unloaded (output must be empty)+=== Step B3 — AC cycle between BMC and HMC updates === 
 + 
 +This is **not** the final EROT/CPLD AC cycle — it sits between the two halves of the bundle and is mandatory. 
 + 
 +  - Standalone B200: chassis cord pull, 5 min drain. 
 +  - GB200 NVL72: rack-level PDU cycle (both A-side and B-side feeds), 10 min drain, 15 min boot. 
 +  - Confirm the chassis BMC comes back on the new firmware before proceeding: ''nvfwupd ... show_version | grep -E '(^|\s)HostBMC_0\b'''
 + 
 +=== Step B4 — Push Compute_Full === 
 + 
 +''Compute_Full'' bundles HMC, Grace CPU FW, GPU/NVSwitch/EROT/FPGA for the compute tray. 
 + 
 +<code bash> 
 +cat > Compute_Full.json <<'EOF' 
 +{ 
 +  "Targets": [ 
 +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_0" 
 +  ] 
 +} 
 +EOF 
 + 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
 +  update_fw -p nvfw_Compute_P4978.fwpkg -y -s Compute_Full.json 
 +</code> 
 + 
 +=== Step B5 — Activate Compute_Full === 
 + 
 +<code bash> 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
 +  activate_fw -c RF_AUX_PWR_CYCLE 
 +</code> 
 + 
 +=== Step B6 — NVLink Switch tray (GB200 NVL72) === 
 + 
 +Run from the rack-level management host against the NVLink Switch tray BMC. Confirm the exact tray-side selector path names from the Mission Control NVLink-Switch addendum for your bundle — they have changed between P-numbers. 
 + 
 +<code bash> 
 +nvfwupd -t ip=<NVL_SW_BMC_IP> user=<USER> password=<PASS> servertype=NVL_SWITCH \ 
 +  update_fw -p nvfw_NVLSwitch_P4978.fwpkg -y -s NVLSwitch_Full.json 
 + 
 +nvfwupd -t ip=<NVL_SW_BMC_IP> user=<USER> password=<PASS> servertype=NVL_SWITCH \ 
 +  activate_fw -c RF_AUX_PWR_CYCLE 
 +</code> 
 + 
 +=== Step B7 — Final rack-level AC cycle (GB200 NVL72) === 
 + 
 +For GB200 NVL72 the final AC step is a **rack-level PDU cycle**, not a single chassis cord pull. Coordinate with DC operations: 
 +  - Schedule a full rack power-down window. 
 +  - Open both A-side and B-side PDU feeds (the rack is fed redundantly — pulling only one side leaves standby rails energised). 
 +  - Wait **10 minutes** (longer than the 5–6 min for standalone DGX due to BBU/standby capacitance). 
 +  - Restore power; allow 15 minutes for the full rack to come up before validation. 
 + 
 +==== 7.2 Standalone PCIe GPU — NVFlash ==== 
 + 
 +=== Step 2.11 — Unload the NVIDIA Driver === 
 + 
 +<code bash> 
 +sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
 lsmod | grep nvidia lsmod | grep nvidia
 </code> </code>
  
-=== Step 2.7 — Flash the VBIOS ===+=== Step 2.12 — Flash the VBIOS ===
  
 <code bash> <code bash>
-# Flash GPU at index 0 — increment --index for additional GPUs (1, 2, ...) 
 sudo nvflash --index=0 new_vbios_firmware.rom sudo nvflash --index=0 new_vbios_firmware.rom
-# NVFlash will prompt — review current vs new version shown, then type "yes" 
 </code> </code>
  
-=== Step 2.8 — Reboot ===+=== Step 2.13 — Reboot ===
  
 <code bash> <code bash>
Line 356: Line 638:
  
 <code bash> <code bash>
-# DGX / HGX 
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version
- 
-# Standalone PCIe 
 nvidia-smi -q | grep "VBIOS Version" nvidia-smi -q | grep "VBIOS Version"
 </code> </code>
  
-Every updated component must now show the new version.+Every component must report the new version. Pay particular attention to **EROT** and **CPLD** — if either still reports the old version, the AC power cycle was not effective; repeat Step 2.9.
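
The check above can be sketched as a comparison of two version snapshots. This is a minimal sketch under a stated assumption: the operator has reduced the pre- and post-upgrade ''show_version'' output to plain text files with one ''<component> <version>'' pair per line (that file format is an assumption of this sketch, not the native ''nvfwupd'' output). ''unchanged_components'' is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: assumes hand-prepared snapshot files with one
# "<component> <version>" pair per line. Hypothetical helper.

# Print every component whose version is identical in both snapshots,
# i.e. components that did not pick up the new firmware.
unchanged_components() {
  local pre="$1" post="$2"
  awk 'NR == FNR { before[$1] = $2; next }
       ($1 in before) && before[$1] == $2 { print $1 }' "$pre" "$post"
}
```

Example use: ''unchanged_components pre_versions.txt post_versions.txt''. Any EROT or CPLD name in the output suggests the AC power cycle should be repeated; components that were intentionally not updated will also be listed and can be ignored.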
  
 ==== Step 3.2 — Verify GPU Detection ==== ==== Step 3.2 — Verify GPU Detection ====
Line 370: Line 649:
 nvidia-smi -L nvidia-smi -L
 </code> </code>
- 
-All expected GPUs must appear with correct model names. No ''Unknown'' or error entries. 
  
 ==== Step 3.3 — Run Health Check ==== ==== Step 3.3 — Run Health Check ====
  
 <code bash> <code bash>
-nvidia-smi -q        # Check per-GPU: VBIOS version, ECC mode, temperature, power draw +nvidia-smi -q 
-nvsm show health     # DGX only — all components must show OK+nvsm show health
 </code> </code>
  
-^  Field  ^  Expected  ^ +==== Step 3.4 — Check NVLink ====
-| VBIOS Version  | Matches new package version  | +
-| ECC Mode  | Enabled  | +
-| Temperature  | < 50°C at idle  | +
-| Power Draw  | Within TDP spec  | +
- +
-==== Step 3.4 — Check NVLink (DGX / HGX Only) ====+
  
 <code bash> <code bash>
 nvidia-smi nvlink -s -i 0 nvidia-smi nvlink -s -i 0
 </code> </code>
- 
-All NVLink ports must show **Active**. Any inactive port must be investigated before restoring workloads. 
  
 ==== Step 3.5 — Smoke Test ==== ==== Step 3.5 — Smoke Test ====
Line 399: Line 668:
 python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))" python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
 </code> </code>
- 
-Expected: ''True'' and the correct GPU model name. 
  
 ==== Step 3.6 — Record Final State ==== ==== Step 3.6 — Record Final State ====
  
-^  Component  ^  Pre-Upgrade  ^  Post-Upgrade  ^  Status  ^ +^  Component   ^  Pre-Upgrade  ^  Post-Upgrade  ^  Status       ^
-| VBIOS  |  |  | Pass / Fail  | +| BMC          |               |                | Pass / Fail  | 
-| NVSwitch FW  |  |  | Pass / Fail  | +| SBIOS        |               |                | Pass / Fail  | 
-| EROT FW  |  |  | Pass / Fail  | +| CPLD         |               |                | Pass / Fail  | 
-| FPGA FW  |  |  | Pass / Fail  | +| VBIOS        |               |                | Pass / Fail  | 
-| BMC  |  |  | Pass / Fail  | +| NVSwitch FW  |               |                | Pass / Fail  | 
-| System BIOS  |  |  | Pass / Fail  | +| EROT FW      |               |                | Pass / Fail  | 
 +| FPGA FW      |               |                | Pass / Fail  | 
 +| PSU          |               |                | Pass / Fail  | 
 +| NIC          |               |                | Pass / Fail  | 
 +| TPM          |               |                | Pass / Fail  | 
 +| NVMe         |               |                | Pass / Fail  |
  
-==== Step 3.7 — Restore Workloads ==== +==== Step 3.7 — Capture Post-Upgrade SEL and Restart Services ====
- +
-Only proceed after all checks above pass:+
  
 <code bash> <code bash>
-kubectl uncordon <node-name> +# Diff against the pre-upgrade SEL captured in Section 4 
-</code>+ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> sel list \ 
 +  > postupgrade_sel_$(date +%Y%m%d).log 
 +diff preupgrade_sel_*.log postupgrade_sel_*.log
  
-  * Restart inference services and training jobs +# Restart fabric manager and confirm healthy 
-  * Re-enable GPU monitoring scripts+sudo systemctl start nvidia-fabricmanager 
 +sudo systemctl status nvidia-fabricmanager   # must be active (running)
  
-Monitor the system for 30–60 minutes after restoring workloads.+# Re-enable MIG only if your workload requires it 
 +# sudo nvidia-smi -mig 1 
 +</code>
  
-----+Investigate any new SEL entries (assertion events, thermal trips, fan failures) before restoring workloads.
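
To review only the entries that appeared during the upgrade, the SEL comparison above can be sketched as a filter. This is a minimal sketch: ''new_sel_entries'' is a hypothetical helper, and it assumes the SEL was not cleared between the two captures, so that unchanged records appear as byte-identical lines in both logs.

```shell
#!/usr/bin/env bash
# Sketch only: new_sel_entries is a hypothetical helper. It assumes the SEL
# was not cleared between captures, so old records repeat verbatim.

# Print lines present in the post-upgrade log but absent from the
# pre-upgrade log. Prints nothing (and still returns 0) if none are new.
new_sel_entries() {
  local pre="$1" post="$2"
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -Fxv -f "$pre" "$post" || true
}
```

Example use with explicit file names: ''new_sel_entries preupgrade_sel_20260508.log postupgrade_sel_20260512.log'' (the dates are placeholders). Naming both files explicitly avoids the case where a ''*.log'' glob expands to more than one file per side.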
  
-===== 9. Rollback / Downgrade ===== +==== Step 3.8 — Restore Workloads ====
- +
-__Only downgrade if the new firmware causes a confirmed regression. Raise an NVIDIA Support case before attempting. __ +
- +
-==== 9.1 Standalone PCIe GPUs (NVFlash) ====+
  
 <code bash> <code bash>
-# Unload driver +kubectl uncordon <node-name>
-sudo rmmod nvidia_uvm +
-sudo rmmod nvidia_drm +
-sudo rmmod nvidia_modeset +
-sudo rmmod nvidia +
- +
-# Flash the saved backup +
-sudo nvflash --index=0 gpu0_backup_<date>.rom +
- +
-# Reboot +
-sudo reboot+
 </code> </code>
  
-Then re-run all Phase 3 validation steps.+Monitor for 30–60 minutes.
  
-==== 9.2 DGX / HGX Systems (nvfwupd) ====+----
  
-<code bash> +===== 9. Rollback / Downgrade =====
-# Step 1 — Enable ForceUpdate flag (required to allow downgrade) +
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable+
  
-# Step 2 — Confirm flag is active +Same flow as the forward upgrade, but with the previous .fwpkg and ''force_update enable'' set first. Disable ''force_update'' immediately after. EROT/CPLD downgrades also require the AC cycle.
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status+
  
-# Step 3 — Flash the previous firmware version +----
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \+
-  update_fw -p nvfw_DGX-HGX-H100-H200x8_<old_version>.fwpkg -y+
  
-# Step 4 — Wait for BackgroundCopyStatus: Completed, then cold reboot (same as Steps 2.2–2.3)+===== 10. Troubleshooting =====
  
-# Step 5 — MANDATORY: Disable ForceUpdate immediately after +^  Symptom                                  ^  Cause                                      ^  Action  ^
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable+| ''Invalid firmware file''                 | Wrong .fwpkg                                | Re-check ''show_version -p'' against the component  |
 +| ''servertype not recognised'' / Redfish 404 | nvfwupd defaulting to wrong server type   | Add ''servertype=DGX'' (or ''HGX'') to the ''-t'' target string | 
 +| VBIOS update blocked — GPU activity       | Process still using the GPU                 | Kill PIDs from ''nvidia-smi --query-compute-apps''; drain K8s node  | 
 +| BackgroundCopyStatus: Failed              | Wrong firmware file or transient BMC error  | Do **not** reboot. Retry. Contact NVIDIA Support if the retry fails  | 
 +| EROT/CPLD version unchanged after reboot  | AC power cycle skipped or too short          | Repeat Step 2.9: full AC removal, 5–6 minute wait  | 
 +| Server does not boot after AC cycle       | Failed flash                                | BMC serial console + ''ipmitool sel list''; contact NVIDIA Support  | 
 +| GPU missing after upgrade                 | PCIe or driver issue                        | ''lspci'', ''dmesg'' for Xid; reload driver  | 
 +| NVLink ports Inactive                     | NVSwitch FW or fabric manager               | Cold reboot; restart ''nvidia-fabricmanager''  | 
 +| Firmware unchanged after reboot           | Rebooted before background copy finished    | Re-run, wait for ''Completed''  | 
 +| ''Downgrade not allowed''                 | ForceUpdate flag not enabled                | See Section 9  | 
 +| BMC unreachable after a version-hop skip  | Skipped the v1.1.3 intermediate hop (v1.0.0/v1.1.1 → v24.09.1 direct) | Factory-reset the BMC, then re-stage v1.0.0/v1.1.1 → v1.1.3 → v24.09.1; see "BMC Recovery" below  | 
 +| PSU update refuses to start               | PSU redundancy degraded                     | ''%%nvsm show health | grep -i psu%%''; replace failed PSU before retrying  | 
 +| Fabric Manager fails to start post-upgrade | NVSwitch FW / driver mismatch              | Confirm driver ≥ release notes min; ''journalctl -u nvidia-fabricmanager''; reload ''nvidia'' module  | 
 +| New SEL entries after upgrade             | Thermal/fan/PSU event during reboot         | Investigate each entry; do not restore workloads until cleared  |
  
-# Step 6 — Confirm it is disabled +==== BMC Recovery (Unreachable BMC) ====
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status +
-</code>+
  
-Then re-run all Phase 3 validation steps.+If the BMC is unreachable after an update (no Redfish, no IPMI, no SSH), **do not attempt field recovery from a USB image**. The BMC recovery image, layout, and procedure are NVIDIA-internal and SKU-specific; running the wrong image can permanently brick the chassis BMC and is not covered by warranty.
  
----- +Instead:
- +
-===== 10. Troubleshooting =====+
  
-^  Symptom  ^  Cause  ^  Action  ^ +  - **Open an NVIDIA Enterprise Support case** with severity matching your maintenance window. 
-| "Invalid firmware file" error  | Wrong .fwpkg for the component  | GPU tray = HGX package; Motherboard tray = DGX package. Re-run ''show_version'' to confirm | +  - **Capture the BMC serial console** at **115200 8N1** (rear-panel serial port or serial-over-USB on the management cable) from the moment the BMC fails through any attempted recovery. Attach the full log to the support case; NVIDIA will require it before authorising any recovery image. 
-| VBIOS update blocked — GPU activity  | Process still using the GPU  | Run ''nvidia-smi --query-compute-apps'' and kill all PIDs. Drain K8s node. Retry update | +  - Provide: chassis SKU, current BMC version (last known good), target BMC version, the package P-number / file name attempted, and confirmation of whether the v1.0.0 / v1.1.1 → v1.1.3 → v24.09.1 hop was followed. 
-| BackgroundCopyStatus: Failed  | Wrong firmware file or transient BMC error  | Do **not** reboot. Retry the update command. Contact NVIDIA Support if retry fails | +  - Do not pull the chassis from the rack or attempt a factory reset until NVIDIA confirms the recovery path; some failures are recoverable from the host side without touching the BMC.
-| Server does not boot after reboot  | Failed or incomplete VBIOS flash  | Connect to BMC serial console. Check ''ipmitool sel list''. Contact NVIDIA Support — do not power cycle again | 
-| GPU missing after upgrade  | PCIe or driver issue  | Check PCIe with ''lspci''. Check ''dmesg'' for Xid errors. Reload driver. Run ''nvsm show health'' | 
-| NVLink ports show Inactive  | NVSwitch FW issue or fabric manager down  | Cold reboot. Restart ''nvidia-fabricmanager'' service. Contact NVIDIA Support if still inactive | 
-| Firmware version unchanged after reboot | Rebooted before background copy finished | Re-run the update. Wait for BackgroundCopyStatus: Completed. Reboot again | 
-| "Downgrade not allowed"  | ForceUpdate flag not enabled  | Follow Section 9.2 — enable ForceUpdate before running the downgrade command |
  
 ---- ----
Line 491: Line 751:
 ===== 11. Risk Register ===== ===== 11. Risk Register =====
  
-^  Risk  ^  Likelihood  ^  Impact  ^  Mitigation  ^ +^  Risk                                       ^  Likelihood  ^  Impact      ^  Mitigation  ^
-| Wrong firmware file flashed  | Medium  | Critical — GPU permanently bricked  | Run ''show_version'' first; verify filename matches GPU model; check checksum | +| Wrong firmware file flashed                 | Medium       | Critical     | Run ''show_version'' first; verify checksum  | 
-| Power loss during flash  | Low  | Critical — GPU permanently bricked  | Stable power only; never upgrade during power maintenance work  | +| Power loss during flash                     | Low          | Critical     | Stable power; avoid power maintenance windows  | 
-| Workload active during upgrade  | Medium  | High — update fails or GPU degraded | Verify ''nvidia-smi --query-compute-apps'' output is empty before starting  | +| Workload active during upgrade              | Medium       | High         | Verify ''nvidia-smi --query-compute-apps'' output is empty  | 
-| Driver / firmware version mismatch  | Medium  | High — GPU fails to initialise  | Check minimum driver requirements table before upgrading  | +| Driver / firmware mismatch                  | Medium       | High         | Check per-package release notes  | 
-| Network loss to BMC mid-update  | Low  | High — incomplete flash  | Use wired network; avoid VPN or WiFi for BMC management traffic  | +| Network loss to BMC mid-update              | Low          | High         | Wired network only  | 
-| Previous firmware package unavailable  | Medium  | High — no rollback path  | Save previous .fwpkg before every upgrade  | +| Previous .fwpkg unavailable                 | Medium       | High         | Save before every upgrade  | 
-| ForceUpdate flag left enabled  | Low  | Medium — security risk  | Always disable and verify immediately after any downgrade  | +| ForceUpdate left enabled                    | Low          | Medium       | Disable + verify after downgrade  | 
-| Reboot before background copy completes | Medium  | High — firmware does not activate  | Always poll BackgroundCopyStatus until ''Completed'' before rebooting |+| AC cycle skipped, EROT/CPLD not active      | High         | High         | AC step is mandatory in Phase 2  | 
 +| BMC version-hop skipped (v1.0.0/v1.1.1 → v24.09.1 direct) | Medium | Critical | Stage v1.0.0/v1.1.1 → v1.1.3 → v24.09.1; direct skip requires a BMC factory reset to recover  | 
 +| Reboot before background copy completes     | Medium       | High         | Poll ''BackgroundCopyStatus'' via Redfish until ''Completed''  |
  
 ---- ----
Line 505: Line 767:
 ===== 12. References ===== ===== 12. References =====
  
-  * [[https://docs.nvidia.com/dgx/dgxh100-fw-update-guide/|NVIDIA DGX H100/H200 Firmware Update Guide]] — primary reference: nvfwupd commands, update sequence, downgrade, troubleshooting +  * [[https://docs.nvidia.com/dgx/dgxh100-fw-update-guide/|NVIDIA DGX H100/H200 Firmware Update Guide]] 
-  * [[https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/|NVIDIA DGX A100 Firmware Update Container Release Notes]] +  * Per-package release notes — **always consult for driver minimums and known issues** 
-  * [[https://enterprise-support.nvidia.com|NVIDIA Enterprise Support Portal]] — firmware package downloads (login required)+  * [[https://enterprise-support.nvidia.com|NVIDIA Enterprise Support Portal]]
  
 ---- ----
- 
  
wiki/ai/gpu_firmware_upgrade.1778261107.txt.gz · Last modified: by smadanagopal