wiki:ai:gpu_firmware_upgrade

Differences

This shows you the differences between two versions of the page.

wiki:ai:gpu_firmware_upgrade [2026/05/11 12:53] smadanagopal wiki:ai:gpu_firmware_upgrade [2026/05/12 10:00] (current) smadanagopal
Line 6: Line 6:
   - **Validate Package** — checksum the firmware file, preview what will change   - **Validate Package** — checksum the firmware file, preview what will change
   - **Phase 2 (in this exact order):**   - **Phase 2 (in this exact order):**
-    - BMC firmware update +      - BMC firmware update 
-    - BMC reset (mc reset cold) +      - BMC reset (mc reset cold) 
-    - Motherboard tray update (SBIOS, CPLD, PSU, etc.) +      - Motherboard tray update (SBIOS, CPLD, PSU, etc.) 
-    - GPU tray update (VBIOS, NVSwitch, EROT, FPGA) +      - GPU tray update (VBIOS, NVSwitch, EROT, FPGA) 
-    - Poll BackgroundCopyStatus until **Completed** +      - Poll BackgroundCopyStatus until **Completed** 
-    - Chassis power cycle (graceful, via Redfish/ipmitool) +      - Chassis power cycle (graceful, via Redfish/ipmitool) 
-    - Re-check versions on staged components +      - Re-check versions on staged components 
-    - CommitImage on **HGX_FW_BMC_0** and **HGX_FW_HostBIOS_0** +      - CommitImage on **HostBMC_0** and **HostBIOS_0** 
-    - **AC power cycle** (full PDU/cord removal, 5–6 min) +      - **AC power cycle** (full PDU/cord removal, 5–6 min) 
-    - NIC, TPM, NVMe firmware (last — these are independent and tolerate the AC cycle) +      - NIC firmware via ''mlxfwmanager'' on the host OS; TPM, NVMe via BMC (last — independent of the EROT/CPLD chain)
   - **Validate** — confirm new versions, GPU health, NVLink, smoke test   - **Validate** — confirm new versions, GPU health, NVLink, smoke test
  
-<note warning> Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing.+<note warning> 
 +Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing.
  
-**Critical ordering:** BMC must be updated and reset *before* the motherboard and GPU trays so the BMC speaks the new Redfish schema used by the rest of the bundle. Skipping the AC power cycle leaves EROT/CPLD images staged but not activated. </note>+**Critical ordering:** BMC must be updated and reset *before* the motherboard and GPU trays so the BMC speaks the new Redfish schema used by the rest of the bundle. Skipping the AC power cycle leaves EROT/CPLD images staged but not activated. 
 +</note>
  
 ---- ----
Line 29: Line 31:
  
 **Scope — what this runbook covers:** **Scope — what this runbook covers:**
- 
   * **Hopper:** DGX H100/H200, HGX H100/H200   * **Hopper:** DGX H100/H200, HGX H100/H200
   * **Blackwell:** DGX B200, HGX B200, GB200 NVL72 compute tray (per-tray BMC), GB300   * **Blackwell:** DGX B200, HGX B200, GB200 NVL72 compute tray (per-tray BMC), GB300
Line 36: Line 37:
  
 **Blackwell-specific notes (GB200 / B200):** **Blackwell-specific notes (GB200 / B200):**
- +  * GB200 NVL72 racks are driven from the **NVIDIA Mission Control** rack-firmware tooling. Refer to the Mission Control rack-firmware document shipped with the bundle — section 7.1B below summarises the flow but does **not** replace it. 
-  * GB200 NVL72 racks have **two BMCs per compute tray** (HMC for the GPU/CPU complex and a separate chassis BMC) plus an **NVLink Switch tray BMC**. Update **HMC first**, then chassis BMC, then NVLink Switch tray, then compute tray firmware. +  * Bundles arrive as **two NVIDIA-released packages** plus the NVLink Switch tray package: ''nvfw_BMC_<P-number>.fwpkg'' and ''nvfw_Compute_<P-number>.fwpkg'' (Mission Control names them by NVIDIA "P-number" — e.g., **P4972** = early Blackwell baseline, **P4975** = first GA refresh, **P4978** = current — and selector JSONs are named ''BMC_Full.json'' and ''Compute_Full.json''). 
-  * **Grace CPU firmware** (''HGX_FW_GraceCPU_0'', ''HGX_FW_GraceCPU_1'') is updated with the SBIOS package on Blackwell. It activates on the chassis power cycle (no separate AC requirement). +  * **Ordering, simplified:** chassis BMC (BMC_Full) → **AC cycle between BMC and HMC updates** → HMC + GPU/Grace complex (Compute_Full) → NVLink Switch tray → final rack-level AC cycle for EROT/CPLD. The AC cycle in the middle is required so the new chassis BMC re-enumerates the HMC under the updated Redfish schema before Compute_Full is pushed. 
-  * **NVLink Switch tray** firmware (''NVSwitch_Tray_*'') is a separate .fwpkg and is updated from the rack-level management host, not the compute tray BMC. +  * **Activation:** Mission Control uses the documented activation method ''activate_fw -c RF_AUX_PWR_CYCLE'' (auxiliary-rail power cycle via Redfish), which replaces the ad-hoc "''chassis power cycle''" step that Hopper uses. 
-  * The mandatory **AC power cycle** for EROT/CPLD also applies on Blackwell — for GB200 NVL72 this means power-cycling the full rack PDU feeds (coordinate with DC ops). +  * The final mandatory **rack-level AC cycle** for EROT/CPLD still applies — for GB200 NVL72 this means power-cycling the full rack PDU feeds (coordinate with DC ops). 
-  * **nvfwupd ≥ 3.0.0** is required for Blackwell (HMC + Grace + NVLink Switch tray Redfish paths); 2.0.4 is the floor for Hopper only. +  * **nvfwupd floor:** 2.0.4 for Hopper; **2.0.5 for Blackwell** (HMC + Grace + NVLink Switch tray Redfish paths), or **2.0.9** if you intend to run parallel multi-package updates across the rack.
  
 **Scope — what this runbook does NOT cover:** **Scope — what this runbook does NOT cover:**
- 
   * DGX A100 (use the Firmware Update Container — separate runbook)   * DGX A100 (use the Firmware Update Container — separate runbook)
   * Standalone Mellanox/ConnectX adapter firmware in non-DGX hosts (use ''mlxfwmanager'' separately)   * Standalone Mellanox/ConnectX adapter firmware in non-DGX hosts (use ''mlxfwmanager'' separately)
Line 56: Line 56:
 ==== Components Updated ==== ==== Components Updated ====
  
-^  Component  ^  What It Controls  ^  Activation Trigger  ^ +^  Component   ^  What It Controls                            ^  Activation Trigger             ^ 
-| BMC  | Out-of-band server management  | BMC reset (mc reset cold)  | +| BMC          | Out-of-band server management                | BMC reset (mc reset cold)       | 
-| SBIOS / UEFI | Motherboard initialisation  | Chassis power cycle + CommitImage | +| SBIOS / UEFI | Motherboard initialisation                   | Chassis power cycle + CommitImage | 
-| CPLD  | Board-level glue logic, power sequencing  | **AC power cycle (mandatory)** | +| CPLD         | Board-level glue logic, power sequencing     | **AC power cycle (mandatory)** | 
-| VBIOS  | GPU boot, clocks, power limits, PCIe config  | Chassis power cycle  | +| VBIOS        | GPU boot, clocks, power limits, PCIe config  | Chassis power cycle             | 
-| NVSwitch FW  | GPU-to-GPU interconnect (NVLink fabric)  | Chassis power cycle  | +| NVSwitch FW  | GPU-to-GPU interconnect (NVLink fabric)      | Chassis power cycle             | 
-| EROT  | Hardware root of trust / secure boot  | **AC power cycle (mandatory)** | +| EROT         | Hardware root of trust / secure boot         | **AC power cycle (mandatory)** | 
-| FPGA  | Power management and control plane  | Chassis power cycle  | +| FPGA         | Power management and control plane           | Chassis power cycle             | 
-| PSU  | Power supply unit firmware  | Live, one at a time — redundant PSU carries load. Refuses to flash if redundancy is unhealthy. | +| PSU          | Power supply unit firmware                   | Live, one at a time — redundant PSU carries load. Refuses to flash if redundancy is unhealthy. |
 | NIC / ConnectX | Onboard NIC firmware (incl. ConnectX-7/8 on Blackwell) | Chassis power cycle (run AFTER the AC cycle in this runbook) | | NIC / ConnectX | Onboard NIC firmware (incl. ConnectX-7/8 on Blackwell) | Chassis power cycle (run AFTER the AC cycle in this runbook) |
-| TPM  | Trusted Platform Module  | Chassis power cycle (run AFTER the AC cycle in this runbook) | +| TPM          | Trusted Platform Module                      | Chassis power cycle (run AFTER the AC cycle in this runbook) | 
-| NVMe  | Onboard storage firmware  | Chassis power cycle (run AFTER the AC cycle in this runbook) | +| NVMe         | Onboard storage firmware                     | Chassis power cycle (run AFTER the AC cycle in this runbook) |
  
-<note important> EROT and CPLD images are staged into a secondary flash region by the update, but only activate on a full AC power removal — a warm reboot or chassis power cycle is not sufficient. </note>+<note important> 
 +EROT and CPLD images are staged into a secondary flash region by the update, but only activate on a full AC power removal — a warm reboot or chassis power cycle is not sufficient. 
 +</note>
  
 ==== Driver Compatibility ==== ==== Driver Compatibility ====
Line 84: Line 86:
 ==== Time Budget (typical per-node) ==== ==== Time Budget (typical per-node) ====
  
-^  Phase  ^  Estimated Time  ^ +^  Phase                                ^  Estimated Time  ^ 
-| Pre-upgrade preparation (Phase 1)  | 20 min  | +| Pre-upgrade preparation (Phase 1)     | 20 min           | 
-| BMC update + reset + verify (2.1–2.2) | 10 min  | +| BMC update + reset + verify (2.1–2.2) | 10 min           | 
-| Motherboard tray update (2.3)  | 15 min  | +| Motherboard tray update (2.3)         | 15 min           | 
-| GPU tray update (2.4)  | 25 min  | +| GPU tray update (2.4)                 | 25 min           | 
-| BackgroundCopyStatus poll (2.5)  | 10–20 min  | +| BackgroundCopyStatus poll (2.5)       | 10–20 min        | 
-| Chassis power cycle + recheck (2.6–2.7) | 10 min  | +| Chassis power cycle + recheck (2.6–2.7) | 10 min         | 
-| CommitImage (2.8)  | 2 min  | +| CommitImage (2.8)                     | 2 min            | 
-| AC power cycle (2.9, includes 6m drain) | 20 min  | +| AC power cycle (2.9, includes 6m drain) | 20 min         | 
-| NIC / TPM / NVMe + chassis cycle (2.10) | 15 min  | +| NIC / TPM / NVMe + chassis cycle (2.10) | 15 min         | 
-| Validation (Phase 3)  | 15 min  | +| Validation (Phase 3)                  | 15 min           | 
-| **Total — Hopper DGX/HGX**  | **~2h 30m**  | +| **Total — Hopper DGX/HGX**            | **~2h 30m**      |
 | GB200 NVL72 rack (add HMC, NVLink Switch tray, rack PDU cycle 10m+15m boot) | **add 1h 15m** | | GB200 NVL72 rack (add HMC, NVLink Switch tray, rack PDU cycle 10m+15m boot) | **add 1h 15m** |
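
The per-phase estimates above can be summed as a quick sanity check on the stated total. The sketch below is only arithmetic on the table's own figures; the helper name ''fmt_minutes'' and the variable names are illustrative, not part of any NVIDIA tool.

```bash
#!/usr/bin/env bash
# Sum the per-phase estimates from the Time Budget table (minutes).
# fmt_minutes is a hypothetical helper that renders minutes as "Xh YYm".
fmt_minutes() { printf '%dh %02dm\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"; }

# Fixed-duration phases in table order, excluding the BackgroundCopyStatus poll.
fixed=(20 10 15 25 10 2 20 15 15)
poll_low=10    # lower bound of the 10-20 min poll
poll_high=20   # upper bound of the 10-20 min poll
gb200_extra=75 # "add 1h 15m" for a GB200 NVL72 rack

low=$poll_low
high=$poll_high
for m in "${fixed[@]}"; do
  low=$((low + m))
  high=$((high + m))
done

echo "Hopper per node: $(fmt_minutes "$low") to $(fmt_minutes "$high")"
echo "GB200 NVL72:     $(fmt_minutes $((low + gb200_extra))) to $(fmt_minutes $((high + gb200_extra)))"
```

The Hopper range brackets the table's rounded ~2h 30m total; treat the numbers as planning estimates for the change window rather than guarantees.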
  
Line 106: Line 108:
 ==== Tool Overview ==== ==== Tool Overview ====
  
-^  Tool  ^  Purpose  ^  Where It Runs  ^ +^  Tool          ^  Purpose                                                        ^  Where It Runs           ^ 
-| **nvfwupd**  | Full firmware bundle update for DGX/HGX via BMC Redfish API  | Management host (remote) | +| **nvfwupd**    | Full firmware bundle update for DGX/HGX via BMC Redfish API     | Management host (remote) | 
-| **NVFlash**  | VBIOS-only flash for standalone PCIe GPUs  | GPU host (root required) | +| **NVFlash**    | VBIOS-only flash for standalone PCIe GPUs                       | GPU host (root required) | 
-| **nvidia-smi** | Query GPU health, firmware version, running processes  | GPU host  | +| **nvidia-smi** | Query GPU health, firmware version, running processes           | GPU host                 | 
-| **nvsm**  | DGX system-wide health check  | GPU host  | +| **nvsm**       | DGX system-wide health check                                    | GPU host                 | 
-| **ipmitool**  | Remote power cycle and BMC reset  | Management host  | +| **ipmitool**   | Remote power cycle and BMC reset                                | Management host          | 
-| **curl**  | Redfish polling (BackgroundCopyStatus, CommitImage)  | Management host  | +| **curl**       | Redfish polling (BackgroundCopyStatus, CommitImage)             | Management host          |
  
 ==== nvfwupd Version Requirement ==== ==== nvfwupd Version Requirement ====
  
-  * **Hopper (H100/H200) minimum: nvfwupd 2.0.4** — earlier versions do not support CommitImage or the HGX_FW_* inventory paths. +  * **Hopper (H100/H200) minimum: nvfwupd 2.0.4** — earlier versions do not support CommitImage or the inventory paths used below. 
-  * **Blackwell (B200 / GB200 / GB300) minimum: nvfwupd 3.0.0** — required for HMC, Grace CPU, and NVLink Switch tray Redfish paths.+  * **Blackwell (B200 / GB200 / GB300) minimum: nvfwupd 2.0.5** — required for HMC, Grace CPU, and NVLink Switch tray Redfish paths. Use **2.0.9** if you plan to push multiple packages in parallel from the rack management host (BMC_Full + Compute_Full).
   * Confirm with: ''nvfwupd --version''   * Confirm with: ''nvfwupd --version''
   * Download from the NVIDIA Enterprise Support Portal.   * Download from the NVIDIA Enterprise Support Portal.
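
The version floors above can be checked mechanically before a change window. The following is a minimal sketch: ''version_ge'' and ''min_nvfwupd'' are hypothetical helper names, the floors are the ones stated in this revision, and the installed version is assumed to be read by hand from ''nvfwupd --version'' (its exact output format is not shown on this page, so it is not parsed here).

```bash
#!/usr/bin/env bash
# version_ge A B: succeeds when dotted version A >= B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# min_nvfwupd PLATFORM [parallel]: prints the floor stated in this runbook.
# PLATFORM is "hopper" or "blackwell"; pass "parallel" as the second argument
# when pushing multiple packages in parallel from the rack management host.
min_nvfwupd() {
  case "$1:${2:-}" in
    hopper:*)           echo 2.0.4 ;;
    blackwell:parallel) echo 2.0.9 ;;
    blackwell:*)        echo 2.0.5 ;;
    *) echo "unknown platform: $1" >&2; return 2 ;;
  esac
}

# Example: the installed version is typed in by the operator (assumption).
installed="2.0.5"
if version_ge "$installed" "$(min_nvfwupd blackwell)"; then
  echo "nvfwupd $installed meets the Blackwell floor"
else
  echo "nvfwupd $installed is too old for Blackwell" >&2
fi
```

Using ''sort -V'' avoids the common mistake of comparing versions as plain strings, where 2.0.10 would sort before 2.0.9.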
Line 136: Line 138:
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable
 +
 +# Blackwell / Mission Control activation (replaces ipmitool mc reset cold)
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \
 +  activate_fw -c RF_AUX_PWR_CYCLE
 +
 +# --- mlxfwmanager (host OS, Hopper NIC updates) ---
 +sudo mlxfwmanager --query
 +sudo mlxfwmanager -u -i fw-ConnectX7-rel-<version>.bin --yes
  
 # --- NVFlash --- # --- NVFlash ---
Line 164: Line 174:
 ===== 4. Prerequisites ===== ===== 4. Prerequisites =====
  
-<note important> Complete every item below before starting. Do not proceed if any item cannot be confirmed. </note>+<note important> 
 +Complete every item below before starting. Do not proceed if any item cannot be confirmed. 
 +</note>
  
 ==== Access & Credentials ==== ==== Access & Credentials ====
- 
   * [ ] BMC IP address, username, and password (administrator privileges)   * [ ] BMC IP address, username, and password (administrator privileges)
   * [ ] SSH access to the target server   * [ ] SSH access to the target server
Line 174: Line 185:
  
 ==== Tooling ==== ==== Tooling ====
- 
   * [ ] **nvfwupd ≥ 2.0.4** installed on management host   * [ ] **nvfwupd ≥ 2.0.4** installed on management host
   * [ ] ipmitool, curl, jq available on management host   * [ ] ipmitool, curl, jq available on management host
  
 ==== Firmware Package ==== ==== Firmware Package ====
- 
   * [ ] Correct .fwpkg for your specific platform (GPU tray + motherboard tray packages)   * [ ] Correct .fwpkg for your specific platform (GPU tray + motherboard tray packages)
   * [ ] SHA256 checksum verified against release notes   * [ ] SHA256 checksum verified against release notes
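
One way to perform the checksum item above is sketched here. ''verify_sha256'' is a hypothetical helper; the expected digest must be copied from the release notes for your bundle, and the demo below uses a throwaway temp file rather than a real .fwpkg.

```bash
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HEX: succeeds only when the file's SHA256
# digest equals EXPECTED_HEX (comparison is case-insensitive).
verify_sha256() {
  local file=$1 expected=$2 actual
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "${actual,,}" = "${expected,,}" ]
}

# Demo on a temp file; a real run would pass the .fwpkg path and the
# digest published in the release notes.
demo=$(mktemp)
printf 'abc' > "$demo"
if verify_sha256 "$demo" "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"; then
  echo "checksum OK"
else
  echo "checksum MISMATCH, do not flash" >&2
fi
rm -f "$demo"
```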
Line 186: Line 195:
  
 ==== BMC Version Hop (pre-1.1.3 systems) ==== ==== BMC Version Hop (pre-1.1.3 systems) ====
- +  * [ ] If current BMC firmware is **v1.0.0 or v1.1.1**, you **must** stage to **v1.1.3** first, reset BMC, and verify, before attempting **v24.09.1** (or any later bundle). The full hop is **v1.0.0 / v1.1.1 → v1.1.3 → v24.09.1**. Skipping the intermediate hop requires a **factory reset** of the BMC to recover. 
-  * [ ] If current BMC firmware is **older than 1.1.3**, you **must** first update to the 1.1.3 intermediate package, reset BMC, and verify, before installing any 1.2.x+ bundle. Skipping this hop will brick the Redfish endpoint. +  * [ ] Confirm current BMC version (pin to the BMC Redfish path, not a substring match): 
-  * [ ] Confirm current BMC version: ''nvfwupd ... show_version | grep -i bmc'' +    <code bash> 
 +    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version \
 +      | grep -E '(^|\s)HostBMC_0\b' 
 +    </code>
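
The hop rule above can be written down as a small lookup so the required sequence is explicit in the change ticket. This is a sketch only: ''bmc_hop_plan'' is a hypothetical helper, it encodes just the versions named in this revision, and any other starting version is deliberately reported as unknown so the operator checks the release notes instead of guessing.

```bash
#!/usr/bin/env bash
# bmc_hop_plan CURRENT_VERSION: prints the ordered BMC targets to install,
# per the hop rule in this runbook (v1.0.0 / v1.1.1 -> v1.1.3 -> v24.09.1).
# Returns non-zero for versions the runbook does not mention.
bmc_hop_plan() {
  case "${1#v}" in
    1.0.0|1.1.1) echo "v1.1.3 v24.09.1" ;;  # intermediate hop is mandatory
    1.1.3)       echo "v24.09.1" ;;         # already on the intermediate version
    *) echo "no hop rule recorded for $1; check the release notes" >&2
       return 1 ;;
  esac
}

# Example: the current version as recorded from show_version.
for target in $(bmc_hop_plan v1.1.1); do
  echo "stage BMC $target, reset BMC, verify"
done
```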
  
 ==== Environment ==== ==== Environment ====
- 
   * [ ] All GPU workloads stopped   * [ ] All GPU workloads stopped
   * [ ] ''nvidia-smi'' is not running (passive queries block VBIOS updates)   * [ ] ''nvidia-smi'' is not running (passive queries block VBIOS updates)
Line 204: Line 215:
  
 ==== System State ==== ==== System State ====
- 
   * [ ] Management host can ping the BMC IP   * [ ] Management host can ping the BMC IP
   * [ ] Server on stable power   * [ ] Server on stable power
Line 211: Line 221:
  
 ==== Rollback Readiness ==== ==== Rollback Readiness ====
- 
   * [ ] Current firmware versions recorded in change ticket   * [ ] Current firmware versions recorded in change ticket
   * [ ] Previous firmware .fwpkg saved   * [ ] Previous firmware .fwpkg saved
Line 220: Line 229:
 ===== 5. Best Practices ===== ===== 5. Best Practices =====
  
-^  #  ^  Practice  ^  Why It Matters  ^ +^  #  ^  Practice                                       ^  Why It Matters                                               ^ 
-| 1  | Always run ''show_version'' first  | Confirms exactly what will change  | +| 1   | Always run ''show_version'' first                | Confirms exactly what will change                             | 
-| 2  | Back up VBIOS before flashing  | No backup = no rollback path  | +| 2   | Back up VBIOS before flashing                    | No backup = no rollback path                                  | 
-| 3  | Stop all GPU processes including ''nvidia-smi''  | Active processes block or corrupt VBIOS updates  | +| 3   | Stop all GPU processes including ''nvidia-smi''  | Active processes block or corrupt VBIOS updates               | 
-| 4  | Never interrupt mid-flash  | Closing terminal or losing power bricks the GPU  | +| 4   | Never interrupt mid-flash                        | Closing terminal or losing power bricks the GPU               | 
-| 5  | Verify the package checksum  | A corrupted .fwpkg silently fails  | +| 5   | Verify the package checksum                      | A corrupted .fwpkg silently fails                             | 
-| 6  | Wait for BackgroundCopyStatus: Completed  | Rebooting early means firmware never activates  | +| 6   | Wait for BackgroundCopyStatus: Completed         | Rebooting early means firmware never activates                | 
-| 7  | **Always AC-cycle after CommitImage**  | EROT and CPLD only activate on full AC removal  | +| 7   | **Always AC-cycle after CommitImage**            | EROT and CPLD only activate on full AC removal                | 
-| 8  | Use a wired management network  | VPN or WiFi drops leave flash incomplete  | +| 8   | Use a wired management network                   | VPN or WiFi drops leave flash incomplete                      | 
-| 9  | Disable ForceUpdate immediately after downgrade  | Leaving it enabled is a security risk  | +| 9   | Disable ForceUpdate immediately after downgrade  | Leaving it enabled is a security risk                         | 
-| 10  | Update BMC first, everything else after  | Subsequent components depend on the new BMC Redfish schema  | +| 10  | Update BMC first, everything else after          | Subsequent components depend on the new BMC Redfish schema    | 
-| 11  | Validate before restoring workloads  | Confirms upgrade succeeded  | +| 11  | Validate before restoring workloads              | Confirms upgrade succeeded                                    | 
-| 12  | Document pre and post versions  | Required for change management  | +| 12  | Document pre and post versions                   | Required for change management                                |
  
 ---- ----
Line 251: Line 260:
 Copy full output to your change ticket and fill in: Copy full output to your change ticket and fill in:
  
-^  Component  ^  Pre-Upgrade Version  ^  Target Version  ^ +^  Component   ^  Pre-Upgrade Version  ^  Target Version  ^ 
-| BMC  |  |  | +| BMC          |                       |                  | 
-| SBIOS  |  |  | +| SBIOS        |                       |                  | 
-| CPLD  |  |  | +| CPLD         |                       |                  | 
-| VBIOS  |  |  | +| VBIOS        |                       |                  | 
-| NVSwitch FW  |  |  | +| NVSwitch FW  |                       |                  | 
-| EROT FW  |  |  | +| EROT FW      |                       |                  | 
-| FPGA FW  |  |  | +| FPGA FW      |                       |                  | 
-| PSU  |  |  | +| PSU          |                       |                  | 
-| NIC  |  |  | +| NIC          |                       |                  | 
-| TPM  |  |  | +| TPM          |                       |                  | 
-| NVMe  |  |  | +| NVMe         |                       |                  |
  
 ==== Step 1.2 — Verify GPU Health ==== ==== Step 1.2 — Verify GPU Health ====
Line 317: Line 326:
 ===== 7. Phase 2 — Firmware Upgrade Execution ===== ===== 7. Phase 2 — Firmware Upgrade Execution =====
  
-<note warning> **Strict ordering — do not deviate.** BMC first, then motherboard tray, then GPU tray, then poll, then chassis power cycle, then CommitImage, then AC cycle, then NIC/TPM/NVMe. </note>+<note warning> 
 +**Strict ordering — do not deviate.** BMC first, then motherboard tray, then GPU tray, then poll, then chassis power cycle, then CommitImage, then AC cycle, then NIC/TPM/NVMe. 
 +</note>
  
 ==== 7.1 DGX / HGX — nvfwupd ==== ==== 7.1 DGX / HGX — nvfwupd ====
  
 === Step 2.1 — Update BMC === === Step 2.1 — Update BMC ===
 +
 +Use the documented one-shot BMC selector. Do not hand-enumerate child paths — let nvfwupd resolve them from the package.
  
 <code bash> <code bash>
Line 327: Line 340:
 { {
   "Targets": [   "Targets": [
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_BMC_0"+    "/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"
   ]   ]
 } }
Line 342: Line 355:
 sleep 180 sleep 180
 ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info
-nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version | grep -i bmc+ 
 +# Pin the recheck to the BMC Redfish path (not a substring match on "bmc", 
 +# which also matches HGX_FW_BMC_* aggregate entries on some packages). 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version \
 +  | grep -E '(^|\s)HostBMC_0\b'
 </code> </code>
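
A fixed ''sleep 180'' can be too short or needlessly long depending on the BMC. As an alternative, the sketch below retries a command until it succeeds or a timeout elapses. ''wait_for'' is a hypothetical helper, and the timeout and interval values in the usage comment are assumptions to tune for your hardware.

```bash
#!/usr/bin/env bash
# wait_for TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Runs CMD repeatedly until it exits 0 (returns 0) or until roughly
# TIMEOUT_S seconds of waiting have accumulated (returns 1).
# INTERVAL_S must be a positive integer.
wait_for() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    waited=$((waited + interval))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage after a BMC reset (placeholders as elsewhere in this runbook):
#   wait_for 600 15 ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info \
#     || echo "BMC did not respond within 10 minutes" >&2
```

The helper only reports that the BMC answers again; the version recheck against ''HostBMC_0'' is still needed afterwards.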
  
Line 348: Line 365:
  
 === Step 2.3 — Update Motherboard Tray (SBIOS, CPLD, PSU) === === Step 2.3 — Update Motherboard Tray (SBIOS, CPLD, PSU) ===
 +
 +The motherboard tray uses the **empty selector** (''{}''), which tells nvfwupd to apply every motherboard-side component the package declares. Do not hand-enumerate ''HostBIOS_0'' / ''CPLDMB_0'' / ''PSU_*'' — the path names vary by SKU and package, and a typo silently skips the component.
  
 <code bash> <code bash>
 cat > selector_mb.json <<'EOF' cat > selector_mb.json <<'EOF'
-{ +{}
-  "Targets": [ +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_HostBIOS_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_CPLDMB_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_PSU_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_PSU_1", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_PSU_2", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_PSU_3" +
-  ] +
-}+
 EOF EOF
  
Line 366: Line 376:
   update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_mb.json   update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_mb.json
 </code> </code>
 +
 +Cross-check beforehand with ''show_version -p <pkg.fwpkg>'' to confirm which components the package will touch.
  
 === Step 2.4 — Update GPU Tray (VBIOS, NVSwitch, EROT, FPGA) === === Step 2.4 — Update GPU Tray (VBIOS, NVSwitch, EROT, FPGA) ===
 +
 +The GPU tray uses the **HGX_0 one-shot selector**, which targets the entire GPU baseboard inventory in a single call. The BMC fans out internally to all GPUs, NVSwitch chips, EROT, and FPGA — do not enumerate per-GPU paths.
  
 <code bash> <code bash>
Line 373: Line 387:
 { {
   "Targets": [   "Targets": [
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_0", +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_0"
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_1", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_2", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_3", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_4", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_5", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_6", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GPU_SXM_7", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVSwitch_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVSwitch_1", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVSwitch_2", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVSwitch_3", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_ERoT_GPU_SXM_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_FPGA_0"+
   ]   ]
 } }
Line 394: Line 395:
   update_fw -p nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg -y -s selector_gpu.json   update_fw -p nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg -y -s selector_gpu.json
 </code> </code>
 +
 +Verify the exact child paths your platform exposes with ''nvfwupd ... show_version'' on a real DGX H100/H200 before relying on inventory names elsewhere in this runbook — they can differ between H100 and H200 chassis SKUs.
  
 === Step 2.5 — Wait for Background Copy to Complete === === Step 2.5 — Wait for Background Copy to Complete ===
Line 428: Line 431:
 curl -k -u <USER>:<PASS> -X POST \ curl -k -u <USER>:<PASS> -X POST \
   -H "Content-Type: application/json" \   -H "Content-Type: application/json" \
-  -d '{"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_BMC_0"]}' \+  -d '{"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]}' \
   https://<BMC_IP>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage   https://<BMC_IP>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage
  
Line 434: Line 437:
 curl -k -u <USER>:<PASS> -X POST \ curl -k -u <USER>:<PASS> -X POST \
   -H "Content-Type: application/json" \   -H "Content-Type: application/json" \
-  -d '{"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_HostBIOS_0"]}' \+  -d '{"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HostBIOS_0"]}' \
   https://<BMC_IP>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage   https://<BMC_IP>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage
 </code> </code>
Line 440: Line 443:
 === Step 2.9 — AC Power Cycle (MANDATORY) === === Step 2.9 — AC Power Cycle (MANDATORY) ===
  
-<note warning> This step is **non-negotiable** for EROT and CPLD activation. A chassis power cycle is **not** sufficient. </note>+<note warning> 
 +This step is **non-negotiable** for EROT and CPLD activation. A chassis power cycle is **not** sufficient. 
 +</note>
  
   - Power off the host via OS shutdown or ''ipmitool chassis power off''.   - Power off the host via OS shutdown or ''ipmitool chassis power off''.
Line 448: Line 453:
   - Wait 5–10 minutes for full boot.   - Wait 5–10 minutes for full boot.
  
-=== Step 2.10 — Update NIC, TPM, NVMe (and ConnectX-7/8 on Blackwell) ===+=== Step 2.10 — Update NIC (mlxfwmanager), TPM, NVMe ===
  
 These components are independent of the GPU/EROT activation chain and are updated last. These components are independent of the GPU/EROT activation chain and are updated last.
  
-**Hopper (DGX/HGX H100/H200):**+**Hopper (DGX/HGX H100/H200) — NICs via host-OS mlxfwmanager, not Redfish.** On Hopper the ConnectX adapters are not exposed under the HGX BMC's firmware inventory; the BMC will return 404 / "no such target" for ''HGX_FW_NIC_*'' paths. Update them from the host OS instead:
  
 <code bash> <code bash>
-cat > selector_periph.json <<'EOF' +# Inventory and current versions on the host OS 
-{ +sudo mlxfwmanager --query 
-  "Targets": [ + 
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NIC_0", +# Apply the bundled MFT firmware image (matches your ConnectX SKU) 
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NIC_1", +sudo mlxfwmanager -u -i fw-ConnectX7-rel-<version>.bin --yes 
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_TPM_0", + 
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_NVMe_0" +# Re-query to confirm the new FW Version is present and "Status: Done" 
-  ] +sudo mlxfwmanager --query 
-} +</code> 
-EOF + 
 +A subsequent chassis power cycle (or PCIe FLR via ''mlxfwreset'') is required to activate the new NIC firmware — do this **after** the AC cycle in Step 2.9. 
 + 
 +TPM and NVMe remain BMC-driven on Hopper. Confirm the exact inventory paths on your chassis with ''nvfwupd ... show_version'' before flashing — names like ''TPM_0'' / ''NVMe_0'' (without the ''HGX_FW_'' prefix) appear on some packages. 
 + 
 +<code bash> 
 +# After confirming the actual inventory names on your chassis: 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
 +  show_version | grep -Ei '(tpm|nvme)'
  
 +# Then build a selector_tpm_nvme.json with the paths returned above and:
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \ nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
-  update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_periph.json+  update_fw -p nvfw_DGX_<version>.fwpkg -y -s selector_tpm_nvme.json
 </code> </code>
  
Line 491: Line 505:
  
 Then host reboot: Then host reboot:
- 
 <code bash> <code bash>
 ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle
 </code> </code>
  
-==== 7.1B Blackwell — GB200 / B200 Additional Steps ====+==== 7.1B Blackwell — GB200 / B200 (Mission Control flow) ====
  
-Blackwell systems insert the following **before** Step 2.1 (BMC update) and **between** Step 2.2 and 2.3. +<note important> 
 +This subsection summarises the **NVIDIA Mission Control rack-firmware document** that ships with the Blackwell bundle. Cross-check every selector path and package P-number against the Mission Control doc for your bundle version before executing. The component path names below are the documented Mission Control names — they are not the same as the Hopper ''HGX_FW_*'' paths and must not be cross-pollinated. 
 +</note>
  
-=== Step 2.0a — Update HMC (GB200 only) ===+Blackwell delivers two top-level packages plus the NVLink Switch tray package:
  
-The HMC manages the GPU/Grace complex on each compute tray and must be updated **before** the chassis BMC. +^  Package name                    ^  Selector JSON        ^  P-number examples         ^
 +| ''nvfw_BMC_<P-number>.fwpkg''    | ''BMC_Full.json''    | P4972 (baseline), P4975 (GA refresh), P4978 (current) | 
 +| ''nvfw_Compute_<P-number>.fwpkg''| ''Compute_Full.json''| matched P-number to BMC package                       | 
 +| ''nvfw_NVLSwitch_<P-number>.fwpkg'' | ''NVLSwitch_Full.json'' | (rack-level NVLink Switch tray)                | 
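
Because the Compute package must carry the same P-number as the BMC package, a filename check before pushing anything can catch a mismatched pair. This is a sketch under one assumption: that package files follow the ''nvfw_<Kind>_<P-number>.fwpkg'' naming shown in the table. ''pnumber_of'' and ''same_pnumber'' are hypothetical helpers, and a matching filename does not replace the checksum and release-note checks.

```bash
#!/usr/bin/env bash
# pnumber_of PATH: prints the P-number embedded in a package filename of the
# form nvfw_<Kind>_<P-number>.fwpkg, or fails if the name does not match.
pnumber_of() {
  local base=${1##*/}
  [[ $base =~ ^nvfw_[A-Za-z]+_(P[0-9]+)\.fwpkg$ ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

# same_pnumber PKG_A PKG_B: succeeds only when both names parse and agree.
same_pnumber() {
  local a b
  a=$(pnumber_of "$1") && b=$(pnumber_of "$2") && [ "$a" = "$b" ]
}

# Example with the package names used in this runbook.
if same_pnumber nvfw_BMC_P4978.fwpkg nvfw_Compute_P4978.fwpkg; then
  echo "BMC and Compute packages share a P-number"
else
  echo "P-number mismatch or unrecognised filename, stop" >&2
fi
```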
 + 
 +**Ordering (per Mission Control):** 
 + 
 +  - **B1.** Push ''BMC_Full'' to the **chassis BMC**. 
 +  - **B2.** ''activate_fw -c RF_AUX_PWR_CYCLE'' on the chassis BMC. 
 +  - **B3.** **AC power cycle between BMC and HMC updates** (rack PDU cycle for NVL72; chassis cord pull for standalone B200). Wait 5 min for standby drain. This is required so the new chassis BMC re-enumerates the HMC under the updated Redfish schema before Compute_Full is pushed. 
 +  - **B4.** Push ''Compute_Full'' (HMC + Grace CPU FW + GPU/NVSwitch/EROT/FPGA on the compute tray). 
 +  - **B5.** ''activate_fw -c RF_AUX_PWR_CYCLE'' for the Compute_Full payload. 
 +  - **B6.** Push the NVLink Switch tray package from the rack-level management host against the NVLink Switch tray BMC (not the compute tray BMC). 
 +  - **B7.** Final **rack-level AC cycle** (Step 2.9 / 2.9b) to activate EROT and CPLD on every tray. 
 + 
 +=== Step B1 — Push BMC_Full to the chassis BMC === 
 + 
 +The Mission Control selector JSON is a one-shot; do not enumerate child components.
  
 <code bash> <code bash>
-cat > selector_hmc.json <<'EOF'+cat > BMC_Full.json <<'EOF'
 { {
   "Targets": [   "Targets": [
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_HMC_0"+    "/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"
   ]   ]
 } }
 EOF EOF
  
-nvfwupd -t ip=<HMC_IP> user=<USER> password=<PASS> servertype=GB200 \ +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
-  update_fw -p nvfw_GB200_<version>.fwpkg -y -s selector_hmc.json+  update_fw -p nvfw_BMC_P4978.fwpkg -y -s BMC_Full.json
 </code> </code>
  
-Reset HMC and confirm:+=== Step B2 — Activate the chassis BMC payload === 
 + 
 +Blackwell uses the documented Mission Control activation verb, not ''ipmitool mc reset cold'':
  
 <code bash> <code bash>
-curl -k -u <USER>:<PASS> -X POST \ +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
-  https://<HMC_IP>/redfish/v1/Managers/HGX_HMC_0/Actions/Manager.Reset \ +  activate_fw -c RF_AUX_PWR_CYCLE
-  -d '{"ResetType":"GracefulRestart"}' +
-sleep 180 +
-nvfwupd -t ip=<HMC_IP> user=<USER> password=<PASS> show_version | grep -i hmc+
 </code> </code>
  
-=== Step 2.2b — Update Grace CPU Firmware (Blackwell) ===+=== Step B3 — AC cycle between BMC and HMC updates === 
 + 
 +This is **not** the final EROT/CPLD AC cycle — it sits between the two halves of the bundle and is mandatory. 
 + 
 +  - Standalone B200: chassis cord pull, 5 min drain. 
 +  - GB200 NVL72: rack-level PDU cycle (both A-side and B-side feeds), 10 min drain, 15 min boot. 
 +  - Confirm the chassis BMC comes back on the new firmware before proceeding: ''nvfwupd ... show_version | grep -E '(^|\s)HostBMC_0\b'''
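
The wait-then-confirm part of this step can be scripted. This is a hedged sketch: ''retry'' is a generic helper written for this example, and ''has_component'' only wraps the ''grep -E'' pattern from the last bullet. The layout of ''show_version'' output is not reproduced on this page, so treat the matching as an assumption and adjust it to what your nvfwupd version prints.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Hypothetical helper: re-run COMMAND until it succeeds or ATTEMPTS run out.
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "gave up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# has_component NAME
# Read version output on stdin and succeed if some line contains NAME as a
# whole word. Same pattern as the grep in the step above; \s and \b are
# GNU grep extensions.
has_component() {
    grep -Eq "(^|\s)$1\b"
}
```

Wrapping the ''show_version'' call in a small function that pipes into ''has_component HostBMC_0'' and calling ''retry 30 30'' on that function polls every 30 s for roughly 15 min. The sketch confirms only that the component is listed; comparing the reported version against the target is still a manual check.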
 + 
 +=== Step B4 — Push Compute_Full ===
  
-Bundled with the SBIOS package; activates on chassis power cycle.+''Compute_Full'' bundles HMC, Grace CPU FW, GPU/NVSwitch/EROT/FPGA for the compute tray.
  
 <code bash> <code bash>
-cat > selector_grace.json <<'EOF'+cat > Compute_Full.json <<'EOF'
 { {
   "Targets": [   "Targets": [
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GraceCPU_0", +    "/redfish/v1/UpdateService/FirmwareInventory/HGX_0"
-    "/redfish/v1/UpdateService/FirmwareInventory/HGX_FW_GraceCPU_1"+
   ]   ]
 } }
Line 542: Line 580:
  
 nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \
-  update_fw -p nvfw_GB200_<version>.fwpkg -y -s selector_grace.json+  update_fw -p nvfw_Compute_P4978.fwpkg -y -s Compute_Full.json
 </code> </code>
  
-=== Step 2.4b — Update NVLink Switch Tray (GB200 NVL72) ===+=== Step B5 — Activate Compute_Full ===
  
-Run from the rack-level management host against the NVLink Switch tray BMC, **not** the compute tray BMC.+<code bash> 
 +nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> servertype=GB200 \ 
 +  activate_fw -c RF_AUX_PWR_CYCLE 
 +</code> 
 + 
 +=== Step B6 — NVLink Switch tray (GB200 NVL72) === 
 + 
 +Run from the rack-level management host against the NVLink Switch tray BMC. Confirm the exact tray-side selector path names from the Mission Control NVLink-Switch addendum for your bundle — they have changed between P-numbers.
  
 <code bash> <code bash>
-cat > selector_nvlsw.json <<'EOF' +nvfwupd -t ip=<NVL_SW_BMC_IP> user=<USER> password=<PASS> servertype=NVL_SWITCH \ 
-{ +  update_fw -p nvfw_NVLSwitch_P4978.fwpkg -y -s NVLSwitch_Full.json
-  "Targets": [+
-    "/redfish/v1/UpdateService/FirmwareInventory/NVSwitch_Tray_BMC_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/NVSwitch_Tray_FPGA_0", +
-    "/redfish/v1/UpdateService/FirmwareInventory/NVSwitch_Tray_ERoT_0" +
-  ] +
-}+
-EOF+
  
 nvfwupd -t ip=<NVL_SW_BMC_IP> user=<USER> password=<PASS> servertype=NVL_SWITCH \ nvfwupd -t ip=<NVL_SW_BMC_IP> user=<USER> password=<PASS> servertype=NVL_SWITCH \
-  update_fw -p nvfw_NVLSwitch_<version>.fwpkg -y -s selector_nvlsw.json+  activate_fw -c RF_AUX_PWR_CYCLE
 </code> </code>
  
-=== Step 2.9b — Rack AC Cycle (GB200 NVL72) === +=== Step B7 — Final rack-level AC cycle (GB200 NVL72) ===
- +
-For GB200 NVL72 the AC step is a **rack-level PDU cycle**, not a single chassis cord pull. Coordinate with DC operations:+
  
 +For GB200 NVL72 the final AC step is a **rack-level PDU cycle**, not a single chassis cord pull. Coordinate with DC operations:
   - Schedule a full rack power-down window.   - Schedule a full rack power-down window.
   - Open both A-side and B-side PDU feeds (the rack is fed redundantly — pulling only one side leaves standby rails energised).   - Open both A-side and B-side PDU feeds (the rack is fed redundantly — pulling only one side leaves standby rails energised).
Line 634: Line 671:
 ==== Step 3.6 — Record Final State ==== ==== Step 3.6 — Record Final State ====
  
-^  Component  ^  Pre-Upgrade  ^  Post-Upgrade  ^  Status  ^ +^  Component   ^  Pre-Upgrade  ^  Post-Upgrade  ^  Status      ^
-| BMC  |  |  | Pass / Fail  | +| BMC          |               |                | Pass / Fail  |
-| SBIOS  |  |  | Pass / Fail  | +| SBIOS        |               |                | Pass / Fail  |
-| CPLD  |  |  | Pass / Fail  | +| CPLD         |               |                | Pass / Fail  |
-| VBIOS  |  |  | Pass / Fail  | +| VBIOS        |               |                | Pass / Fail  |
-| NVSwitch FW  |  |  | Pass / Fail  | +| NVSwitch FW  |               |                | Pass / Fail  |
-| EROT FW  |  |  | Pass / Fail  | +| EROT FW      |               |                | Pass / Fail  |
-| FPGA FW  |  |  | Pass / Fail  | +| FPGA FW      |               |                | Pass / Fail  |
-| PSU  |  |  | Pass / Fail  | +| PSU          |               |                | Pass / Fail  |
-| NIC  |  |  | Pass / Fail  | +| NIC          |               |                | Pass / Fail  |
-| TPM  |  |  | Pass / Fail  | +| TPM          |               |                | Pass / Fail  |
-| NVMe  |  |  | Pass / Fail  | +| NVMe         |               |                | Pass / Fail  |
  
 ==== Step 3.7 — Capture Post-Upgrade SEL and Restart Services ==== ==== Step 3.7 — Capture Post-Upgrade SEL and Restart Services ====
Line 683: Line 720:
 ===== 10. Troubleshooting ===== ===== 10. Troubleshooting =====
  
-^  Symptom  ^  Cause  ^  Action  ^ +^  Symptom                                  ^  Cause                                      ^  Action  ^
-| ''Invalid firmware file''  | Wrong .fwpkg  | Re-check ''show_version -p'' against the component  | +| ''Invalid firmware file''                 | Wrong .fwpkg                                | Re-check ''show_version -p'' against the component  |
-| ''servertype not recognised'' / Redfish 404 | nvfwupd defaulting to wrong server type  | Add ''servertype=DGX'' (or ''HGX'') to the ''-t'' target string | +| ''servertype not recognised'' / Redfish 404 | nvfwupd defaulting to wrong server type   | Add ''servertype=DGX'' (or ''HGX'') to the ''-t'' target string | 
-| VBIOS update blocked: GPU activity  | Process still using the GPU  | Kill PIDs from ''nvidia-smi --query-compute-apps''; drain K8s node  | +| VBIOS update blocked: GPU activity        | Process still using the GPU                 | Kill PIDs from ''nvidia-smi --query-compute-apps''; drain K8s node  |
-| BackgroundCopyStatus: Failed  | Wrong firmware file or transient BMC error  | Do **not** reboot. Retry. Contact NVIDIA Support  | +| BackgroundCopyStatus: Failed              | Wrong firmware file or transient BMC error  | Do **not** reboot. Retry. Contact NVIDIA Support  |
-| EROT/CPLD version unchanged after reboot  | AC power cycle skipped or too short  | Repeat Step 2.9: full AC removal, 5–6 minute wait  | +| EROT/CPLD version unchanged after reboot  | AC power cycle skipped or too short          | Repeat Step 2.9: full AC removal, 5–6 minute wait  | 
-| Server does not boot after AC cycle  | Failed flash  | BMC serial console + ''ipmitool sel list''; contact NVIDIA Support  | +| Server does not boot after AC cycle       | Failed flash                                | BMC serial console + ''ipmitool sel list''; contact NVIDIA Support  |
-| GPU missing after upgrade  | PCIe or driver issue  | ''lspci'', ''dmesg'' for Xid; reload driver  | +| GPU missing after upgrade                 | PCIe or driver issue                        | ''lspci'', ''dmesg'' for Xid; reload driver  |
-| NVLink ports Inactive  | NVSwitch FW or fabric manager  | Cold reboot; restart ''nvidia-fabricmanager''  | +| NVLink ports Inactive                     | NVSwitch FW or fabric manager               | Cold reboot; restart ''nvidia-fabricmanager''  |
-| Firmware unchanged after reboot  | Rebooted before background copy finished  | Re-run, wait for ''Completed''  | +| Firmware unchanged after reboot           | Rebooted before background copy finished    | Re-run, wait for ''Completed''  |
-| ''Downgrade not allowed''  | ForceUpdate flag not enabled  | See Section 9  | +| ''Downgrade not allowed''                 | ForceUpdate flag not enabled                | See Section 9  | 
-| BMC 1.1.→ 1.2.x update bricks Redfish  | Skipped the 1.1.3 intermediate hop  | See "BMC Recovery" below; flash 1.1.3 first  | +| BMC unreachable after a version-hop skip  | Skipped the v1.1.3 intermediate hop (v1.0.0/v1.1.1 → v24.09.1 direct)  | Factory-reset the BMC, then re-stage v1.0.0/v1.1.1 → v1.1.3 → v24.09.1; see "BMC Recovery" below  | 
-| PSU update refuses to start  | PSU redundancy degraded  | ''nvsm show health | grep -i psu''; replace failed PSU before retrying  | +| PSU update refuses to start               | PSU redundancy degraded                     | ''nvsm show health | grep -i psu''; replace failed PSU before retrying  |
-| Fabric Manager fails to start post-upgrade | NVSwitch FW / driver mismatch  | Confirm driver ≥ release notes min; ''journalctl -u nvidia-fabricmanager''; reload ''nvidia'' module  | +| Fabric Manager fails to start post-upgrade | NVSwitch FW / driver mismatch              | Confirm driver ≥ release notes min; ''journalctl -u nvidia-fabricmanager''; reload ''nvidia'' module  |
-| New SEL entries after upgrade  | Thermal/fan/PSU event during reboot  | Investigate each entry; do not restore workloads until cleared  |+| New SEL entries after upgrade             | Thermal/fan/PSU event during reboot         | Investigate each entry; do not restore workloads until cleared  |
  
-==== BMC Recovery (Bricked BMC) ====+==== BMC Recovery (Unreachable BMC) ====
  
-If the BMC is unreachable after an update (no Redfish, no IPMI, no SSH):+If the BMC is unreachable after an update (no Redfish, no IPMI, no SSH), **do not attempt field recovery from a USB image**. The BMC recovery image, layout, and procedure are NVIDIA-internal and SKU-specific; running the wrong image can permanently brick the chassis BMC and is not covered by warranty.
  
-  - Connect to the **BMC serial console** via the rear-panel serial port or the integrated serial-over-USB on the management cable. Default baud: 115200 8N1. +Instead: 
-  - At the BMC u-boot/recovery prompt, confirm you have the recovery shell: `help` should list `flash`, `boot`, `reset`. + 
-  - Prepare a **USB flash drive (FAT32)** containing the BMC recovery image (''bmc_recovery_<version>.img'') downloaded from the NVIDIA Enterprise Support Portal. File naming and partition layout differ between H100 and B200; use the recovery image that matches the **chassis SKU**, not the package version you were trying to install. +  - **Open an NVIDIA Enterprise Support case** with severity matching your maintenance window. 
-  - Insert the USB into the BMC's dedicated recovery USB port (rear panel, labelled ''BMC RECOVERY''; **not** a host USB port). +  - **Capture the BMC serial console** at **115200 8N1** (rear-panel serial port or serial-over-USB on the management cable) from the moment the BMC fails through any attempted recovery. Attach the full log to the support case; NVIDIA will require it before authorising any recovery image. 
-  - From the recovery shell: ''flash usb bmc_recovery_<version>.img'' then ''reset''. +  - Provide: chassis SKU, current BMC version (last known good), target BMC version, the package P-number / file name attempted, and confirmation of whether the v1.0.0 / v1.1.1 → v1.1.3 → v24.09.1 hop was followed. 
-  - Allow 5 min for the BMC to reboot; then re-establish IPMI and re-run the failed update **at the intermediate version** (e.g. 1.1.3) before re-attempting the target version. +  - Do not pull the chassis from the rack or attempt a factory reset until NVIDIA confirms the recovery path; some failures are recoverable from the host side without touching the BMC.
-  - Only proceed to NVIDIA Support if recovery flash itself fails; they will require the serial console log.
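
For the serial capture, any terminal program that logs to a file will do. The following is a minimal sketch using only ''stty'', ''cat'' and ''tee'' on a Linux host with GNU coreutils (''stty -F'' is the GNU spelling). The function name ''capture_serial'', the device path and the log file name are assumptions for a USB serial adapter, not values taken from NVIDIA documentation.

```shell
#!/bin/sh
# capture_serial DEVICE LOGFILE
# Configure DEVICE for 115200 8N1 and append everything it emits to LOGFILE
# until interrupted with Ctrl-C. Read-only: nothing is sent to the BMC.
capture_serial() {
    dev=$1; log=$2
    if [ ! -c "$dev" ]; then
        echo "not a character device: $dev" >&2
        return 2
    fi
    # 115200 baud, 8 data bits, no parity, 1 stop bit, raw mode, no local echo.
    stty -F "$dev" 115200 cs8 -parenb -cstopb raw -echo || return 1
    echo "# capture started $(date -u +%Y-%m-%dT%H:%M:%SZ) on $dev" >> "$log"
    # tee shows the console live while appending it to the log.
    cat "$dev" | tee -a "$log"
}
```

A typical call would be ''capture_serial /dev/ttyUSB0 bmc_console.log'', started before any further action on the chassis so the log covers the whole incident.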
  
 ---- ----
Line 715: Line 751:
 ===== 11. Risk Register ===== ===== 11. Risk Register =====
  
-^  Risk  ^  Likelihood  ^  Impact  ^  Mitigation  ^ +^  Risk                                       ^  Likelihood  ^  Impact      ^  Mitigation  ^
-| Wrong firmware file flashed  | Medium  | Critical  | ''show_version'' first; checksum  | +| Wrong firmware file flashed                 | Medium       | Critical     | ''show_version'' first; checksum  |
-| Power loss during flash  | Low  | Critical  | Stable power; avoid power maintenance windows  | +| Power loss during flash                     | Low          | Critical     | Stable power; avoid power maintenance windows  |
-| Workload active during upgrade  | Medium  | High  | Verify ''nvidia-smi'' apps query empty  | +| Workload active during upgrade              | Medium       | High         | Verify ''nvidia-smi'' apps query empty  | 
-| Driver / firmware mismatch  | Medium  | High  | Check per-package release notes  | +| Driver / firmware mismatch                  | Medium       | High         | Check per-package release notes  | 
-| Network loss to BMC mid-update  | Low  | High  | Wired network only  | +| Network loss to BMC mid-update              | Low          | High         | Wired network only  | 
-| Previous .fwpkg unavailable  | Medium  | High  | Save before every upgrade  | +| Previous .fwpkg unavailable                 | Medium       | High         | Save before every upgrade  |
-| ForceUpdate left enabled  | Low  | Medium  | Disable + verify after downgrade  | +| ForceUpdate left enabled                    | Low          | Medium       | Disable + verify after downgrade  |
-| AC cycle skipped, EROT/CPLD not active  | High  | High  | AC step is mandatory in Phase 2  | +| AC cycle skipped, EROT/CPLD not active      | High         | High         | AC step is mandatory in Phase 2  | 
-| BMC version-hop skipped (pre-1.1.3 → 1.2.x) | Medium  | Critical  | Stage to 1.1.3 first  | +| BMC version-hop skipped (v1.0.0/v1.1.1 → v24.09.1 direct) | Medium | Critical | Stage v1.0.0/v1.1.1 → v1.1.3 → v24.09.1; direct skip requires a BMC factory reset to recover  | 
-| Reboot before background copy completes  | Medium  | High  | Poll Redfish until ''Completed''  |+| Reboot before background copy completes     | Medium       | High         | Poll Redfish until ''Completed''  |
  
 ---- ----
Line 736: Line 772:
  
 ---- ----
- 
  
wiki/ai/gpu_firmware_upgrade.1778504017.txt.gz · Last modified: by smadanagopal