This shows you the differences between two versions of the page.
| wiki:ai:gpu_firmware_upgrade [2026/05/11 12:53] – smadanagopal | wiki:ai:gpu_firmware_upgrade [2026/05/12 10:00] (current) – smadanagopal | ||
|---|---|---|---|
| Line 6: | Line 6: | ||
| - **Validate Package** — checksum the firmware file, preview what will change | - **Validate Package** — checksum the firmware file, preview what will change | ||
| - **Phase 2 (in this exact order):** | - **Phase 2 (in this exact order):** | ||
| - | | + | |
| - | - BMC reset (mc reset cold) | + | - BMC reset (mc reset cold) |
| - | - Motherboard tray update (SBIOS, CPLD, PSU, etc.) | + | - Motherboard tray update (SBIOS, CPLD, PSU, etc.) |
| - | - GPU tray update (VBIOS, NVSwitch, EROT, FPGA) | + | - GPU tray update (VBIOS, NVSwitch, EROT, FPGA) |
| - | - Poll BackgroundCopyStatus until **Completed** | + | - Poll BackgroundCopyStatus until **Completed** |
| - | - Chassis power cycle (graceful, via Redfish/ | + | - Chassis power cycle (graceful, via Redfish/ |
| - | - Re-check versions on staged components | + | - Re-check versions on staged components |
| - | - CommitImage on **HGX_FW_BMC_0** and **HGX_FW_HostBIOS_0** | + | - CommitImage on **HostBMC_0** and **HostBIOS_0** |
| - | - **AC power cycle** (full PDU/cord removal, 5–6 min) | + | - **AC power cycle** (full PDU/cord removal, 5–6 min) |
| - | - NIC, TPM, NVMe firmware | + | - NIC firmware via '' |
| - **Validate** — confirm new versions, GPU health, NVLink, smoke test | - **Validate** — confirm new versions, GPU health, NVLink, smoke test | ||
| - | <note warning> Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing. | + | <note warning> |
| + | Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing. | ||
| - | **Critical ordering:** BMC must be updated and reset *before* the motherboard and GPU trays so the BMC speaks the new Redfish schema used by the rest of the bundle. Skipping the AC power cycle leaves EROT/CPLD images staged but not activated. </ | + | **Critical ordering:** BMC must be updated and reset *before* the motherboard and GPU trays so the BMC speaks the new Redfish schema used by the rest of the bundle. Skipping the AC power cycle leaves EROT/CPLD images staged but not activated. |
| + | </ | ||
| ---- | ---- | ||
| Line 29: | Line 31: | ||
| **Scope — what this runbook covers:** | **Scope — what this runbook covers:** | ||
| - | |||
| * **Hopper:** DGX H100/H200, HGX H100/H200 | * **Hopper:** DGX H100/H200, HGX H100/H200 | ||
| * **Blackwell: | * **Blackwell: | ||
| Line 36: | Line 37: | ||
| **Blackwell-specific notes (GB200 / B200):** | **Blackwell-specific notes (GB200 / B200):** | ||
| - | + | | |
| - | | + | |
| - | | + | * **Ordering, simplified: |
| - | * **NVLink Switch tray** firmware ('' | + | * **Activation:** Mission Control uses the documented activation method |
| - | * The mandatory **AC power cycle** for EROT/ | + | * The final mandatory **rack-level |
| - | * **nvfwupd | + | * **nvfwupd |
| **Scope — what this runbook does NOT cover:** | **Scope — what this runbook does NOT cover:** | ||
| - | |||
| * DGX A100 (use the Firmware Update Container — separate runbook) | * DGX A100 (use the Firmware Update Container — separate runbook) | ||
| * Standalone Mellanox/ | * Standalone Mellanox/ | ||
| Line 56: | Line 56: | ||
| ==== Components Updated ==== | ==== Components Updated ==== | ||
| - | ^ Component | + | ^ Component |
| - | | BMC | Out-of-band server management | + | | BMC | Out-of-band server management |
| - | | SBIOS / UEFI | Motherboard initialisation | + | | SBIOS / UEFI | Motherboard initialisation |
| - | | CPLD | Board-level glue logic, power sequencing | + | | CPLD |
| - | | VBIOS | GPU boot, clocks, power limits, PCIe config | + | | VBIOS | GPU boot, clocks, power limits, PCIe config |
| - | | NVSwitch FW | GPU-to-GPU interconnect (NVLink fabric) | + | | NVSwitch FW | GPU-to-GPU interconnect (NVLink fabric) |
| - | | EROT | Hardware root of trust / secure boot | **AC power cycle (mandatory)** | + | | EROT |
| - | | FPGA | Power management and control plane | Chassis power cycle | | + | | FPGA |
| - | | PSU | Power supply unit firmware | + | | PSU | Power supply unit firmware |
| | NIC / ConnectX | Onboard NIC firmware (incl. ConnectX-7/ | | NIC / ConnectX | Onboard NIC firmware (incl. ConnectX-7/ | ||
| - | | TPM | Trusted Platform Module | + | | TPM | Trusted Platform Module |
| - | | NVMe | Onboard storage firmware | + | | NVMe |
| - | <note important> | + | <note important> |
| + | EROT and CPLD images are staged into a secondary flash region by the update, but only activate on a full AC power removal — a warm reboot or chassis power cycle is not sufficient. | ||
| + | </ | ||
| ==== Driver Compatibility ==== | ==== Driver Compatibility ==== | ||
| Line 84: | Line 86: | ||
| ==== Time Budget (typical per-node) ==== | ==== Time Budget (typical per-node) ==== | ||
| - | ^ Phase ^ Estimated Time ^ | + | ^ Phase ^ Estimated Time ^ |
| - | | Pre-upgrade preparation (Phase 1) | 20 min | | + | | Pre-upgrade preparation (Phase 1) |
| - | | BMC update + reset + verify (2.1–2.2) | 10 min | | + | | BMC update + reset + verify (2.1–2.2) | 10 min |
| - | | Motherboard tray update (2.3) | 15 min | | + | | Motherboard tray update (2.3) |
| - | | GPU tray update (2.4) | 25 min | | + | | GPU tray update (2.4) |
| - | | BackgroundCopyStatus poll (2.5) | 10–20 min | | + | | BackgroundCopyStatus poll (2.5) |
| - | | Chassis power cycle + recheck (2.6–2.7) | 10 min | | + | | Chassis power cycle + recheck (2.6–2.7) | 10 min |
| - | | CommitImage (2.8) | 2 min | | + | | CommitImage (2.8) |
| - | | AC power cycle (2.9, includes 6m drain) | 20 min | | + | | AC power cycle (2.9, includes 6m drain) | 20 min |
| - | | NIC / TPM / NVMe + chassis cycle (2.10) | 15 min | | + | | NIC / TPM / NVMe + chassis cycle (2.10) | 15 min |
| - | | Validation (Phase 3) | 15 min | | + | | Validation (Phase 3) | 15 min |
| - | | **Total — Hopper DGX/ | + | | **Total — Hopper DGX/ |
| | GB200 NVL72 rack (add HMC, NVLink Switch tray, rack PDU cycle 10m+15m boot) | **add 1h 15m** | | | GB200 NVL72 rack (add HMC, NVLink Switch tray, rack PDU cycle 10m+15m boot) | **add 1h 15m** | | ||
| Line 106: | Line 108: | ||
| ==== Tool Overview ==== | ==== Tool Overview ==== | ||
| - | ^ Tool ^ Purpose | + | ^ Tool ^ Purpose |
| - | | **nvfwupd** | + | | **nvfwupd** |
| - | | **NVFlash** | + | | **NVFlash** |
| - | | **nvidia-smi** | Query GPU health, firmware version, running processes | + | | **nvidia-smi** | Query GPU health, firmware version, running processes |
| - | | **nvsm** | + | | **nvsm** |
| - | | **ipmitool** | + | | **ipmitool** |
| - | | **curl** | + | | **curl** |
| ==== nvfwupd Version Requirement ==== | ==== nvfwupd Version Requirement ==== | ||
| - | * **Hopper (H100/H200) minimum: nvfwupd 2.0.4** — earlier versions do not support CommitImage or the HGX_FW_* | + | * **Hopper (H100/H200) minimum: nvfwupd 2.0.4** — earlier versions do not support CommitImage or the inventory paths used below. |
| - | * **Blackwell (B200 / GB200 / GB300) minimum: nvfwupd | + | * **Blackwell (B200 / GB200 / GB300) minimum: nvfwupd |
| * Confirm with: '' | * Confirm with: '' | ||
| * Download from the NVIDIA Enterprise Support Portal. | * Download from the NVIDIA Enterprise Support Portal. | ||
| Line 136: | Line 138: | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| + | |||
| + | # Blackwell / Mission Control activation (replaces ipmitool mc reset cold) | ||
| + | nvfwupd -t ip=< | ||
| + | activate_fw -c RF_AUX_PWR_CYCLE | ||
| + | |||
| + | # --- mlxfwmanager (host OS, Hopper NIC updates) --- | ||
| + | sudo mlxfwmanager --query | ||
| + | sudo mlxfwmanager -u -i fw-ConnectX7-rel-< | ||
| # --- NVFlash --- | # --- NVFlash --- | ||
| Line 164: | Line 174: | ||
| ===== 4. Prerequisites ===== | ===== 4. Prerequisites ===== | ||
| - | <note important> | + | <note important> |
| + | Complete every item below before starting. Do not proceed if any item cannot be confirmed. | ||
| + | </ | ||
| ==== Access & Credentials ==== | ==== Access & Credentials ==== | ||
| - | |||
| * [ ] BMC IP address, username, and password (administrator privileges) | * [ ] BMC IP address, username, and password (administrator privileges) | ||
| * [ ] SSH access to the target server | * [ ] SSH access to the target server | ||
| Line 174: | Line 185: | ||
| ==== Tooling ==== | ==== Tooling ==== | ||
| - | |||
| * [ ] **nvfwupd ≥ 2.0.4** installed on management host | * [ ] **nvfwupd ≥ 2.0.4** installed on management host | ||
| * [ ] ipmitool, curl, jq available on management host | * [ ] ipmitool, curl, jq available on management host | ||
| ==== Firmware Package ==== | ==== Firmware Package ==== | ||
| - | |||
| * [ ] Correct .fwpkg for your specific platform (GPU tray + motherboard tray packages) | * [ ] Correct .fwpkg for your specific platform (GPU tray + motherboard tray packages) | ||
| * [ ] SHA256 checksum verified against release notes | * [ ] SHA256 checksum verified against release notes | ||
| Line 186: | Line 195: | ||
| ==== BMC Version Hop (pre-1.1.3 systems) ==== | ==== BMC Version Hop (pre-1.1.3 systems) ==== | ||
| - | + | | |
| - | | + | * [ ] Confirm current BMC version |
| - | * [ ] Confirm current BMC version: | + | <code bash> |
| + | | ||
| + | | ||
| + | </ | ||
| ==== Environment ==== | ==== Environment ==== | ||
| - | |||
| * [ ] All GPU workloads stopped | * [ ] All GPU workloads stopped | ||
| * [ ] '' | * [ ] '' | ||
| Line 204: | Line 215: | ||
| ==== System State ==== | ==== System State ==== | ||
| - | |||
| * [ ] Management host can ping the BMC IP | * [ ] Management host can ping the BMC IP | ||
| * [ ] Server on stable power | * [ ] Server on stable power | ||
| Line 211: | Line 221: | ||
| ==== Rollback Readiness ==== | ==== Rollback Readiness ==== | ||
| - | |||
| * [ ] Current firmware versions recorded in change ticket | * [ ] Current firmware versions recorded in change ticket | ||
| * [ ] Previous firmware .fwpkg saved | * [ ] Previous firmware .fwpkg saved | ||
| Line 220: | Line 229: | ||
| ===== 5. Best Practices ===== | ===== 5. Best Practices ===== | ||
| - | ^ # ^ Practice | + | ^ # ^ Practice |
| - | | 1 | Always run '' | + | | 1 |
| - | | 2 | Back up VBIOS before flashing | + | | 2 |
| - | | 3 | Stop all GPU processes including '' | + | | 3 |
| - | | 4 | Never interrupt mid-flash | + | | 4 |
| - | | 5 | Verify the package checksum | + | | 5 |
| - | | 6 | Wait for BackgroundCopyStatus: | + | | 6 |
| - | | 7 | **Always AC-cycle after CommitImage** | + | | 7 |
| - | | 8 | Use a wired management network | + | | 8 |
| - | | 9 | Disable ForceUpdate immediately after downgrade | + | | 9 |
| - | | 10 | Update BMC first, everything else after | Subsequent components depend on the new BMC Redfish schema | + | | 10 | Update BMC first, everything else after | Subsequent components depend on the new BMC Redfish schema |
| - | | 11 | Validate before restoring workloads | + | | 11 | Validate before restoring workloads |
| - | | 12 | Document pre and post versions | + | | 12 | Document pre and post versions |
| ---- | ---- | ||
| Line 251: | Line 260: | ||
| Copy full output to your change ticket and fill in: | Copy full output to your change ticket and fill in: | ||
| - | ^ Component | + | ^ Component |
| - | | BMC | | + | | BMC | |
| - | | SBIOS | | + | | SBIOS | |
| - | | CPLD | | + | | CPLD |
| - | | VBIOS | | + | | VBIOS | |
| - | | NVSwitch FW | | + | | NVSwitch FW | |
| - | | EROT FW | | + | | EROT FW | |
| - | | FPGA FW | | + | | FPGA FW | |
| - | | PSU | | + | | PSU | |
| - | | NIC | | + | | NIC | |
| - | | TPM | | + | | TPM | |
| - | | NVMe | | + | | NVMe |
| ==== Step 1.2 — Verify GPU Health ==== | ==== Step 1.2 — Verify GPU Health ==== | ||
| Line 317: | Line 326: | ||
| ===== 7. Phase 2 — Firmware Upgrade Execution ===== | ===== 7. Phase 2 — Firmware Upgrade Execution ===== | ||
| - | <note warning> **Strict ordering — do not deviate.** BMC first, then motherboard tray, then GPU tray, then poll, then chassis power cycle, then CommitImage, | + | <note warning> |
| + | **Strict ordering — do not deviate.** BMC first, then motherboard tray, then GPU tray, then poll, then chassis power cycle, then CommitImage, | ||
| + | </ | ||
| ==== 7.1 DGX / HGX — nvfwupd ==== | ==== 7.1 DGX / HGX — nvfwupd ==== | ||
| === Step 2.1 — Update BMC === | === Step 2.1 — Update BMC === | ||
| + | |||
| + | Use the documented one-shot BMC selector. Do not hand-enumerate child paths — let nvfwupd resolve them from the package. | ||
| <code bash> | <code bash> | ||
| Line 327: | Line 340: | ||
| { | { | ||
| " | " | ||
| - | "/ | + | "/ |
| ] | ] | ||
| } | } | ||
| Line 342: | Line 355: | ||
| sleep 180 | sleep 180 | ||
| ipmitool -I lanplus -H < | ipmitool -I lanplus -H < | ||
| - | nvfwupd -t ip=< | + | |
| + | # Pin the recheck to the BMC Redfish path (not a substring match on " | ||
| + | # which also matches HGX_FW_BMC_* aggregate entries on some packages). | ||
| + | nvfwupd -t ip=< | ||
| + | | ||
| </ | </ | ||
| Line 348: | Line 365: | ||
| === Step 2.3 — Update Motherboard Tray (SBIOS, CPLD, PSU) === | === Step 2.3 — Update Motherboard Tray (SBIOS, CPLD, PSU) === | ||
| + | |||
| + | The motherboard tray uses the **empty selector** ('' | ||
| <code bash> | <code bash> | ||
| cat > selector_mb.json <<' | cat > selector_mb.json <<' | ||
| - | { | + | {} |
| - | " | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | ] | + | |
| - | } | + | |
| EOF | EOF | ||
| Line 366: | Line 376: | ||
| update_fw -p nvfw_DGX_< | update_fw -p nvfw_DGX_< | ||
| </ | </ | ||
| + | |||
| + | Cross-check beforehand with '' | ||
| === Step 2.4 — Update GPU Tray (VBIOS, NVSwitch, EROT, FPGA) === | === Step 2.4 — Update GPU Tray (VBIOS, NVSwitch, EROT, FPGA) === | ||
| + | |||
| + | The GPU tray uses the **HGX_0 one-shot selector**, which targets the entire GPU baseboard inventory in a single call. The BMC fans out internally to all GPUs, NVSwitch chips, EROT, and FPGA — do not enumerate per-GPU paths. | ||
| <code bash> | <code bash> | ||
| Line 373: | Line 387: | ||
| { | { | ||
| " | " | ||
| - | "/ | + | "/ |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| ] | ] | ||
| } | } | ||
| Line 394: | Line 395: | ||
| update_fw -p nvfw_DGX-HGX-H100-H200x8_< | update_fw -p nvfw_DGX-HGX-H100-H200x8_< | ||
| </ | </ | ||
| + | |||
| + | Verify the exact child paths your platform exposes with '' | ||
| === Step 2.5 — Wait for Background Copy to Complete === | === Step 2.5 — Wait for Background Copy to Complete === | ||
| Line 428: | Line 431: | ||
| curl -k -u < | curl -k -u < | ||
| -H " | -H " | ||
| - | -d ' | + | -d ' |
| https://< | https://< | ||
| Line 434: | Line 437: | ||
| curl -k -u < | curl -k -u < | ||
| -H " | -H " | ||
| - | -d ' | + | -d ' |
| https://< | https://< | ||
| </ | </ | ||
| Line 440: | Line 443: | ||
| === Step 2.9 — AC Power Cycle (MANDATORY) === | === Step 2.9 — AC Power Cycle (MANDATORY) === | ||
| - | <note warning> This step is **non-negotiable** for EROT and CPLD activation. A chassis power cycle is **not** sufficient. </ | + | <note warning> |
| + | This step is **non-negotiable** for EROT and CPLD activation. A chassis power cycle is **not** sufficient. | ||
| + | </ | ||
| - Power off the host via OS shutdown or '' | - Power off the host via OS shutdown or '' | ||
| Line 448: | Line 453: | ||
| - Wait 5–10 minutes for full boot. | - Wait 5–10 minutes for full boot. | ||
| - | === Step 2.10 — Update NIC, TPM, NVMe (and ConnectX-7/ | + | === Step 2.10 — Update NIC (mlxfwmanager), TPM, NVMe === |
| These components are independent of the GPU/EROT activation chain and are updated last. | These components are independent of the GPU/EROT activation chain and are updated last. | ||
| - | **Hopper (DGX/HGX H100/H200):** | + | **Hopper (DGX/HGX H100/ |
| <code bash> | <code bash> | ||
| - | cat > selector_periph.json <<' | + | # Inventory and current versions on the host OS |
| - | { | + | sudo mlxfwmanager --query |
| - | " | + | |
| - | | + | # Apply the bundled MFT firmware image (matches your ConnectX SKU) |
| - | "/ | + | sudo mlxfwmanager -u -i fw-ConnectX7-rel-< |
| - | " | + | |
| - | "/ | + | # Re-query to confirm the new FW Version is present and "Status: Done" |
| - | ] | + | sudo mlxfwmanager --query |
| - | } | + | </code> |
| - | EOF | + | |
| + | A subsequent chassis power cycle (or PCIe FLR via '' | ||
| + | |||
| + | TPM and NVMe remain BMC-driven on Hopper. Confirm the exact inventory paths on your chassis with '' | ||
| + | |||
| + | <code bash> | ||
| + | # After confirming the actual inventory names on your chassis: | ||
| + | nvfwupd -t ip=< | ||
| + | show_version | grep -Ei ' | ||
| + | # Then build a selector_tpm_nvme.json with the paths returned above and: | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | update_fw -p nvfw_DGX_< | + | update_fw -p nvfw_DGX_< |
| </ | </ | ||
| Line 491: | Line 505: | ||
| Then host reboot: | Then host reboot: | ||
| - | |||
| <code bash> | <code bash> | ||
| ipmitool -I lanplus -H < | ipmitool -I lanplus -H < | ||
| </ | </ | ||
| - | ==== 7.1B Blackwell — GB200 / B200 Additional Steps ==== | + | ==== 7.1B Blackwell — GB200 / B200 (Mission Control flow) ==== |
| - | Blackwell systems insert | + | <note important> |
| + | This subsection summarises | ||
| + | </ | ||
| - | === Step 2.0a — Update HMC (GB200 only) === | + | Blackwell delivers two top-level packages plus the NVLink Switch tray package: |
| - | The HMC manages | + | ^ Package name ^ Selector JSON |
| + | | '' | ||
| + | | '' | ||
| + | | '' | ||
| + | |||
| + | **Ordering (per Mission Control): | ||
| + | |||
| + | - **B1.** Push '' | ||
| + | - **B2.** '' | ||
| + | - **B3.** **AC power cycle between BMC and HMC updates** (rack PDU cycle for NVL72; chassis cord pull for standalone B200). Wait 5 min for standby drain. This is required so the new chassis BMC re-enumerates the HMC under the updated Redfish schema before Compute_Full is pushed. | ||
| + | - **B4.** Push '' | ||
| + | - **B5.** '' | ||
| + | - **B6.** Push the NVLink Switch tray package from the rack-level management host against the NVLink Switch tray BMC (not the compute tray BMC). | ||
| + | - **B7.** Final **rack-level AC cycle** (Step 2.9 / 2.9b) to activate EROT and CPLD on every tray. | ||
| + | |||
| + | === Step B1 — Push BMC_Full to the chassis BMC === | ||
| + | |||
| + | The Mission Control selector JSON is a one-shot; do not enumerate child components. | ||
| <code bash> | <code bash> | ||
| - | cat > selector_hmc.json <<' | + | cat > BMC_Full.json <<' |
| { | { | ||
| " | " | ||
| - | "/ | + | "/ |
| ] | ] | ||
| } | } | ||
| EOF | EOF | ||
| - | nvfwupd -t ip=<HMC_IP> user=< | + | nvfwupd -t ip=<BMC_IP> user=< |
| - | update_fw -p nvfw_GB200_< | + | update_fw -p nvfw_BMC_P4978.fwpkg -y -s BMC_Full.json |
| </ | </ | ||
| - | Reset HMC and confirm: | + | === Step B2 — Activate the chassis BMC payload === |
| + | |||
| + | Blackwell uses the documented Mission Control activation verb, not '' | ||
| <code bash> | <code bash> | ||
| - | curl -k -u < | + | nvfwupd -t ip=<BMC_IP> user=< |
| - | https://< | + | activate_fw |
| - | -d ' | + | |
| - | sleep 180 | + | |
| - | nvfwupd -t ip=<HMC_IP> user=< | + | |
| </ | </ | ||
| - | === Step 2.2b — Update Grace CPU Firmware | + | === Step B3 — AC cycle between BMC and HMC updates === |
| + | |||
| + | This is **not** the final EROT/CPLD AC cycle — it sits between the two halves of the bundle and is mandatory. | ||
| + | |||
| + | - Standalone B200: chassis cord pull, 5 min drain. | ||
| + | - GB200 NVL72: rack-level PDU cycle (both A-side and B-side feeds), 10 min drain, 15 min boot. | ||
| + | - Confirm the chassis BMC comes back on the new firmware before proceeding: '' | ||
| + | |||
| + | === Step B4 — Push Compute_Full | ||
| - | Bundled with the SBIOS package; activates on chassis power cycle. | + | '' |
| <code bash> | <code bash> | ||
| - | cat > selector_grace.json <<' | + | cat > Compute_Full.json <<' |
| { | { | ||
| " | " | ||
| - | "/ | + | "/ |
| - | "/ | + | |
| ] | ] | ||
| } | } | ||
| Line 542: | Line 580: | ||
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | update_fw -p nvfw_GB200_< | + | update_fw -p nvfw_Compute_P4978.fwpkg -y -s Compute_Full.json |
| </ | </ | ||
| - | === Step 2.4b — Update NVLink Switch Tray (GB200 NVL72) | + | === Step B5 — Activate Compute_Full |
| - | Run from the rack-level management host against the NVLink Switch tray BMC, **not** | + | <code bash> |
| + | nvfwupd -t ip=< | ||
| + | activate_fw -c RF_AUX_PWR_CYCLE | ||
| + | </ | ||
| + | |||
| + | === Step B6 — NVLink Switch tray (GB200 NVL72) === | ||
| + | |||
| + | Run from the rack-level management host against the NVLink Switch tray BMC. Confirm | ||
| <code bash> | <code bash> | ||
| - | cat > selector_nvlsw.json | + | nvfwupd -t ip=< |
| - | { | + | |
| - | | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | "/ | + | |
| - | ] | + | |
| - | } | + | |
| - | EOF | + | |
| nvfwupd -t ip=< | nvfwupd -t ip=< | ||
| - | | + | |
| </ | </ | ||
| - | === Step 2.9b — Rack AC Cycle (GB200 NVL72) === | + | === Step B7 — Final rack-level |
| - | + | ||
| - | For GB200 NVL72 the AC step is a **rack-level PDU cycle**, not a single chassis cord pull. Coordinate with DC operations: | + | |
| + | For GB200 NVL72 the final AC step is a **rack-level PDU cycle**, not a single chassis cord pull. Coordinate with DC operations: | ||
| - Schedule a full rack power-down window. | - Schedule a full rack power-down window. | ||
| - Open both A-side and B-side PDU feeds (the rack is fed redundantly — pulling only one side leaves standby rails energised). | - Open both A-side and B-side PDU feeds (the rack is fed redundantly — pulling only one side leaves standby rails energised). | ||
| Line 634: | Line 671: | ||
| ==== Step 3.6 — Record Final State ==== | ==== Step 3.6 — Record Final State ==== | ||
| - | ^ Component | + | ^ Component |
| - | | BMC | | + | | BMC | |
| - | | SBIOS | | + | | SBIOS | |
| - | | CPLD | | + | | CPLD |
| - | | VBIOS | | + | | VBIOS | |
| - | | NVSwitch FW | | + | | NVSwitch FW | |
| - | | EROT FW | | + | | EROT FW | |
| - | | FPGA FW | | + | | FPGA FW | |
| - | | PSU | | + | | PSU | |
| - | | NIC | | + | | NIC | |
| - | | TPM | | + | | TPM | |
| - | | NVMe | | + | | NVMe |
| ==== Step 3.7 — Capture Post-Upgrade SEL and Restart Services ==== | ==== Step 3.7 — Capture Post-Upgrade SEL and Restart Services ==== | ||
| Line 683: | Line 720: | ||
| ===== 10. Troubleshooting ===== | ===== 10. Troubleshooting ===== | ||
| - | ^ Symptom | + | ^ Symptom |
| - | | '' | + | | '' |
| - | | '' | + | | '' |
| - | | VBIOS update blocked — GPU activity | + | | VBIOS update blocked — GPU activity |
| - | | BackgroundCopyStatus: | + | | BackgroundCopyStatus: |
| - | | EROT/CPLD version unchanged after reboot | + | | EROT/CPLD version unchanged after reboot |
| - | | Server does not boot after AC cycle | Failed flash | BMC serial console + '' | + | | Server does not boot after AC cycle |
| - | | GPU missing after upgrade | + | | GPU missing after upgrade |
| - | | NVLink ports Inactive | + | | NVLink ports Inactive |
| - | | Firmware unchanged after reboot | + | | Firmware unchanged after reboot |
| - | | '' | + | | '' |
| - | | BMC 1.1.x → 1.2.x update bricks Redfish | + | | BMC unreachable after a version-hop skip | Skipped the v1.1.3 intermediate hop (v1.0.0/v1.1.1 → v24.09.1 direct) |
| - | | PSU update refuses to start | PSU redundancy degraded | + | | PSU update refuses to start |
| - | | Fabric Manager fails to start post-upgrade | NVSwitch FW / driver mismatch | + | | Fabric Manager fails to start post-upgrade | NVSwitch FW / driver mismatch |
| - | | New SEL entries after upgrade | + | | New SEL entries after upgrade |
| - | ==== BMC Recovery (Bricked | + | ==== BMC Recovery (Unreachable |
| - | If the BMC is unreachable after an update (no Redfish, no IPMI, no SSH): | + | If the BMC is unreachable after an update (no Redfish, no IPMI, no SSH), **do not attempt field recovery from a USB image**. The BMC recovery image, layout, and procedure are NVIDIA-internal and SKU-specific; |
| - | | + | Instead: |
| - | - At the BMC u-boot/ | + | |
| - | - Prepare a **USB flash drive (FAT32)** containing | + | |
| - | - Insert the USB into the BMC's dedicated recovery USB port (rear panel, labelled '' | + | - **Capture the BMC serial console** |
| - | - From the recovery shell: '' | + | - Provide: chassis SKU, current |
| - | - Allow 5 min for the BMC to reboot; then re-establish IPMI and re-run | + | - Do not pull the chassis from the rack or attempt a factory reset until NVIDIA |
| - | - Only proceed to NVIDIA | + | |
| ---- | ---- | ||
| Line 715: | Line 751: | ||
| ===== 11. Risk Register ===== | ===== 11. Risk Register ===== | ||
| - | ^ Risk ^ Likelihood | + | ^ Risk |
| - | | Wrong firmware file flashed | + | | Wrong firmware file flashed |
| - | | Power loss during flash | Low | Critical | + | | Power loss during flash |
| - | | Workload active during upgrade | + | | Workload active during upgrade |
| - | | Driver / firmware mismatch | + | | Driver / firmware mismatch |
| - | | Network loss to BMC mid-update | + | | Network loss to BMC mid-update |
| - | | Previous .fwpkg unavailable | + | | Previous .fwpkg unavailable |
| - | | ForceUpdate left enabled | + | | ForceUpdate left enabled |
| - | | AC cycle skipped — EROT/CPLD not active | + | | AC cycle skipped — EROT/CPLD not active |
| - | | BMC version-hop skipped (pre-1.1.3 → 1.2.x) | Medium | + | | BMC version-hop skipped (v1.0.0/v1.1.1 → v24.09.1 direct) | Medium | Critical | Stage v1.0.0/v1.1.1 → v1.1.3 → v24.09.1; direct skip requires a BMC factory reset to recover |
| - | | Reboot before background copy completes | + | | Reboot before background copy completes |
| ---- | ---- | ||
| Line 736: | Line 772: | ||
| ---- | ---- | ||
| - | |||
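
Step 2.5 and Best Practice 6 both require waiting for BackgroundCopyStatus to report **Completed** before any power cycle. The wait loop can be sketched as below, under stated assumptions: `poll_until_completed` is a hypothetical helper, not an nvfwupd or Redfish feature, and the command that actually fetches the status is supplied by the caller, because the exact query is not reproduced in this revision view. The sketch only assumes that the command's output contains the word `Completed` once the copy has finished.

```shell
# Hypothetical helper (not part of nvfwupd): re-run a caller-supplied status
# command until its output contains "Completed", or give up after N attempts.
# Usage: poll_until_completed <max_attempts> <sleep_seconds> <command> [args...]
poll_until_completed() {
    max_attempts=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # A failing status command is treated as "not ready yet", not as fatal,
        # since the BMC may briefly refuse connections while it is busy.
        output=$("$@" 2>/dev/null) || output=""
        case $output in
            *Completed*)
                echo "background copy completed after $attempt poll(s)"
                return 0
                ;;
        esac
        # Sleep only between attempts, not after the final one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$interval"
        fi
        attempt=$((attempt + 1))
    done
    echo "timed out after $max_attempts poll(s); do not power cycle yet" >&2
    return 1
}
```

A caller would pass the status query from Step 2.5 as the command, for example `poll_until_completed 40 30 ./check_copy_status.sh`, where `check_copy_status.sh` is a placeholder name for a script wrapping that query. A non-zero return means the copy was never seen to complete, so the chassis power cycle in Step 2.6 should not be started.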