NOTE: Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing.
This runbook covers the end-to-end firmware upgrade process for NVIDIA data center GPUs. It is self-contained — do not consult external documentation mid-upgrade.
Firmware is low-level software embedded in the GPU hardware itself. It controls boot behaviour, power management, and hardware communication — below the OS driver layer.
Components updated in a full DGX/HGX firmware upgrade:
| Component | What It Controls |
|---|---|
| VBIOS | GPU boot, clocks, power limits, PCIe config |
| NVSwitch | GPU-to-GPU interconnect (NVLink fabric) |
| EROT | Hardware root of trust / secure boot |
| FPGA | Power management and control plane |
| BMC | Out-of-band server management |
| BIOS / UEFI | Motherboard initialisation |
| PSU | Power supply unit firmware |

Minimum driver requirements:

| GPU Architecture | Minimum Driver |
|---|---|
| Ampere (A100, A30) | R450 |
| Hopper (H100, H200) | R525 |
| Ada Lovelace (L40, L40S) | R525 |

Tools used in this runbook:

| Tool | Purpose | Where It Runs |
|---|---|---|
| nvfwupd | Full firmware bundle update for DGX/HGX via BMC Redfish API | Management host (remote) |
| NVFlash | VBIOS-only flash for standalone PCIe GPUs | GPU host (root required) |
| nvidia-smi | Query GPU health, firmware version, running processes | GPU host |
| nvsm | DGX system-wide health check | GPU host |
| ipmitool | Remote power cycle and BMC reset | Management host |

Update tool by platform:

| Platform | Tool |
|---|---|
| DGX H100 / H200 | nvfwupd |
| HGX H100 / H200 | nvfwupd |
| DGX A100 | Firmware Update Container |
| Standalone PCIe GPU (A100, RTX, etc.) | NVFlash |
# --- nvfwupd ---
# Show installed vs package versions
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg>
# Update all firmware components
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y
# Update specific components only (with selector JSON)
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y -s selector.json
# Force downgrade controls
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable

# --- NVFlash (standalone PCIe GPUs only) ---
sudo nvflash --list                                         # List all GPUs and indices
sudo nvflash --version                                      # Show VBIOS version (GPU 0)
sudo nvflash --index=0 --save backup_$(date +%Y%m%d).rom    # Backup current VBIOS
sudo nvflash --index=0 new_firmware.rom                     # Flash new VBIOS to GPU 0

# --- nvidia-smi ---
nvidia-smi -L                                                        # List all GPUs
nvidia-smi -q                                                        # Full GPU info
nvidia-smi -q | grep "VBIOS Version"                                 # VBIOS version only
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv    # Active GPU processes
nvidia-smi nvlink -s -i 0                                            # NVLink status for GPU 0

# --- ipmitool ---
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle    # Cold reboot
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold          # BMC-only reset
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info                # Confirm BMC online
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> sel list               # BMC event log

# --- nvsm ---
nvsm show health    # System-wide health (DGX only)
<note important> Complete every item below before starting. Do not proceed if any item cannot be confirmed. </note>
- nvidia-smi is not running; even passive queries block VBIOS updates
- System health confirmed via nvsm show health or the BMC dashboard

| # | Practice | Why It Matters |
|---|---|---|
| 1 | Always run show_version first | Confirms exactly what will change before any flashing begins |
| 2 | Back up VBIOS before flashing | No backup = no rollback path for standalone GPU upgrades |
| 3 | Stop all GPU processes including nvidia-smi | Active processes block or corrupt the VBIOS update |
| 4 | Never interrupt mid-flash | Closing terminal or losing power mid-flash bricks the GPU |
| 5 | Verify the package checksum | A corrupted .fwpkg silently fails or causes a partial flash |
| 6 | Wait for BackgroundCopyStatus: Completed | Rebooting early means the firmware never actually activates |
| 7 | Use a wired management network | VPN or WiFi drops mid-update leave the flash incomplete |
| 8 | Disable ForceUpdate immediately after downgrade | Leaving it enabled is a security risk |
| 9 | Validate before restoring workloads | Confirms the upgrade succeeded before returning node to use |
| 10 | Document pre and post versions in change ticket | Required for change management and future rollback reference |
# DGX / HGX systems
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version
# Standalone PCIe GPU
nvidia-smi -q | grep "VBIOS Version"
Copy the full output to your change ticket. Fill in the table:
| Component | Pre-Upgrade Version | Target Version |
|---|---|---|
| VBIOS | | |
| NVSwitch FW | | |
| EROT FW | | |
| FPGA FW | | |
| BMC | | |
| System BIOS | | |
nvidia-smi -L       # All expected GPUs must appear
nvsm show health    # All components must show OK (DGX only)
<note warning> If any GPU is missing or any component is degraded — stop. Investigate before proceeding. </note>
Stop all of the following before continuing:
- Anything that runs nvidia-smi on a schedule
- Kubernetes workloads on the node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Confirm no processes remain:
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
# Only the CSV header row may be printed. Every row below it is a process still using a GPU
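With `--format=csv`, nvidia-smi prints a header row even when nothing is running, so "no processes" in practice means a header with no rows beneath it. The check can be scripted along these lines. This is a minimal sketch: the helper names `count_gpu_processes` and `gpu_is_idle` are hypothetical and not part of any NVIDIA tool.

```shell
# Count process rows in the CSV produced by
#   nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
# Reads the CSV on stdin; line 1 is treated as the header row.
count_gpu_processes() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Succeeds only when no process rows are present.
gpu_is_idle() {
    [ "$(count_gpu_processes)" -eq 0 ]
}

# Example gate (run on the GPU host):
#   nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv | gpu_is_idle \
#     || { echo "GPU still in use; do not flash" >&2; exit 1; }
```

Reading from stdin keeps the helpers usable against saved output pasted from a change ticket as well as against a live query.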
<note> Skip this step for DGX/HGX systems managed by nvfwupd. </note>
sudo nvflash --index=0 --save gpu0_backup_$(date +%Y%m%d).rom
Store the .rom file on a separate host. This is your only rollback option if the flash fails.
sha256sum nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg
Compare the output against the checksum in the NVIDIA release notes. If they do not match — re-download the package before proceeding.
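Comparing 64 hex characters by eye is error-prone, so the comparison can be left to the shell. The sketch below assumes GNU `sha256sum` is available on the management host. `verify_sha256` is a hypothetical helper name, and the expected value must be copied from the release notes by the operator.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 when the file's SHA-256 matches EXPECTED_HEX (case-insensitive),
# 1 on mismatch, 2 when the file cannot be read.
verify_sha256() {
    local file=$1 expected=$2 actual
    [ -r "$file" ] || { echo "Cannot read $file" >&2; return 2; }
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Example:
#   verify_sha256 nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg <checksum from release notes> || exit 1
```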
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg>
Every component with a version mismatch will be updated. Confirm this matches your intent before proceeding.
<note warning> Do not close your terminal, disconnect from the BMC, or allow the server to lose power once flashing begins. Any interruption mid-flash can permanently damage hardware. </note>
Updates VBIOS, NVSwitch, EROT, and FPGA firmware for all GPUs in the system.
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
  update_fw -p nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg -y
To update specific components only, create a selector JSON and pass it with -s:
{ "Components": ["VBIOS"] }
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
  update_fw -p <pkg.fwpkg> -y -s selector.json
The firmware is written to the GPU's flash storage in the background. You must wait for this background copy to finish before rebooting.
curl -k -u <USER>:<PASS> \
  https://<BMC_IP>/redfish/v1/Systems/HGX_Baseboard_0/Oem/Nvidia/BackgroundCopyStatus
Poll every 60 seconds. Only proceed when you see:
"BackgroundCopyStatus": "Completed"
If you see Failed — do not reboot. See the Troubleshooting section.
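The polling rule above can be wrapped in a small loop so that nobody reboots on a misread status. This is a sketch under stated assumptions: `extract_copy_status` and `wait_for_copy` are hypothetical helper names, the 120-poll ceiling is an arbitrary illustrative timeout, and the JSON key and value are assumed to sit on one line of the response. Only `Completed` and `Failed` are treated specially; any other value just triggers another poll.

```shell
# Pull the BackgroundCopyStatus value out of Redfish JSON read on stdin.
extract_copy_status() {
    sed -n 's/.*"BackgroundCopyStatus"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# wait_for_copy FETCH_CMD [INTERVAL_SECONDS] [MAX_POLLS]
# FETCH_CMD is the name of a command or function that prints the Redfish response.
# Returns 0 on Completed, 1 on Failed (do not reboot), 2 when MAX_POLLS is exhausted.
wait_for_copy() {
    local fetch=$1 interval=${2:-60} max=${3:-120} i=0 status
    while [ "$i" -lt "$max" ]; do
        status=$($fetch | extract_copy_status)
        echo "poll $((i + 1)): ${status:-unknown}"
        case $status in
            Completed) return 0 ;;
            Failed)    return 1 ;;
        esac
        i=$((i + 1))
        sleep "$interval"
    done
    return 2
}

# Example, using the URL given in this runbook:
#   fetch_status() {
#     curl -sk -u "<USER>:<PASS>" \
#       "https://<BMC_IP>/redfish/v1/Systems/HGX_Baseboard_0/Oem/Nvidia/BackgroundCopyStatus"
#   }
#   wait_for_copy fetch_status 60 && echo "Safe to power cycle"
```

Passing the fetch command by name keeps the loop testable without a BMC, and a non-zero return code keeps any follow-on reboot command from running.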
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle
Wait 5–10 minutes for the server to fully boot.
After the server is back online, update BMC, BIOS, and PSU firmware:
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
  update_fw -p nvfw_DGX_<version>.fwpkg -y -s update_bmc.json
<note> The motherboard tray uses a different .fwpkg file than the GPU tray. Check the NVIDIA release notes for the correct package name for your system. </note>
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold
Wait 2–3 minutes, then confirm the BMC is back online:
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
# Confirm all modules are unloaded (output must be empty)
lsmod | grep nvidia
# Flash GPU at index 0; increment --index for additional GPUs (1, 2, ...)
sudo nvflash --index=0 new_vbios_firmware.rom
# NVFlash will prompt: review the current vs new version shown, then type "yes"
sudo reboot
# DGX / HGX
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version
# Standalone PCIe
nvidia-smi -q | grep "VBIOS Version"
Every updated component must now show the new version.
nvidia-smi -L
All expected GPUs must appear with correct model names. No Unknown or error entries.
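The GPU count can be asserted rather than eyeballed. The sketch below assumes each device appears in `nvidia-smi -L` output as a line beginning `GPU <index>:`. `check_gpu_count` is a hypothetical helper, and the expected count (8 in the example) must be set per system. It checks presence only, so model names still need a visual review.

```shell
# check_gpu_count EXPECTED
# Reads `nvidia-smi -L` output on stdin and compares the number of
# "GPU <n>: ..." lines with the expected count.
check_gpu_count() {
    local expected=$1 found
    found=$(grep -c '^GPU [0-9][0-9]*:' || true)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found GPUs listed"
    else
        echo "FAIL: expected $expected GPUs, found $found" >&2
        return 1
    fi
}

# Example:
#   nvidia-smi -L | check_gpu_count 8
```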
nvidia-smi -q       # Check per-GPU: VBIOS version, ECC mode, temperature, power draw
nvsm show health    # DGX only; all components must show OK
| Field | Expected |
|---|---|
| VBIOS Version | Matches new package version |
| ECC Mode | Enabled |
| Temperature | < 50°C at idle |
| Power Draw | Within TDP spec |
nvidia-smi nvlink -s -i 0
All NVLink ports must show Active. Any inactive port must be investigated before restoring workloads.
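A quick filter can surface inactive links across all GPUs. This is a sketch that assumes an inactive port is reported on its `Link <n>:` line with the word "inactive" (matched case-insensitively). Check that assumption against real output from your driver version before relying on it. `list_inactive_links` and `nvlink_all_active` are hypothetical helper names.

```shell
# Print every link line on stdin that reports an inactive state.
list_inactive_links() {
    grep -i 'link.*inactive' || true
}

# Succeeds only when no inactive link lines are found.
nvlink_all_active() {
    local bad
    bad=$(list_inactive_links)
    if [ -n "$bad" ]; then
        printf 'Inactive NVLink ports:\n%s\n' "$bad" >&2
        return 1
    fi
}

# Example (omit -i to cover every GPU):
#   nvidia-smi nvlink -s | nvlink_all_active || echo "Investigate before restoring workloads"
```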
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
Expected: True and the correct GPU model name.
| Component | Pre-Upgrade | Post-Upgrade | Status |
|---|---|---|---|
| VBIOS | | | Pass / Fail |
| NVSwitch FW | | | Pass / Fail |
| EROT FW | | | Pass / Fail |
| FPGA FW | | | Pass / Fail |
| BMC | | | Pass / Fail |
| System BIOS | | | Pass / Fail |
Only proceed after all checks above pass:
kubectl uncordon <node-name>
Monitor the system for 30–60 minutes after restoring workloads.
<note warning> Only downgrade if the new firmware causes a confirmed regression. Raise an NVIDIA Support case before attempting. </note>
# Unload driver
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
# Flash the saved backup
sudo nvflash --index=0 gpu0_backup_<date>.rom
# Reboot
sudo reboot
Then re-run all Phase 3 validation steps.
# Step 1: Enable the ForceUpdate flag (required to allow downgrade)
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable
# Step 2: Confirm the flag is active
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
# Step 3: Flash the previous firmware version
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
  update_fw -p nvfw_DGX-HGX-H100-H200x8_<old_version>.fwpkg -y
# Step 4: Wait for BackgroundCopyStatus: Completed, then cold reboot (same as Steps 2.2–2.3)
# Step 5: MANDATORY. Disable ForceUpdate immediately after
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable
# Step 6: Confirm it is disabled
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
Then re-run all Phase 3 validation steps.
| Symptom | Cause | Action |
|---|---|---|
| “Invalid firmware file” error | Wrong .fwpkg for the component | GPU tray = HGX package; Motherboard tray = DGX package. Re-run show_version to confirm |
| VBIOS update blocked: GPU activity | Process still using the GPU | Run nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv and kill all listed PIDs. Drain the K8s node. Retry the update |
| BackgroundCopyStatus: Failed | Wrong firmware file or transient BMC error | Do not reboot. Retry the update command. Contact NVIDIA Support if retry fails |
| Server does not boot after reboot | Failed or incomplete VBIOS flash | Connect to BMC serial console. Check ipmitool sel list. Contact NVIDIA Support — do not power cycle again |
| GPU missing after upgrade | PCIe or driver issue | Check PCIe with lspci. Check dmesg for Xid errors. Reload driver. Run nvsm show health |
| NVLink ports show Inactive | NVSwitch FW issue or fabric manager down | Cold reboot. Restart nvidia-fabricmanager service. Contact NVIDIA Support if still inactive |
| Firmware version unchanged after reboot | Rebooted before background copy finished | Re-run the update. Wait for BackgroundCopyStatus: Completed. Reboot again |
| “Downgrade not allowed” | ForceUpdate flag not enabled | Follow Section 9.2 — enable ForceUpdate before running the downgrade command |

Risks and mitigations:

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Wrong firmware file flashed | Medium | Critical — GPU permanently bricked | Run show_version first; verify filename matches GPU model; check checksum |
| Power loss during flash | Low | Critical — GPU permanently bricked | Stable power only; never upgrade during power maintenance work |
| Workload active during upgrade | Medium | High: update fails or GPU degraded | Verify that nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv lists no processes before starting |
| Driver / firmware version mismatch | Medium | High — GPU fails to initialise | Check minimum driver requirements table before upgrading |
| Network loss to BMC mid-update | Low | High — incomplete flash | Use wired network; avoid VPN or WiFi for BMC management traffic |
| Previous firmware package unavailable | Medium | High — no rollback path | Save previous .fwpkg before every upgrade |
| ForceUpdate flag left enabled | Low | Medium — security risk | Always disable and verify immediately after any downgrade |
| Reboot before background copy completes | Medium | High — firmware does not activate | Always poll BackgroundCopyStatus until Completed before rebooting |