NOTE: Flashing the wrong firmware file or interrupting mid-flash can permanently brick a GPU. Read the full runbook before executing.
This runbook covers the end-to-end firmware upgrade process for NVIDIA data center GPUs. It is self-contained — do not consult external documentation mid-upgrade.
Firmware is low-level software embedded in the GPU hardware itself. It controls boot behaviour, power management, and hardware communication — below the OS driver layer.
Components updated in a full DGX/HGX firmware upgrade:
| Component | What It Controls |
|---|---|
| VBIOS | GPU boot, clocks, power limits, PCIe config |
| NVSwitch | GPU-to-GPU interconnect (NVLink fabric) |
| EROT | Hardware root of trust / secure boot |
| FPGA | Power management and control plane |
| BMC | Out-of-band server management |
| BIOS / UEFI | Motherboard initialisation |
| PSU | Power supply unit firmware |

Minimum driver branch for each GPU architecture:

| GPU Architecture | Minimum Driver |
|---|---|
| Ampere (A100, A30) | R450 |
| Hopper (H100, H200) | R525 |
| Ada Lovelace (L40, L40S) | R525 |

Tools used in this runbook:

| Tool | Purpose | Where It Runs |
|---|---|---|
| nvfwupd | Full firmware bundle update for DGX/HGX via BMC Redfish API | Management host (remote) |
| NVFlash | VBIOS-only flash for standalone PCIe GPUs | GPU host (root required) |
| nvidia-smi | Query GPU health, firmware version, running processes | GPU host |
| nvsm | DGX system-wide health check | GPU host |
| ipmitool | Remote power cycle and BMC reset | Management host |

Tool to use for each platform:

| Platform | Tool |
|---|---|
| DGX H100 / H200 | nvfwupd |
| HGX H100 / H200 | nvfwupd |
| DGX A100 | Firmware Update Container |
| Standalone PCIe GPU (A100, RTX, etc.) | NVFlash |

    # --- nvfwupd ---
    # Show installed vs package versions
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg>
    # Update all firmware components
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y
    # Update specific components only (with selector JSON)
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> update_fw -p <pkg.fwpkg> -y -s selector.json
    # Force downgrade controls
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable

    # --- NVFlash (standalone PCIe GPUs only) ---
    sudo nvflash --list                                        # List all GPUs and indices
    sudo nvflash --version                                     # Show VBIOS version (GPU 0)
    sudo nvflash --index=0 --save backup_$(date +%Y%m%d).rom   # Backup current VBIOS
    sudo nvflash --index=0 new_firmware.rom                    # Flash new VBIOS to GPU 0

    # --- nvidia-smi ---
    nvidia-smi -L                                                       # List all GPUs
    nvidia-smi -q                                                       # Full GPU info
    nvidia-smi -q | grep "VBIOS Version"                                # VBIOS version only
    nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv   # Active GPU processes
    nvidia-smi nvlink -s -i 0                                           # NVLink status for GPU 0

    # --- ipmitool ---
    ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle   # Cold reboot
    ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold         # BMC-only reset
    ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info               # Confirm BMC online
    ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> sel list              # BMC event log

    # --- nvsm ---
    nvsm show health   # System-wide health (DGX only)

Complete every item below before starting. Do not proceed if any item cannot be confirmed.
- nvidia-smi is not running; even passive queries block VBIOS updates
- System health verified via nvsm show health or the BMC dashboard

| # | Practice | Why It Matters |
|---|---|---|
| 1 | Always run show_version first | Confirms exactly what will change before any flashing begins |
| 2 | Back up VBIOS before flashing | No backup = no rollback path for standalone GPU upgrades |
| 3 | Stop all GPU processes including nvidia-smi | Active processes block or corrupt the VBIOS update |
| 4 | Never interrupt mid-flash | Closing terminal or losing power mid-flash bricks the GPU |
| 5 | Verify the package checksum | A corrupted .fwpkg silently fails or causes a partial flash |
| 6 | Wait for BackgroundCopyStatus: Completed | Rebooting early means the firmware never actually activates |
| 7 | Use a wired management network | VPN or WiFi drops mid-update leave the flash incomplete |
| 8 | Disable ForceUpdate immediately after downgrade | Leaving it enabled is a security risk |
| 9 | Validate before restoring workloads | Confirms the upgrade succeeded before returning node to use |
| 10 | Document pre and post versions in change ticket | Required for change management and future rollback reference |

    # DGX / HGX systems
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version
    # Standalone PCIe GPU
    nvidia-smi -q | grep "VBIOS Version"

Copy the full output to your change ticket. Fill in the table:
| Component | Pre-Upgrade Version | Target Version |
|---|---|---|
| VBIOS | | |
| NVSwitch FW | | |
| EROT FW | | |
| FPGA FW | | |
| BMC | | |
| System BIOS | | |

    nvidia-smi -L       # All expected GPUs must appear
    nvsm show health    # All components must show OK (DGX only)

If any GPU is missing or any component is degraded — stop. Investigate before proceeding.
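
The GPU inventory check above can be turned into a hard gate. The following is a minimal sketch, not part of the original runbook: the function names `list_gpus` and `check_gpu_count` are hypothetical, and the expected GPU count is something you supply for your own system. It assumes each physical GPU appears in `nvidia-smi -L` output as a line beginning with `GPU <index>:`.

```shell
# list_gpus: thin wrapper so the check can be exercised without GPU hardware.
list_gpus() {
  nvidia-smi -L
}

# check_gpu_count EXPECTED
# Returns 0 only if exactly EXPECTED GPUs are listed, 1 on a count mismatch,
# and 2 if the listing command itself fails.
check_gpu_count() {
  expected="$1"
  gpu_list="$(list_gpus)" || { echo "ABORT: nvidia-smi -L failed" >&2; return 2; }
  # Count only top-level "GPU N:" lines; any indented sub-device lines are ignored.
  found="$(printf '%s\n' "$gpu_list" | awk '/^GPU [0-9]/ { n++ } END { print n + 0 }')"
  if [ "$found" -ne "$expected" ]; then
    echo "ABORT: expected $expected GPUs, found $found" >&2
    return 1
  fi
  echo "OK: $found GPUs visible"
}
```

For example, on a system that should expose eight GPUs, `check_gpu_count 8 || exit 1` stops a wrapper script before any flashing step is reached.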
Stop all of the following before continuing:
- Anything that runs nvidia-smi on a schedule

For Kubernetes nodes, drain the node:

    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Confirm no processes remain:

    nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
    # Output must be completely empty before continuing
    # (without noheader, a CSV header line is printed even when no processes are running)

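
A scripted version of this check is sketched below, under stated assumptions: the function names are hypothetical, and `noheader` is added to the `--format` option so that an idle system produces no output at all (plain `csv` prints a header line even when no processes exist).

```shell
# list_gpu_processes: wrapper around the query so it can be stubbed in tests.
list_gpu_processes() {
  nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
}

# require_idle_gpus
# Returns 0 if no compute processes are reported, 1 if any remain,
# and 2 if the query itself fails (so a failed query never counts as "idle").
require_idle_gpus() {
  procs="$(list_gpu_processes)" || { echo "ABORT: process query failed" >&2; return 2; }
  # Count non-blank lines; each one is a process still holding a GPU.
  count="$(printf '%s\n' "$procs" | awk 'NF { n++ } END { print n + 0 }')"
  if [ "$count" -gt 0 ]; then
    echo "ABORT: $count GPU process(es) still running:" >&2
    printf '%s\n' "$procs" >&2
    return 1
  fi
  echo "OK: no GPU compute processes found"
}
```

Treating a failed query as an error rather than as an empty result is deliberate: a missing or broken `nvidia-smi` should stop the upgrade, not silently pass the gate.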
Skip this step for DGX/HGX systems managed by nvfwupd.
sudo nvflash --index=0 --save gpu0_backup_$(date +%Y%m%d).rom
Store the .rom file on a separate host. This is your only rollback option if the flash fails.
sha256sum nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg
Compare the output against the checksum in the NVIDIA release notes. If they do not match — re-download the package before proceeding.
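
Comparing 64-character hashes by eye is error-prone, so the comparison can be scripted. This is a minimal sketch with a hypothetical helper name, `verify_checksum`; the expected value is whatever the release notes list for your package.

```shell
# verify_checksum FILE EXPECTED_SHA256
# Returns 0 if the SHA-256 of FILE equals EXPECTED_SHA256, otherwise 1.
verify_checksum() {
  file="$1"
  expected="$2"
  # sha256sum prints "<hash>  <filename>"; keep only the hash.
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  # Accept upper- or lower-case hex in the expected value.
  expected_lc="$(printf '%s' "$expected" | tr 'A-F' 'a-f')"
  if [ -n "$actual" ] && [ "$actual" = "$expected_lc" ]; then
    echo "OK: checksum matches for $file"
    return 0
  fi
  echo "MISMATCH: $file has $actual, expected $expected_lc" >&2
  return 1
}
```

Usage: `verify_checksum nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg <checksum from release notes> || exit 1`.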
nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version -p <pkg.fwpkg>
Every component with a version mismatch will be updated. Confirm this matches your intent before proceeding.
Do not close your terminal, disconnect from the BMC, or allow the server to lose power once flashing begins. Any interruption mid-flash can permanently damage hardware.
Updates VBIOS, NVSwitch, EROT, and FPGA firmware for all GPUs in the system.

    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
      update_fw -p nvfw_DGX-HGX-H100-H200x8_<version>.fwpkg -y

To update specific components only, create a selector JSON and pass it with -s:
{ "Components": ["VBIOS"] }

    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
      update_fw -p <pkg.fwpkg> -y -s selector.json

The firmware is written to the GPU's flash storage in the background. You must wait for this before rebooting.

    curl -k -u <USER>:<PASS> \
      https://<BMC_IP>/redfish/v1/Systems/HGX_Baseboard_0/Oem/Nvidia/BackgroundCopyStatus

Poll every 60 seconds. Only proceed when you see:
"BackgroundCopyStatus": "Completed"
If you see Failed — do not reboot. See the Troubleshooting section.
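
The polling rule above can be sketched as a small loop. This is an illustration rather than an NVIDIA-provided tool: the function names, the `BMC_IP`, `BMC_USER` and `BMC_PASS` environment variables, and the default limit of 90 polls are all assumptions. The endpoint is copied unchanged from this runbook, and the parser assumes the response contains a `"BackgroundCopyStatus": "<value>"` pair.

```shell
# parse_copy_status: read a Redfish JSON response on stdin and print the
# BackgroundCopyStatus value (empty output if the field is absent).
parse_copy_status() {
  sed -n 's/.*"BackgroundCopyStatus"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# fetch_copy_status: the curl call from this runbook, parameterised via
# environment variables (assumed names).
fetch_copy_status() {
  curl -sk -u "${BMC_USER}:${BMC_PASS}" \
    "https://${BMC_IP}/redfish/v1/Systems/HGX_Baseboard_0/Oem/Nvidia/BackgroundCopyStatus"
}

# wait_for_background_copy [INTERVAL_SECONDS] [MAX_POLLS]
# Returns 0 on Completed, 1 on Failed, 2 if MAX_POLLS is reached first.
wait_for_background_copy() {
  interval="${1:-60}"
  max_polls="${2:-90}"
  poll=1
  while [ "$poll" -le "$max_polls" ]; do
    status="$(fetch_copy_status | parse_copy_status)"
    echo "poll $poll/$max_polls: ${status:-no status returned}"
    case "$status" in
      Completed) return 0 ;;
      Failed)    echo "Background copy failed: do not reboot" >&2; return 1 ;;
    esac
    sleep "$interval"
    poll=$((poll + 1))
  done
  echo "Gave up after $max_polls polls: do not reboot" >&2
  return 2
}
```

A wrapper script would only issue the power cycle when `wait_for_background_copy 60` returns 0. Any other status, including an empty response from the BMC, keeps the loop waiting rather than treating silence as success.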
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis power cycle
Wait 5–10 minutes for the server to fully boot.
After the server is back online, update BMC, BIOS, and PSU firmware:

    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
      update_fw -p nvfw_DGX_<version>.fwpkg -y -s update_bmc.json

The motherboard tray uses a different .fwpkg file than the GPU tray. Check the NVIDIA release notes for the correct package name for your system.
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc reset cold
Wait 2–3 minutes, then confirm the BMC is back online:
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> mc info
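
Instead of waiting a fixed time and checking once, the BMC can be probed until it answers. The sketch below is illustrative: `bmc_info` and `wait_for_bmc` are hypothetical names, the credentials are read from assumed environment variables, and the default of 20 attempts at 15-second intervals is an arbitrary choice.

```shell
# bmc_info: the "mc info" probe from this runbook, parameterised via
# environment variables (assumed names).
bmc_info() {
  ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" mc info
}

# wait_for_bmc [INTERVAL_SECONDS] [MAX_TRIES]
# Returns 0 as soon as the BMC answers, 1 if it never does.
wait_for_bmc() {
  interval="${1:-15}"
  max_tries="${2:-20}"
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if bmc_info >/dev/null 2>&1; then
      echo "BMC responded on attempt $try"
      return 0
    fi
    sleep "$interval"
    try=$((try + 1))
  done
  echo "BMC did not respond after $max_tries attempts" >&2
  return 1
}
```

Once `wait_for_bmc` succeeds, run `mc info` once more interactively and record the reported firmware revision in the change ticket.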

    sudo rmmod nvidia_uvm
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia
    # Confirm all modules are unloaded (output must be empty)
    lsmod | grep nvidia

    # Flash GPU at index 0; increment --index for additional GPUs (1, 2, ...)
    sudo nvflash --index=0 new_vbios_firmware.rom
    # NVFlash will prompt: review the current vs new version shown, then type "yes"

sudo reboot

    # DGX / HGX
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> show_version
    # Standalone PCIe
    nvidia-smi -q | grep "VBIOS Version"

Every updated component must now show the new version.
nvidia-smi -L
All expected GPUs must appear with correct model names. No Unknown or error entries.

    nvidia-smi -q       # Check per-GPU: VBIOS version, ECC mode, temperature, power draw
    nvsm show health    # DGX only: all components must show OK

| Field | Expected |
|---|---|
| VBIOS Version | Matches new package version |
| ECC Mode | Enabled |
| Temperature | < 50°C at idle |
| Power Draw | Within TDP spec |
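
The VBIOS row of the table above can be checked for every GPU at once. This is a minimal sketch with hypothetical function names; it assumes `nvidia-smi --query-gpu=index,vbios_version --format=csv,noheader` prints one `index, version` row per GPU, and the target version is whatever your firmware package reports.

```shell
# list_vbios: one "index, vbios_version" row per GPU.
list_vbios() {
  nvidia-smi --query-gpu=index,vbios_version --format=csv,noheader
}

# check_vbios EXPECTED_VERSION
# Returns 0 if every GPU reports EXPECTED_VERSION, 1 if any GPU differs,
# and 2 if the query fails or returns nothing.
check_vbios() {
  expected="$1"
  rows="$(list_vbios)" || { echo "ABORT: VBIOS query failed" >&2; return 2; }
  [ -n "$rows" ] || { echo "ABORT: no GPUs reported" >&2; return 2; }
  # Appending "" forces a string comparison so versions are never compared as numbers.
  mismatches="$(printf '%s\n' "$rows" | awk -F', *' -v want="$expected" \
    'NF >= 2 && ($2 "") != (want "") { print "GPU " $1 ": VBIOS " $2 }')"
  if [ -n "$mismatches" ]; then
    echo "FAIL: GPUs not at $expected:" >&2
    printf '%s\n' "$mismatches" >&2
    return 1
  fi
  echo "OK: all GPUs report VBIOS $expected"
}
```

The remaining fields in the table (ECC mode, temperature, power draw) still need to be reviewed in the `nvidia-smi -q` output.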
nvidia-smi nvlink -s -i 0
All NVLink ports must show Active. Any inactive port must be investigated before restoring workloads.
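
The command above inspects GPU 0 only. A sketch for scanning every GPU follows, with an important caveat: the exact text of `nvidia-smi nvlink -s` output varies by driver version, so this assumes that each link is printed on a line containing `Link <n>:` and that a link that is down contains the word "inactive" (in any letter case). Verify both assumptions against real output on your system before relying on it. The function names are hypothetical.

```shell
# nvlink_status: link status for all GPUs (no -i filter).
nvlink_status() {
  nvidia-smi nvlink -s
}

# check_nvlink
# Returns 0 if link lines exist and none is inactive, 1 if any link is
# inactive, and 2 if the query fails or no link lines are found.
check_nvlink() {
  out="$(nvlink_status)" || { echo "ABORT: nvlink query failed" >&2; return 2; }
  links="$(printf '%s\n' "$out" | awk '/Link [0-9]+:/ { n++ } END { print n + 0 }')"
  if [ "$links" -eq 0 ]; then
    echo "ABORT: no NVLink lines found in output" >&2
    return 2
  fi
  # Case-insensitive match so "Inactive" and "<inactive>" are both caught.
  bad="$(printf '%s\n' "$out" | awk 'tolower($0) ~ /inactive/')"
  if [ -n "$bad" ]; then
    echo "FAIL: inactive NVLink ports:" >&2
    printf '%s\n' "$bad" >&2
    return 1
  fi
  echo "OK: $links NVLink ports reported, none inactive"
}
```

Returning a distinct code when no link lines are found avoids reporting a healthy fabric on systems where the query produced no usable output.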
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
Expected: True and the correct GPU model name.
| Component | Pre-Upgrade | Post-Upgrade | Status |
|---|---|---|---|
| VBIOS | | | Pass / Fail |
| NVSwitch FW | | | Pass / Fail |
| EROT FW | | | Pass / Fail |
| FPGA FW | | | Pass / Fail |
| BMC | | | Pass / Fail |
| System BIOS | | | Pass / Fail |
Only proceed after all checks above pass:
kubectl uncordon <node-name>
Monitor the system for 30–60 minutes after restoring workloads.
Only downgrade if the new firmware causes a confirmed regression. Raise an NVIDIA Support case before attempting.

    # Unload driver
    sudo rmmod nvidia_uvm
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia
    # Flash the saved backup
    sudo nvflash --index=0 gpu0_backup_<date>.rom
    # Reboot
    sudo reboot

Then re-run all Phase 3 validation steps.

    # Step 1: Enable ForceUpdate flag (required to allow downgrade)
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update enable
    # Step 2: Confirm flag is active
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status
    # Step 3: Flash the previous firmware version
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> \
      update_fw -p nvfw_DGX-HGX-H100-H200x8_<old_version>.fwpkg -y
    # Step 4: Wait for BackgroundCopyStatus: Completed, then cold reboot (same as Steps 2.2–2.3)
    # Step 5: MANDATORY: Disable ForceUpdate immediately after
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update disable
    # Step 6: Confirm it is disabled
    nvfwupd -t ip=<BMC_IP> user=<USER> password=<PASS> force_update status

Then re-run all Phase 3 validation steps.
| Symptom | Cause | Action |
|---|---|---|
| “Invalid firmware file” error | Wrong .fwpkg for the component | GPU tray = HGX package; Motherboard tray = DGX package. Re-run show_version to confirm |
| VBIOS update blocked — GPU activity | Process still using the GPU | Run nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv and kill all listed PIDs. Drain the K8s node. Retry the update |
| BackgroundCopyStatus: Failed | Wrong firmware file or transient BMC error | Do not reboot. Retry the update command. Contact NVIDIA Support if retry fails |
| Server does not boot after reboot | Failed or incomplete VBIOS flash | Connect to BMC serial console. Check ipmitool sel list. Contact NVIDIA Support — do not power cycle again |
| GPU missing after upgrade | PCIe or driver issue | Check PCIe with lspci. Check dmesg for Xid errors. Reload driver. Run nvsm show health |
| NVLink ports show Inactive | NVSwitch FW issue or fabric manager down | Cold reboot. Restart nvidia-fabricmanager service. Contact NVIDIA Support if still inactive |
| Firmware version unchanged after reboot | Rebooted before background copy finished | Re-run the update. Wait for BackgroundCopyStatus: Completed. Reboot again |
| “Downgrade not allowed” | ForceUpdate flag not enabled | Follow Section 9.2 — enable ForceUpdate before running the downgrade command |
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Wrong firmware file flashed | Medium | Critical — GPU permanently bricked | Run show_version first; verify filename matches GPU model; check checksum |
| Power loss during flash | Low | Critical — GPU permanently bricked | Stable power only; never upgrade during power maintenance work |
| Workload active during upgrade | Medium | High — update fails or GPU degraded | Verify nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader output is empty before starting |
| Driver / firmware version mismatch | Medium | High — GPU fails to initialise | Check minimum driver requirements table before upgrading |
| Network loss to BMC mid-update | Low | High — incomplete flash | Use wired network; avoid VPN or WiFi for BMC management traffic |
| Previous firmware package unavailable | Medium | High — no rollback path | Save previous .fwpkg before every upgrade |
| ForceUpdate flag left enabled | Low | Medium — security risk | Always disable and verify immediately after any downgrade |
| Reboot before background copy completes | Medium | High — firmware does not activate | Always poll BackgroundCopyStatus until Completed before rebooting |