Topic: Managing Nodes & Software Images

What This Unit Covers

* Creating, adding, and cloning nodes
* Understanding node categories vs node groups
* Creating, cloning, updating, and assigning software images
* Synchronizing changes in both directions (image to node, and node to image)
* Practical considerations for safe rollout and rollback of changes

1. Devices vs Nodes (Foundation Concept)

In NVIDIA Base Command Manager, a device represents a cluster component.

Device types can include:

Devices have properties such as rack position, hostname, and switch port, and the operations you can perform depend on the device type.

Practical distinction:

2. Supported OS Platforms

BCM 11 is available on the following Linux platforms:

By default, the node image OS matches the head node OS, but BCM also supports changing the image OS later.

Mixed-distribution clusters are supported, but NVIDIA notes they are harder to manage and more likely to introduce issues than a consistent distribution strategy.

3. Node Types

Head Node

The head node is the central control system for the cluster. It manages devices such as compute nodes, switches, and PDUs.

It typically provides core cluster services such as:

Minimum hardware requirements for a small cluster:

NVIDIA also notes that while 4 GB is the technical minimum for an x86 head node, a standard bare-metal installation generally runs best with at least 8 GB RAM.

Compute Node

Regular nodes are the non-head nodes managed by the head node. Most regular nodes in a cluster are compute nodes. They typically install automatically through BCM’s node provisioning system.

Minimum compute-node hardware requirements listed by NVIDIA:

4. Why Node Management Matters

BCM is designed so that nodes can be returned to a known state through provisioning and synchronization from a software image. This reduces the need to log into compute nodes directly. It also makes node replacement easier after hardware failure, because the replacement node can be reprovisioned from the head node rather than rebuilt manually.

5. Node Management Operations

Common day-to-day node operations include:

Practical install/update note:

A reboot with default settings usually applies the latest image with an AUTO install. Reinstalling a node applies the latest image with a FULL install and may take longer.

6. Creating Nodes

Using CMSH

In `device` mode, node objects can be added with the `add` command.

NVIDIA’s documented example is:

Using Base View

Base View provides equivalent add functionality through the GUI.

Important configuration details when creating nodes

Some required values may still need to be filled in before the object validates, such as MAC address information.

For regular nodes, BCM expects network details for:

In many clusters, a regular node has one interface that serves both functions:

The BOOTIF address is typically provided by DHCP, while the management interface can be configured with a static IP by the administrator.

7. Node Creation Wizard

Base View also includes a Node Creation Wizard for adding many nodes more efficiently. This is especially useful at scale.

Important distinction:

This is different from the device identification resource, which is used to match discovered MAC addresses and switch ports to node identities.

8. Cloning Nodes (Preferred for Reuse)

Cloning is a convenient way to duplicate a fully configured object instead of rebuilding it by hand.

NVIDIA’s documented cmsh example:

Important caveats from the manual:

This is one of the most useful time-saving operations when building out larger clusters or creating patterned node definitions.

9. Node Categories (Very Important)

A node category is a group of regular nodes that share the same configuration.

Categories are one of the main scaling mechanisms in BCM because they let you:

Every regular node is in exactly one category at all times. By default, nodes are placed in the default category.

Nodes are typically separated into categories based on:

This is why categories are the main administrative control layer in BCM.

10. Category Inheritance

BCM uses category-level values as the default configuration for nodes.

For non-Boolean values:

For Boolean values:

This matters because it explains why some settings feel directly inherited while others behave more like enablement flags.
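The distinction can be made concrete with a small model. The sketch below is not BCM code, and the two rules it encodes are assumptions that should be checked against the BCM manual: a non-Boolean value set on the node overrides the category value, and a Boolean value is treated as enabled if it is enabled at either level. The function name `effective_value` and the image name `gpu-image` are hypothetical.

```python
from typing import Optional, Union

Setting = Union[str, int, bool]


def effective_value(category_value: Setting, node_value: Optional[Setting]) -> Setting:
    """Resolve one setting for a node using the assumed rules in the comments."""
    # Assumed Boolean rule: logical OR, so a flag enabled at either level
    # is enabled for the node.
    if isinstance(category_value, bool):
        return category_value or bool(node_value)
    # Assumed non-Boolean rule: a value set on the node wins. None stands
    # for "not set at node level", in which case the category value applies.
    return category_value if node_value is None else node_value


print(effective_value("default-image", None))         # category value applies
print(effective_value("default-image", "gpu-image"))  # node value overrides
print(effective_value(False, True))                   # flag enabled at node level
```

Under these assumptions, a Boolean flag that is enabled on the category cannot be switched off for a single node, which is the "enablement flag" behavior described above.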

11. Categories and Software Images Do Not Have to Match One-to-One

A category has a software image as one of its configuration properties. But NVIDIA explicitly notes there is no requirement for a one-to-one relationship between categories and images. This means:

This flexibility is powerful, but it also means administrators must be deliberate about category design so image assignments stay understandable.

12. Node Groups (Different from Categories)

A node group is a convenience grouping of nodes. A node group can contain any mix of nodes, regardless of category. A node can belong to zero or more node groups at the same time.

Node groups are mainly for operating on sets of nodes together. They are not for shared configuration, because the nodes in a group do not necessarily have the same configuration.

That is the core difference: categories define shared configuration, while node groups only collect nodes so they can be operated on together.

NVIDIA also notes node groups have use in provisioning-role configuration.
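The membership rules for categories and node groups can be sketched as a data structure. This is an illustrative model only, not a BCM API: the class `Cluster`, its methods, and the group names are hypothetical. It encodes the two rules stated in these notes: each regular node is in exactly one category, and a node can be in zero or more node groups.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Cluster:
    # node name -> category name: a mapping enforces exactly one category per node
    category_of: Dict[str, str] = field(default_factory=dict)
    # group name -> member node names: a node may appear in any number of groups
    groups: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, node: str, category: str = "default") -> None:
        self.category_of[node] = category

    def set_category(self, node: str, category: str) -> None:
        # Moving a node replaces its category; it never has two at once.
        self.category_of[node] = category

    def add_to_group(self, group: str, node: str) -> None:
        if node not in self.category_of:
            raise KeyError(f"unknown node {node!r}")
        self.groups.setdefault(group, set()).add(node)

    def groups_of(self, node: str) -> Set[str]:
        return {name for name, members in self.groups.items() if node in members}


cluster = Cluster()
cluster.add_node("node001")                  # lands in the default category
cluster.add_node("node002", category="gpu")
cluster.add_to_group("rack1", "node001")     # one group mixes two categories
cluster.add_to_group("rack1", "node002")
cluster.add_to_group("maintenance", "node001")
print(cluster.category_of["node001"], sorted(cluster.groups_of("node001")))
```

Storing the category as a single value per node and the groups as sets of members mirrors the difference: configuration comes from one place, while group membership is free-form.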

13. Software Images (Critical Concept)

A software image is the blueprint for the contents of the local filesystems on a regular node. In practice, a software image is a directory on the head node that contains a full Linux filesystem. BCM image directories are commonly managed under `/cm/images`. In a standard installation:

When a regular node boots, the provisioning system sets it up using a copy of that software image. Once the node is fully booted, BCM can also re-synchronize the node filesystem from the image without requiring a reboot. Software images can also be locked to prevent nodes from picking them up until the image is unlocked.

14. Why Cloning Software Images Is Best Practice

NVIDIA explicitly recommends cloning images before making larger changes. Reasons:

A careful administrator typically clones a known-good image before modifying it. This is especially important in production clusters where changing the active image directly increases risk. ([NVIDIA Docs][1])
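A minimal sketch of the clone-first workflow follows. It models images as sets of package names and categories as pointers to images, so it is a stand-in for the idea rather than BCM tooling. The helper names, the image name `default-image-test`, and the package `example-driver` are hypothetical.

```python
import copy
from typing import Dict, Set

Images = Dict[str, Set[str]]  # image name -> package names (stand-in for a file tree)


def clone_image(images: Images, src: str, dst: str) -> None:
    """Create an independent copy of an image under a new name."""
    if dst in images:
        raise ValueError(f"image {dst!r} already exists")
    images[dst] = copy.deepcopy(images[src])


def assign_image(category_image: Dict[str, str], category: str, image: str) -> str:
    """Point a category at an image and return the previous image for rollback."""
    previous = category_image[category]
    category_image[category] = image
    return previous


images: Images = {"default-image": {"kernel", "openssh"}}
category_image = {"default": "default-image", "gpu": "default-image"}

clone_image(images, "default-image", "default-image-test")
images["default-image-test"].add("example-driver")   # change only the clone
previous = assign_image(category_image, "gpu", "default-image-test")

# Rollback is a reassignment; the known-good image was never modified.
assign_image(category_image, "gpu", previous)
print(category_image["gpu"], sorted(images["default-image"]))
```

Because the original image is untouched, rolling back does not require undoing any changes inside an image, only pointing the category back at the earlier one.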

15. Updating Software Images

General principle

Software images can be changed with regular Linux tools and commands. NVIDIA documents image modification through chroot-based workflows.

Examples from the BCM manual

For RHEL and derivatives:

For SLES:

For Ubuntu:

For non-packaged software, NVIDIA also documents direct placement into the image and chroot-based installation. When appropriate, NVIDIA recommends organizing shared software content under:

16. Updating Running Nodes from the Software Image

CMSH

BCM uses `imageupdate` to synchronize a running node from its image. NVIDIA’s documented example:

By default, this is a dry run. BCM tells you to review the result with `synclog`, then rerun with:

The `-w` switch performs the actual write.
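The dry-run-then-write pattern can be illustrated with a toy synchronizer. This sketch is not how BCM implements `imageupdate`; it only models the idea that the first pass reports differences and the second pass applies them. Files are represented as a dictionary of path to content, deletions and exclude lists are deliberately left out, and the function name `sync_from_image` is hypothetical.

```python
from typing import Dict, List


def sync_from_image(image: Dict[str, str], node: Dict[str, str],
                    write: bool = False) -> List[str]:
    """Report files that differ from the image; change the node only if write is True."""
    log: List[str] = []
    for path, content in sorted(image.items()):
        if node.get(path) != content:
            action = "update" if path in node else "create"
            log.append(f"{action} {path}")
            if write:
                node[path] = content
    return log


image = {"/etc/motd": "v2", "/usr/bin/tool": "1"}
node = {"/etc/motd": "v1"}

planned = sync_from_image(image, node)               # dry run: node is untouched
print(planned, node)
applied = sync_from_image(image, node, write=True)   # same report, now applied
print(applied, node)
```

Running the dry pass first gives the administrator a chance to review exactly what would change before anything on the node is modified.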

Base View

Base View provides an Update Node action for the same operation.

Important operational notes

If provisioners have not been updated recently, BCM may first run `updateprovisioners`. NVIDIA notes that running `updateprovisioners` yourself just before `imageupdate` often makes sense, especially if a new image was recently created. For more extensive changes, NVIDIA says it can be safer to reboot nodes rather than rely only on `imageupdate`, because rebooting ensures the node boots into the latest image and restarts services cleanly.

17. If You Change an Image Outside Base View or CMSH

If the image is changed through CMDaemon-aware front ends such as Base View or cmsh, BCM handles provisioning-image propagation automatically. If the image is changed outside those front ends, such as by copying files directly into the image from a bash prompt, NVIDIA says `updateprovisioners` should be run manually. This is an easy thing to forget and is a common reason admins do not see changes propagate the way they expect.
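The propagation rules in this section and the previous one can be summarized as a small decision helper. This is a sketch of the reasoning in these notes, not a BCM tool: the function name and its two parameters are hypothetical, and the returned entries are descriptive step labels rather than exact command lines.

```python
from typing import List


def propagation_steps(changed_via_frontend: bool, extensive_change: bool) -> List[str]:
    """Return the ordered steps suggested by these notes for rolling out an image change."""
    steps: List[str] = []
    if not changed_via_frontend:
        # Changes made outside Base View or cmsh are not propagated automatically.
        steps.append("updateprovisioners")
    if extensive_change:
        # For larger changes a reboot is described as the safer route.
        steps.append("reboot nodes")
    else:
        steps += ["imageupdate (dry run)", "synclog (review)", "imageupdate -w"]
    return steps


# A file copied into the image from a bash prompt, small change:
print(propagation_steps(changed_via_frontend=False, extensive_change=False))
```

The helper makes the easily forgotten case explicit: a change made from a shell prompt adds the manual `updateprovisioners` step before anything else.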

18. Node-to-Image Synchronization

What it is

This is the reverse direction:

Why it is used

Critical warning

19. grabimage Command

In cmsh, the command for node-to-image synchronization is `grabimage`.

`grabimage` without `-i`

`grabimage -i <image>`

`grabimage` without `-w`

`grabimage -w`

NVIDIA’s documented example:

`grabimage -w -i default-image1 node001`
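When these commands are generated from scripts, it can help to assemble the command line in one place so the `-w` and `-i` switches are always explicit choices. The helper below is a hypothetical sketch: it only builds the string in the same argument order as the documented example and does not run cmsh or validate the node or image names.

```python
import shlex
from typing import Optional


def grabimage_command(node: str, image: Optional[str] = None,
                      write: bool = False) -> str:
    """Build a grabimage command line for use inside cmsh device mode."""
    args = ["grabimage"]
    if write:
        args.append("-w")          # only added when the caller asks for it
    if image is not None:
        args += ["-i", image]      # target image, when one is named
    args.append(node)
    return shlex.join(args)        # quotes any argument that needs it


print(grabimage_command("node001", image="default-image1", write=True))
```

Defaulting `write` to `False` means a caller has to opt in to `-w` deliberately, which fits the caution these notes attach to grabbing from a running node.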

BCM also distinguishes exclude lists for:

20. What grabimage Includes and Excludes

NVIDIA notes that node-to-image sync:

Even with exclusion lists, NVIDIA warns that some unwanted changes may still get captured. That is why grabbing from a running node is powerful, but also riskier than editing the image directly.

21. Advanced Note: Software Image Revisions

BCM also supports revision control in `softwareimage` mode. A new revision can be created with `newrevision`. Revisions are named in the form:

`<parent-image>@<revision-number>`

NVIDIA documents that a new Btrfs subvolume is created for that revision. Categories can be pointed to:

This is an advanced but very useful feature for controlled lifecycle management of images.
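Scripts that track revisions often need to split a revision name back into its parts. The sketch below parses and formats names of the form `<parent-image>@<revision-number>`. It assumes the revision number is a non-negative decimal integer, which these notes do not state explicitly, and the names `ImageRevision`, `parse_revision`, and `format_revision` are hypothetical.

```python
from typing import NamedTuple


class ImageRevision(NamedTuple):
    parent: str
    number: int


def parse_revision(name: str) -> ImageRevision:
    """Split '<parent-image>@<revision-number>' into its two parts."""
    parent, sep, number = name.rpartition("@")
    # Assumption: the part after the last '@' is a plain decimal number.
    if not sep or not parent or not number.isdecimal():
        raise ValueError(f"not a revision name: {name!r}")
    return ImageRevision(parent, int(number))


def format_revision(revision: ImageRevision) -> str:
    """Rebuild the revision name from its parts."""
    return f"{revision.parent}@{revision.number}"


rev = parse_revision("default-image@1")
print(rev.parent, rev.number, format_revision(rev))
```

Splitting on the last `@` keeps the parser tolerant of parent image names that themselves contain the character, although whether BCM allows such names is not covered here.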

22. Key Takeaways

23. What Actually Matters in Real Life

Node not behaving as expected

Update did not apply

You may need to run `updateprovisioners` before expecting nodes to receive recent image changes.

Cluster inconsistency

[1]: https://docs.nvidia.com/dgx/baseos-on-bcm-install-guide/managing-images-bcm.html "Managing Images in BCM — Creating a BaseOS image in Base Command Manager"
