wiki:ai:nvidia_bcm-intro — created 2026/04/07 14:14, last modified 2026/04/07 16:49 (current) by scouto
**Base Command Manager (BCM) Intro**

**What NVIDIA Base Command Manager Is**

NVIDIA Base Command Manager (BCM) is NVIDIA’s cluster management platform for AI and HPC environments. Its purpose is to streamline the deployment, administration, and ongoing operation of clusters. NVIDIA describes BCM as providing the tools needed to **deploy and manage an AI data center**, not just a few isolated servers. BCM is designed for environments that need repeatable deployment, centralized administration, and consistent operation over time. ([NVIDIA Docs][1])
| + | |||
| + | **Core Idea Behind BCM** | ||
| + | |||
| + | BCM is meant to manage the **full cluster lifecycle**: | ||
| * initial installation | * initial installation | ||
| Line 20: | Line 21: | ||
| * accounting and chargeback | * accounting and chargeback | ||
| * scaling and cloud extension | * scaling and cloud extension | ||
| - | * That is an important point: BCM is not just an installer. It is an ongoing **operations and lifecycle management framework** for clusters. ([NVIDIA Docs][2]) | ||
**What BCM Helps Administrators Do**

BCM contains tools and applications to help administrators:

* install a cluster
* monitor nodes and jobs
* keep the environment consistent over time

NVIDIA also positions BCM as a way to provide users with an optimized environment for running compute-intensive applications. In other words, BCM is not only about infrastructure control, but also about creating a usable and reliable compute platform for end users. ([NVIDIA Docs][2])
**Centralized Management Model**

BCM centralizes cluster control rather than forcing administrators to manage each node individually. Two major management interfaces are provided:

* **Base View**: the graphical web interface
* **cmsh**: the command-line shell

These interfaces expose the same cluster management functionality, so administrators can use whichever interface suits the task. Base View is installed by default and provides an encrypted web interface for cluster administration and monitoring. ([NVIDIA Docs][2])
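Because both interfaces drive the same management layer, the same query can be issued interactively in `cmsh` or scripted from the head node. A brief sketch — the prompt and output format vary by BCM version:

```
# Interactive: enter device mode, then list the managed devices
cmsh
[headnode]% device
[headnode->device]% list

# Scripted: run the same commands non-interactively
cmsh -c "device; list"
```

The scripted form is what makes cmsh useful inside automation, while Base View exposes the same objects graphically.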
**What Types of Systems BCM Can Manage**

BCM is built to manage more than just standard compute nodes. NVIDIA documentation shows BCM can represent and manage device types such as:

* head nodes
* regular (compute) nodes
* GPU units
* switches

That matters because BCM is intended for real cluster environments, where the managed infrastructure includes more than servers alone. ([NVIDIA Docs][2])
**Platform and Architecture Support**

**Operating System Support**

NVIDIA BCM 11 supports several Linux distributions, including:

* Rocky Linux 8 and 9
* SUSE Linux Enterprise Server 15
* Ubuntu 22.04 and 24.04

The installation documentation also notes support for:

* Red Hat Enterprise Linux derivatives
* DGX OS 7

BCM is built on top of major Linux distributions rather than being its own standalone operating system. ([NVIDIA Docs][1])
**Processor Architecture Support**

BCM supports:

* **x86_64**
* **arm64 / AArch64**

NVIDIA states that BCM aims to be largely independent of operating system and processor architecture, which lets the same management model span differing hardware platforms. ([NVIDIA Docs][1])
**Mixed Operating System Environments**

BCM can create software images using different supported Linux distributions for compute nodes. NVIDIA recommends keeping compute nodes on the same distribution as the head node when possible, because mixed-distribution clusters are generally harder to manage. So the practical exam and operations takeaway is:

* BCM is flexible enough for mixed environments
* but same-OS deployments are simpler and usually preferred ([NVIDIA Docs][1])
**Head Node and Software Image Model**

**Head Node Role**

BCM is typically installed on:

* a single head node, or
* a high-availability head-node pair

The head node is the control point for provisioning and management. In common cluster designs, the head node also provides key cluster services needed for provisioning and operation. ([NVIDIA Docs][3])
**Software Images**

One of the most important BCM concepts is the **software image**. NVIDIA describes a software image as a blueprint for a node’s filesystem. It lives on the head node and is used to provision regular nodes. The default image is commonly named **default-image**. When a regular node boots for provisioning, it receives its filesystem from the software image assigned to it. This is a major reason BCM helps prevent drift: nodes are not all manually configured one by one; they are built from managed images. ([NVIDIA Docs][2])
**Why Software Images Matter**

Software images make cluster administration more repeatable and scalable. Instead of logging into every server and making one-off changes, administrators can:

* update the image
* reprovision or resynchronize nodes
* keep systems aligned with a known desired state

NVIDIA documents that filesystem changes from the image can be redistributed to nodes through synchronization, so nodes can be brought back in line with their image without a full reinstall. ([NVIDIA Docs][2])
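That image-centric workflow can be sketched from the command line. A hedged example — the node name is a placeholder, and the exact options for your release are in the Administrator Manual:

```
# List the software images stored on the head node
cmsh -c "softwareimage; list"

# After changing an image, push the filesystem changes out to a
# running node instead of fully reinstalling it (-w waits for completion)
cmsh -c "device; use node001; imageupdate -w"
```

The point of the pattern: the image is edited once centrally, and nodes converge toward it.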
**Installation and Initial Cluster Bring-Up**

**Installation Flow**

BCM installation can begin through either:

* a graphical installer, or
* a text-based installer

During installation, the administrator defines cluster-wide settings such as:

* nameservers
* networking behavior
* BMC configuration

This is why BCM installation is more than just putting software on one box. It is the start of building the management plane for the whole cluster. ([NVIDIA Docs][3])
**Services Commonly Provided by the Head Node**

NVIDIA documentation shows BCM commonly manages or generates configuration for services such as:

* DHCP
* DNS
* TFTP
* NFS
* LDAP-related services

In the default service role set, services such as `dhcpd`, `named`, `slapd`, `nfs`, and `cuda-dcgm` appear as part of BCM’s managed cluster service model. This is a key exam concept: the head node is not just a login box. It is often the operational backbone of provisioning and cluster services. ([NVIDIA Docs][2])
**Network Boot and Provisioning**

BCM uses network boot workflows for provisioning nodes. NVIDIA documents DHCP and TFTP as part of the boot and provisioning path. In a common Type 1 network setup, the head node provides DHCP and PXE-related services to an isolated worker-node network. This is what enables BCM to discover and provision systems in a more automated way than manual per-node installs. ([NVIDIA Docs][3])
**Why BCM Exists: Operational Value**

**1. Faster, More Repeatable Provisioning**

BCM reduces manual cluster build work by using:

* centralized installation
* software images
* automated node provisioning

The practical benefit is that a cluster can be built and expanded with much more consistency than a manual “install every server separately” model. ([NVIDIA Docs][1])
**2. Reduced Administrative Complexity**

BCM gives administrators one control plane for many tasks that would otherwise be scattered across scripts, separate tools, and manual node access. Instead of treating every node as a special case, BCM encourages image-based, centrally managed administration. Base View and cmsh both support this centralized model. ([NVIDIA Docs][2])

**3. Better Consistency Across the Cluster**

Because nodes are provisioned from centrally managed images, BCM helps reduce configuration drift. Image locking, synchronization, and provisioning controls all support keeping nodes in a known state. ([NVIDIA Docs][2])

**4. Better Visibility Into Cluster Health**

BCM is not blind after deployment. It includes cluster monitoring, job monitoring, health checks, accounting, and chargeback capabilities. That means it supports both infrastructure operations and operational reporting. ([NVIDIA Docs][2])
**Monitoring, Health Checks, and Visibility**

**Cluster Monitoring**

BCM includes built-in monitoring capabilities for cluster resources. Base View includes dashboards and monitoring views, and NVIDIA documents support for PromQL queries and drilldowns. BCM can also interact with Prometheus-style monitoring workflows through CMDaemon exporter capabilities. ([NVIDIA Docs][2])
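Where a Prometheus-style exporter is being scraped, GPU utilization can be summarized with an ordinary PromQL query. A sketch only: the metric and label names below follow NVIDIA’s DCGM exporter conventions and may differ depending on how the exporter is configured in a given cluster.

```
# Average GPU utilization per node over the last five minutes
avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```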
**Job Monitoring**

BCM supports **job-centric monitoring**, not only node-centric monitoring. NVIDIA documents that BCM begins updating job information in its database as soon as it detects a new job. Job monitoring includes visibility into resource usage such as CPU, memory, storage, and other job-related metrics. ([NVIDIA Docs][2])
**Health Checks**

Health checks are an important BCM capability. NVIDIA documents that health checks can run as:

* scheduled checks
* pre-job checks
* post-job checks

A very important exam point:

* if a **pre-job** health check fails, the job can be prevented from running on that node
* a common corrective action is to drain that node

BCM health checks can also launch actions and support notification behavior. NVIDIA also provides example health check scripts, which shows the platform supports extension and customization. ([NVIDIA Docs][2])
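Custom health checks are typically small scripts that report a simple pass/fail result. A minimal standalone sketch in that spirit — not an actual BCM-shipped script — checking root filesystem usage (the 90% default threshold is an arbitrary example):

```shell
#!/bin/bash
# Hypothetical pre-job-style health check: PASS when root filesystem
# usage is below a threshold, FAIL otherwise.
check_root_disk() {
    local threshold="${1:-90}"   # percent-used limit (example default)
    local used
    # df --output=pcent prints e.g. " 42%"; strip everything but digits
    used=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
    if [ "${used:-0}" -lt "$threshold" ]; then
        echo PASS
    else
        echo FAIL
    fi
}

check_root_disk "$@"
```

A failing result on a pre-job check is what would let the platform keep the job off that node and drain it for attention.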
**Workload Managers and Orchestration**

**Workload Manager Integration**

BCM supports workload management setup from the cluster management plane. NVIDIA documents setup tooling such as `cm-wlm-setup`. The Administrator Manual includes workload manager chapters for:

* Slurm
* PBS
* LSF

That means workload orchestration is not an afterthought. It is built into BCM’s operating model. ([NVIDIA Docs][3])
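In practice the tooling is run from the head node. A sketch — by default the tool runs as a guided, interactive setup, and the exact screens and options vary by BCM release:

```
# Launch the guided workload manager setup from the head node,
# then choose the workload manager (e.g. Slurm) in the wizard
cm-wlm-setup
```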
**Kubernetes Integration**

BCM also supports Kubernetes-related integration. NVIDIA documentation references `cm-kubernetes-setup` for streamlining Kubernetes-related deployment flows. This matters because BCM is meant to support modern AI/HPC environments that may use both traditional schedulers and container orchestration platforms. ([NVIDIA Docs][4])
**Auto Scaling and Dynamic Resource Use**

**cm-scale / Auto Scaler**

NVIDIA documents **cm-scale** as BCM’s mechanism for dynamic scaling based on workload demand. It can work with:

* static node providers
* dynamic node providers

BCM can use this to:

* power nodes on or off
* clone or remove dynamic nodes
* repurpose nodes for different categories or overlays

This is a major concept: BCM can adjust cluster resources according to demand instead of forcing a permanently fixed hardware role for every node. ([NVIDIA Docs][2])
**Reprovisioning and Repurposing**

NVIDIA documents that when job demand changes, BCM can:

* drain a node
* move it to a different category
* move it to different overlays
* reboot or boot it into a new role

In practical terms, that means the same physical infrastructure can be repurposed to serve different workload needs. ([NVIDIA Docs][2])
**Kubernetes and Queue Awareness**

BCM’s scaling model is not limited to just one scheduler type. NVIDIA documents queue-based thresholds for cloudbursting and also notes namespace tracking support for Kubernetes jobs and pods. So the platform is designed to react to workload pressure, not just machine states. ([NVIDIA Docs][2])
**Cloud Capabilities**

**Cluster On Demand**

BCM cloud documentation for current releases includes **Cluster On Demand** workflows for multiple cloud providers, including:

* AWS
* Azure
* OCI
* GCP

This allows BCM-managed cluster capacity to be created in cloud environments using BCM tooling rather than treating cloud as a completely separate management domain. ([NVIDIA Docs][5])
**Cluster Extension**

BCM also supports **Cluster Extension**, which adds cloud-based nodes to an existing BCM-managed cluster. NVIDIA specifically documents cluster extension to **AWS** and **Azure**. An important distinction:

* in Cluster Extension, the **head node remains outside the cloud**
* cloud nodes are added as an extension of the managed cluster

That makes Cluster Extension a hybrid-cloud capacity model rather than a fully cloud-native isolated cluster. ([NVIDIA Docs][5])
**Edge Capabilities**

**BCM at the Edge**

BCM includes edge computing capabilities documented in a dedicated Edge Manual. NVIDIA describes edge computing as useful when local processing is needed because of:

* low latency requirements
* high bandwidth demands
* high CPU-cycle needs

BCM’s edge model allows a central cluster to manage resources distributed across multiple locations. ([NVIDIA Docs][6])
**One Cluster, Multiple Locations**

NVIDIA explicitly describes the edge model as **one cluster, multiple locations**. In this design:

* a central head node manages the broader cluster
* edge sites include an edge director and edge nodes
* edge nodes can PXE boot from the edge director

NVIDIA also notes that direct connectivity between the central head node and edge nodes is not required in the same way, because the edge director acts as the local control point. ([NVIDIA Docs][6])
**NVIDIA-Specific GPU Integration**

**DCGM and GPU Awareness**

BCM has direct NVIDIA GPU integration through `cuda-dcgm`. NVIDIA documentation states that the `cuda-dcgm` package is required to access NVIDIA Data Center GPU Manager (DCGM) functionality. That is important because BCM is not merely cluster-aware; it is GPU-aware. ([NVIDIA Docs][2])
**GPU Metrics**

BCM can collect and expose GPU metrics through its monitoring system. NVIDIA documentation shows examples of GPU metrics and GPU-related monitoring fields such as utilization, memory, and temperature. ([NVIDIA Docs][2])
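Those measurables can be inspected per node from cmsh. A hedged sketch — the node name is a placeholder, and the exact measurable names depend on the hardware and BCM version:

```
# Show the most recent monitoring samples for a node and pick out
# the GPU-related measurables
cmsh -c "device; use node001; latestmonitoringdata" | grep -i gpu
```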
**GPU Health Checks**

NVIDIA documents multiple GPU health checks, including checks related to:

* overall GPU health
* thermal
* PMU

This makes BCM especially relevant for DGX and other GPU cluster environments where hardware acceleration health directly affects workload success. ([NVIDIA Docs][2])
**CUDA and Software Environment Management**

**Multiple CUDA Versions**

BCM supports management of installed CUDA versions and environment control through modules. NVIDIA documentation shows multiple CUDA versions being available in the environment and selected as needed. This matters in real AI/HPC environments because different applications may require different CUDA versions. BCM supports that kind of controlled multi-version environment rather than forcing one CUDA stack for everyone. ([NVIDIA Docs][3])
**Environment Modules**

BCM supports environment modules, with Tcl Modules as the default module system. Administrators can configure default modules cluster-wide. This is an important usability feature because it standardizes how users load software stacks such as compilers, libraries, and CUDA-related components. ([NVIDIA Docs][3])
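From a user’s shell, selecting a CUDA stack then looks like any other environment-modules workflow. A sketch — the module name below is an example only; actual names depend on what is installed on the cluster:

```
# See which modules (compilers, libraries, CUDA toolkits, ...) exist
module avail

# Load a specific CUDA toolkit for this shell session
module load cuda12.3/toolkit
```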
**Accounting and Chargeback**

**Job Accounting**

BCM supports job accounting, which NVIDIA distinguishes from raw job monitoring. Monitoring focuses on resource consumption per job as it runs. Accounting aggregates resource consumption by groupings such as:

* user
* group
* account
* job-related dimensions

This gives administrators and organizations a way to understand resource use at an ownership or cost-center level. ([NVIDIA Docs][2])
**Chargeback**

BCM includes chargeback reporting capabilities. NVIDIA documents predefined chargeback models based on metrics such as:

* CPU second
* GPU second
* memory byte-second
* power watt-second

Chargeback reports can also be grouped by fields such as user, group, account, job name, and job ID. That makes BCM useful not just for technical operations, but also for internal cost allocation and usage reporting. ([NVIDIA Docs][2])
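Each of those models reduces to rate-times-usage arithmetic over the accounted metric. A tiny sketch of the idea — the rate and usage figures are invented for illustration and are not NVIDIA pricing:

```shell
# charge <usage-units> <rate-per-unit> -> cost, two decimal places
charge() {
    awk -v u="$1" -v r="$2" 'BEGIN { printf "%.2f", u * r }'
}

# Example: a job that consumed 7200 CPU-seconds (2 CPU-hours),
# billed at a made-up rate of 0.0005 per CPU-second
charge 7200 0.0005   # -> 3.60
```

The same arithmetic applies to GPU-seconds, byte-seconds, or watt-seconds; only the unit and rate change.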
**Takeaways**

BCM is best understood as **end-to-end cluster lifecycle management**, not just an installer. It combines:

* installation
* provisioning
* monitoring and health checks
* accounting and chargeback
* scaling and cloud extension
* GPU-aware operations ([NVIDIA Docs][1])
**What Makes BCM Strong**

The biggest strengths of BCM are:

* centralized management
* image-based provisioning
* built-in monitoring and health checking
* deep NVIDIA GPU integration ([NVIDIA Docs][7])
**High-Value Concepts to Remember**

* **Head node** is the main management and provisioning control point.
* **Software image** is the blueprint used to provision nodes.
* **Base View** is the main GUI; **cmsh** is the CLI.
* **Pre-job health check failure** can block a job from running on a node.
* **cm-scale** supports dynamic scaling and node repurposing.
* **Cluster Extension** keeps the head node outside the cloud while adding cloud capacity.
* **cuda-dcgm** is important for NVIDIA GPU metrics and GPU-aware monitoring. ([NVIDIA Docs][2])
| + | |||
| + | [1]: [[https:// | ||
| + | |||
| + | [2]: [[https:// | ||
| + | |||
| + | [3]: [[https:// | ||
| + | |||
| + | [4]: [[https:// | ||
| + | |||
| + | [5]: [[https:// | ||
| + | |||
| + | [6]: [[https:// | ||
| - | * **Head node** is the main management and provisioning control point. | + | [7]: [[https:// |
| - | * **Software image** is the blueprint used to provision nodes. | + | |
| - | * **Base View** is the main GUI; **cmsh** is the CLI. | + | |
| - | * **Pre-job health check failure** can block a job from running on a node. | + | |
| - | * **cm-scale** supports dynamic scaling and node repurposing. | + | |
| - | * **Cluster Extension** keeps the head node outside the cloud while adding cloud capacity. | + | |
| - | * **cuda-dcgm** is important for NVIDIA GPU metrics and GPU-aware monitoring. ([NVIDIA Docs][2]) | + | |
| + | [[wiki: | ||
| + | [[wiki: | ||
| - | [1]: https:// | ||
| - | [2]: https:// | ||
| - | [3]: https:// | ||
| - | [4]: https:// | ||
| - | [5]: https:// | ||
| - | [6]: https:// | ||
| - | [7]: https:// | ||