The Container Engine (CE) toolset is designed to enable computing jobs to seamlessly run inside Linux application containers, thus providing support for containerized user environments.
Containers effectively encapsulate a software stack; however, to be useful in HPC computing environments, they often require the customization of bind mounts, environment variables, working directories, hooks, plugins, etc. To simplify this process, the Container Engine (CE) toolset supports the specification of user environments through Environment Definition Files.
An Environment Definition File (EDF) is a text file in the TOML format that declaratively and prescriptively represents the creation of a computing environment based on a container image. Users can create their own custom environments and share, edit, or build upon already existing environments.
The Container Engine (CE) toolset leverages its tight integration with the Slurm workload manager to parse EDFs directly from the command line or batch script and instantiate containerized user environments seamlessly and transparently.
Through the EDF, container use cases can be abstracted to the point where end users perform their workflows as if they were operating natively on the computing system.
Key Benefits:
Let's set up a containerized Ubuntu 24.04 environment using a scratch folder as the working directory.
image = "library/ubuntu:24.04"
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
workdir = "/capstor/scratch/cscs/<username>"
Note: Replace the <username> placeholder in the example above with your actual username.
Save this file as ubuntu.toml in the $HOME/.edf directory (the default location of EDF files). Each EDF entry is explained in more detail in the EDF reference.
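The file can also be generated from the shell. The following is a minimal sketch, not part of the CE: the helper name write_ubuntu_edf and its optional second argument are hypothetical, and it assumes that the scratch folder is located at /capstor/scratch/cscs/<username>, as in the example above.

```shell
# Write the quick-start EDF for a given user.
# Assumptions for this sketch: the helper name, and a scratch folder located at
# /capstor/scratch/cscs/<username> as in the example above.
write_ubuntu_edf() {
    local username="$1"
    local edf_dir="${2:-$HOME/.edf}"   # default EDF location
    local scratch_dir="/capstor/scratch/cscs/${username}"

    mkdir -p "$edf_dir" || return 1
    cat > "$edf_dir/ubuntu.toml" <<EOF
image = "library/ubuntu:24.04"
mounts = ["${scratch_dir}:${scratch_dir}"]
workdir = "${scratch_dir}"
EOF
}
```

Calling write_ubuntu_edf "$USER" on the login node writes $HOME/.edf/ubuntu.toml, provided that $USER matches the name of the scratch folder.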
Use Slurm on the cluster login node to start the Ubuntu environment that was just defined:
$ srun --environment=ubuntu --pty bash
Since the ubuntu.toml file is located in the EDF search path, the filename can be passed to the option without the file extension.
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=ubuntu --pty bash

<username>@<node name>:/capstor/scratch/cscs/<username>$ pwd
/capstor/scratch/cscs/<username>

<username>@<node name>:/capstor/scratch/cscs/<username>$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

<username>@<node name>:/capstor/scratch/cscs/<username>$ exit
[<vcluster>][<username>@<vcluster>-ln001 ~]$
The above terminal snippet demonstrates how to launch a containerized environment using Slurm with the --environment option, where we highlight:
Line 1. Starting an interactive shell session within the Ubuntu 24.04 container deployed on a compute node using srun --environment=ubuntu --pty bash.
Line 3. Confirming that the working directory inside the container (pwd) is set to the user's scratch folder, as per the EDF.
Line 6. Showing the OS version of the container (using cat /etc/os-release), which is based on Ubuntu 24.04 LTS.
Line 21. Exiting the container (exit), returning to the login node.
Note that the image pull and the container start happen automatically, streamlining the usage of the CE.
A job is run in a containerized environment by passing the --environment option to the srun or salloc Slurm commands. The option takes a file path to the EDF describing the environment in which the job should be executed, for example:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=$SCRATCH/edf/debian.toml cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
The --environment argument can be a path relative to the current working directory (i.e., where the Slurm command is entered). A relative path should be prefixed with ./, for example:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ ls
debian.toml
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=./debian.toml cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
If a file is located in the EDF search path, the argument to the command line option can be just the environment name, that is, the name of the file without the .toml extension, for example:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=debian cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
In principle, the --environment option can also be used within batch scripts as an #SBATCH option.
It is important to note that in such a case, all the contents of the script are executed within the containerized environment. The CE toolset gives access to the Slurm workload manager within containers via the Slurm hook (see section Container Hooks), which is controlled by the ENROOT_SLURM_HOOK environment variable and activated by default on most vClusters. Calls to Slurm commands (for example srun or scontrol) within the batch script work only when this hook is active.
For the time being, if the script needs to invoke Slurm commands, the recommended approach is to use --environment as part of the commands, for example, when launching job steps:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat example.sbatch
#!/bin/bash -l
#SBATCH --job-name=edf-example
#SBATCH --time=0:01:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=<vcluster>
#SBATCH --output=slurm-%x.out

# Run job step
srun --environment=debian cat /etc/os-release
By default, the EDFs for each user are looked up in $HOME/.edf. The search path for EDFs can be controlled through the EDF_PATH environment variable. EDF_PATH must be a colon-separated list of absolute paths to directories where the CE looks for TOML files, similar to the PATH and LD_LIBRARY_PATH variables. If a file is located in the search path, its name can be used in --environment options without the .toml extension, for example:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ ls -l ~/.edf
total 8
-rw-r--r-- 1 <username> csstaff 27 Sep 6 15:19 debian.toml
[<vcluster>][<username>@<vcluster>-ln001 ~]$ ls -l ~/example-project/
total 4
-rw-r-----+ 1 <username> csstaff 28 Oct 26 17:44 fedora-env.toml
[<vcluster>][<username>@<vcluster>-ln001 ~]$ export EDF_PATH=$HOME/example-project/
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=fedora-env cat /etc/os-release
NAME="Fedora Linux"
VERSION="40 (Container Image)"
ID=fedora
VERSION_ID=40
VERSION_CODENAME=""
PLATFORM_ID="platform:f40"
PRETTY_NAME="Fedora Linux 40 (Container Image)"
[...]
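To check which file an environment name would resolve to before submitting a job, a PATH-style lookup can be sketched in the shell. This helper is illustrative and not part of the CE: the function name edf_lookup is hypothetical, and the sketch assumes that the directories in EDF_PATH are searched in the listed order with the first match winning, and that $HOME/.edf is used only when EDF_PATH is unset. The actual lookup rules of the CE may differ.

```shell
# Print the first <name>.toml found along EDF_PATH (colon-separated directories).
# Assumption: first match wins; $HOME/.edf is the fallback when EDF_PATH is unset.
edf_lookup() {
    local name="$1"
    local search_path="${EDF_PATH:-$HOME/.edf}"
    local dir
    local IFS=':'                      # split the search path on colons
    for dir in $search_path; do
        dir="${dir%/}"                 # tolerate a trailing slash
        if [ -f "$dir/$name.toml" ]; then
            printf '%s\n' "$dir/$name.toml"
            return 0
        fi
    done
    return 1                           # not found in any listed directory
}
```

With the directories from the example above, EDF_PATH=$HOME/.edf:$HOME/example-project makes both debian and fedora-env resolvable, and edf_lookup fedora-env prints the path of the matching file.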
The image caching functionality is only available on the Bristen vCluster as a technical preview.
By default, images defined in the EDF as remote registry references (e.g. a Docker reference) are automatically pulled and locally cached. On later use, the cached image is preferred over pulling the image again.
An image cache is automatically created at .edf_imagestore in the user's scratch folder (i.e., $SCRATCH/.edf_imagestore), under which cached images are stored in a subfolder for the corresponding CPU architecture (e.g., x86 and aarch64). Users should regularly remove unused cached images to limit the cache size.
To re-pull a cached image, users have to remove the corresponding image from the cache.
To choose an alternative image store path (e.g., to use a directory owned by a group rather than by an individual user), users can specify an image cache path explicitly by defining the environment variable EDF_IMAGESTORE. EDF_IMAGESTORE must be an absolute path to an existing folder.
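The two requirements on EDF_IMAGESTORE (absolute path, existing folder) can be checked before exporting the variable. The following is a small sketch under stated assumptions: the helper name set_imagestore is hypothetical, and creating the folder when it is missing is a convenience of this sketch, not a behavior of the CE.

```shell
# Validate and export an alternative image store location.
# Hypothetical helper: rejects relative paths, creates the folder if missing.
set_imagestore() {
    case "$1" in
        /*) ;;  # absolute path, as required
        *)  echo "EDF_IMAGESTORE must be an absolute path: $1" >&2
            return 1 ;;
    esac
    mkdir -p "$1" || return 1          # the folder must exist before use
    export EDF_IMAGESTORE="$1"
}
```

For a shared store, the argument would be a group-owned directory, for example set_imagestore /path/to/group/imagestore, where the path is a placeholder.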
If the CE cannot create a directory for the image cache, it operates in cache-free mode, meaning that it pulls an ephemeral image before every container launch and discards it upon termination.
To work with images from the NGC Catalog, please refer also to the next section "Using images from third party registries and private repositories".
To bypass any caching behavior, users can manually pull an image and directly plug it into their EDF. To do so, users may execute enroot import docker://[REGISTRY#]IMAGE[:TAG] to pull container images from OCI registries to the current directory.
For example, the command below pulls an nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 image.
enroot import docker://nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun enroot import docker://nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 13 missing layers...
[INFO] Extracting image layers...
[INFO] Converting whiteouts...
[INFO] Creating squashfs filesystem...
Parallel mksquashfs: Using 64 processors
Creating 4.0 filesystem on /scratch/aistor/<username>/nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh, block size 131072.
Exportable Squashfs 4.0 filesystem, zstd compressed, data block size 131072
        uncompressed data, compressed metadata, compressed fragments,
        compressed xattrs, compressed ids
        duplicates are removed
Filesystem size 9492185.87 Kbytes (9269.71 Mbytes)
        98.93% of uncompressed filesystem size (9594893.12 Kbytes)
Inode table size 128688 bytes (125.67 Kbytes)
        17.47% of uncompressed inode table size (736832 bytes)
Directory table size 132328 bytes (129.23 Kbytes)
        46.42% of uncompressed directory table size (285091 bytes)
Number of duplicate files found 1069
Number of inodes 13010
Number of files 10610
Number of fragments 896
Number of symbolic links 846
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 1554
Number of ids (unique uids + gids) 1
Number of uids 1
        root (0)
Number of gids 1
        root (0)
After the import is complete, images are available in Squashfs format in the current directory and can be used in EDFs, for example:
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ ls -l *.sqsh
-rw-r--r-- 1 <username> csstaff 9720037376 Sep 11 14:46 nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ realpath nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh
/capstor/scratch/cscs/<username>/nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ cat $HOME/.edf/cudnn8.toml
image = "/capstor/scratch/cscs/<username>/nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh"
It is recommended to save images in |
Docker Hub is the default registry from which remote images are imported.
To use an image from a different registry, the corresponding registry URL has to be prepended to the image reference, using a hash character (#) as a separator. For example:
# Usage within an EDF
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ cat $HOME/.edf/nvhpc-23.7.toml
image = "nvcr.io#nvidia/nvhpc:23.7-runtime-cuda11.8-ubuntu22.04"

# Usage on the command line
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun enroot import docker://nvcr.io#nvidia/nvhpc:23.7-runtime-cuda11.8-ubuntu22.04
To import images from private repositories, access credentials should be configured by individual users in the $HOME/.config/enroot/.credentials file, following the netrc file format.
Using the enroot import documentation page as a reference, some examples could be:
# NVIDIA NGC catalog (both endpoints are required)
machine nvcr.io login $oauthtoken password <token>
machine authn.nvidia.com login $oauthtoken password <token>

# DockerHub
machine auth.docker.io login <login> password <password>

# Google Container Registry with OAuth
machine gcr.io login oauth2accesstoken password $(gcloud auth print-access-token)
# Google Container Registry with JSON
machine gcr.io login _json_key password $(jq -c '.' $GOOGLE_APPLICATION_CREDENTIALS | sed 's/ /\\u0020/g')

# Amazon Elastic Container Registry
machine 12345.dkr.ecr.eu-west-2.amazonaws.com login AWS password $(aws ecr get-login-password --region eu-west-2)

# Azure Container Registry with ACR refresh token
machine myregistry.azurecr.io login 00000000-0000-0000-0000-000000000000 password $(az acr login --name myregistry --expose-token --query accessToken | tr -d '"')
# Azure Container Registry with ACR admin user
machine myregistry.azurecr.io login myregistry password $(az acr credential show --name myregistry --subscription mysub --query passwords[0].value | tr -d '"')

# Github.com Container Registry (GITHUB_TOKEN needs read:packages scope)
machine ghcr.io login <username> password <GITHUB_TOKEN>

# GitLab Container Registry (GITLAB_TOKEN needs a scope with read access to the container registry)
# GitLab instances often use different domains for the registry and the authentication service, respectively
# Two separate credential entries are required in such cases, for example:
# Gitlab.com
machine registry.gitlab.com login <username> password <GITLAB TOKEN>
machine gitlab.com login <username> password <GITLAB TOKEN>

# ETH Zurich GitLab registry
machine registry.ethz.ch login <username> password <GITLAB_TOKEN>
machine gitlab.ethz.ch login <username> password <GITLAB_TOKEN>
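Since the credentials file stores tokens and passwords in plain text, it is reasonable to make it readable by its owner only. This is a general precaution rather than a requirement stated for the CE; the helper name init_enroot_credentials and its optional directory argument are hypothetical.

```shell
# Create the enroot credentials file if it is missing and restrict it to the owner.
# Hypothetical helper; the default directory matches the path given above.
init_enroot_credentials() {
    local cred_dir="${1:-$HOME/.config/enroot}"
    mkdir -p "$cred_dir" || return 1
    touch "$cred_dir/.credentials" || return 1
    # Owner read/write only: the file holds secrets in plain text.
    chmod 600 "$cred_dir/.credentials"
}
```

After running init_enroot_credentials, the machine entries shown above can be added to the file with a text editor.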
Annotations define arbitrary metadata for containers in the form of key-value pairs. Within the EDF, annotations are designed to be similar in appearance and behavior to those defined by the OCI Runtime Specification. Annotation keys usually express a hierarchical namespace structure, with domains separated by "." (full stop) characters.
As annotations are often used to control hooks, they have a deep nesting level. For example, to execute the SSH hook described below, the annotation com.hooks.ssh.enabled must be set to the string "true".
EDF files support setting annotations through the annotations table. This can be done in multiple ways in TOML: for example, both of the following usages are equivalent:
Case: nest levels in the TOML key.
[annotations]
com.hooks.ssh.enabled = "true"
Case: nest levels in the TOML table name.
[annotations.com.hooks.ssh]
enabled = "true"
To avoid mistakes, notice a few key features of TOML:
Tables are nested by listing their parents in the table name using dot notation (so the previous example defines the table ssh inside hooks, which in turn is inside com, which is inside annotations).
An assignment to a dotted key implicitly defines subtables: assigning a string to the com.hooks.ssh.enabled attribute within the [annotations] table is exactly equivalent to assigning to the enabled attribute within the [annotations.com.hooks.ssh] subtable.
Attributes can be added to a table in only one place in the file. The following two fragments are valid:
[annotations.com.hooks.ssh]
authorize_ssh_key = "/capstor/scratch/cscs/<username>/tests/edf/authorized_keys"
enabled = "true"

[annotations]
com.hooks.ssh.authorize_ssh_key = "/capstor/scratch/cscs/<username>/tests/edf/authorized_keys"
com.hooks.ssh.enabled = "true"
The following fragment is invalid:
[annotations]
com.hooks.ssh.authorize_ssh_key = "/capstor/scratch/cscs/<username>/tests/edf/authorized_keys"
[annotations.com.hooks.ssh]
enabled = "true"
It is rejected because the ssh table is defined (gets attributes set) both in the [annotations] and in the [annotations.com.hooks.ssh] sections. See the TOML format spec for more details.
The Container Engine leverages components from the NVIDIA Container Toolkit to expose NVIDIA GPU devices inside containers.
GPU device files are always mounted in containers, and the NVIDIA driver user space components are mounted if the NVIDIA_VISIBLE_DEVICES environment variable is not empty, unset or set to "void". NVIDIA_VISIBLE_DEVICES is already set in container images officially provided by NVIDIA to enable all GPUs available on the host system. Such images are frequently used to containerize CUDA applications, either directly or as a base for custom images, thus in many cases no action is required to access GPUs.
For example, on a cluster with 4 GH200 devices per compute node:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat .edf/cuda12.5.1.toml
image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=cuda12.5.1 nvidia-smi
Thu Oct 26 17:59:36 2023
+------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03     Driver Version: 535.129.03     CUDA Version: 12.5        |
|--------------------------------------+----------------------+----------------------+
| GPU  Name              Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf        Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                      |                      |               MIG M. |
|======================================+======================+======================|
|   0  GH200 120GB                  On | 00000009:01:00.0 Off |                    0 |
| N/A   24C    P0           89W / 900W |     37MiB / 97871MiB |      0%   E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+
|   1  GH200 120GB                  On | 00000019:01:00.0 Off |                    0 |
| N/A   24C    P0           87W / 900W |     37MiB / 97871MiB |      0%   E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+
|   2  GH200 120GB                  On | 00000029:01:00.0 Off |                    0 |
| N/A   24C    P0           83W / 900W |     37MiB / 97871MiB |      0%   E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+
|   3  GH200 120GB                  On | 00000039:01:00.0 Off |                    0 |
| N/A   24C    P0           85W / 900W |     37MiB / 97871MiB |      0%   E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+

+------------------------------------------------------------------------------------+
| Processes:                                                                         |
|  GPU   GI   CI        PID   Type   Process name                         GPU Memory |
|        ID   ID                                                          Usage      |
|====================================================================================|
|  No running processes found                                                        |
+------------------------------------------------------------------------------------+
It is possible to use environment variables to control which capabilities of the NVIDIA driver are enabled inside containers.
Additionally, the NVIDIA Container Toolkit can enforce specific constraints for the container, for example, on versions of the CUDA runtime or driver, or on the architecture of the GPUs.
For the full details about using these features, please refer to the official documentation: Driver Capabilities, Constraints.
The Container Engine provides a hook to allow containers relying on libfabric to leverage the HPE Slingshot 11 high-speed interconnect. This component is commonly referred to as the "CXI hook", taking its name from the CXI libfabric provider required to interface with Slingshot 11.
The hook bind-mounts the custom host libfabric library into the container, together with all the required dependency libraries and devices.
If a libfabric library is already present in the container filesystem (for example, it's provided by the image), it is replaced with its host counterpart; otherwise the host libfabric is just added to the container.
Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from usage of the hook. Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also this section for more details).
The hook is activated by setting the com.hooks.cxi.enabled annotation, which can be defined in the EDF, as shown in the following example:
# Without the CXI hook
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/osu-mb.toml
image = "quay.io#madeeks/osu-mb:6.2-mpich4.1-ubuntu22.04-arm64"

[annotations]
com.hooks.cxi.enabled = "false"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -N2 --mpi=pmi2 --environment=osu-mb ./osu_bw
# OSU MPI Bandwidth Test v6.2
# Size      Bandwidth (MB/s)
1           0.22
2           0.40
4           0.90
8           1.82
16          3.41
32          6.81
64          13.18
128         26.74
256         11.95
512         38.06
1024        39.65
2048        83.22
4096        156.14
8192        143.08
16384       53.78
32768       106.77
65536       49.88
131072      871.86
262144      780.97
524288      694.58
1048576     831.02
2097152     1363.30
4194304     1279.54

# With the CXI hook enabling access to the Slingshot high-speed network
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat .edf/osu-mb-cxi.toml
image = "quay.io#madeeks/osu-mb:6.2-mpich4.1-ubuntu22.04"

[annotations]
com.hooks.cxi.enabled = "true"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -N2 --mpi=pmi2 --environment=osu-mb-cxi ./osu_bw
# OSU MPI Bandwidth Test v6.2
# Size      Bandwidth (MB/s)
1           1.21
2           2.32
4           4.85
8           8.38
16          19.36
32          38.47
64          76.28
128         151.76
256         301.25
512         604.17
1024        1145.03
2048        2367.25
4096        4817.16
8192        8633.36
16384       16971.18
32768       18740.55
65536       21978.65
131072      22962.31
262144      23436.78
524288      23672.92
1048576     23827.78
2097152     23890.95
4194304     23925.61
On several vClusters, the CXI hook for Slingshot connectivity is enabled implicitly by default or by other hooks.
Container hooks let you customize container behavior to fit system-specific needs, making them especially valuable for High-Performance Computing.
This section outlines all hooks supported in production by the Container Engine. However, specific Alps vClusters may support only a subset or use custom configurations. For details about available features in individual vClusters, consult platform documentation or contact CSCS support.
The AWS OFI NCCL plugin is a software extension that allows the NCCL and RCCL libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the com.hooks.aws_ofi_nccl.variant annotation is used to specify a plugin variant suitable for a given container image.
At the time of writing, 4 plugin variants are configured: cuda11, cuda12 (to be used on NVIDIA GPU nodes), rocm5, and rocm6 (to be used on AMD GPU nodes alongside RCCL).
For example, the following EDF enables the hook and uses it to mount the plugin in a CUDA 11 image:
image = "nvcr.io#nvidia/pytorch:22.12-py3"
mounts = ["/capstor/scratch/cscs/amadonna:/capstor/scratch/cscs/amadonna"]
entrypoint = false

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda11"
The AWS OFI NCCL hook also takes care of the following aspects:
The NCCL_NET_PLUGIN variable is set to force NCCL to load the specific network plugin mounted by the hook. This is useful because certain container images (for example, those from NGC repositories) might already ship with a default NCCL plugin. Other environment variables help prevent application stalls and improve performance when using GPUDirect for RDMA communication.
The SSH hook runs a lightweight, statically-linked SSH server (a build of Dropbear) inside the container. It can be useful to add SSH connectivity to containers (for example, enabling remote debugging) without bundling an SSH server into the container image or creating ad-hoc image variants for such purposes.
The com.hooks.ssh.authorize_ssh_key annotation allows the authorization of a custom public SSH key for remote connections. The annotation value must be the absolute path to a text file containing the public key (just the public key without any extra signature/certificate). After the container starts, it is possible to get a remote shell inside the container by connecting with SSH to the listening port.
By default, the server started by the SSH hook listens on port 15263, but this setting can be controlled through the com.hooks.ssh.port annotation in the EDF.
The container must be kept writable (true by default) to use the hook.
The following EDF file shows an example of enabling the SSH hook and authorizing a user-provided public key:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/ubuntu-ssh.toml
image = "ubuntu:latest"

[annotations.com.hooks.ssh]
enabled = "true"
authorize_ssh_key = "<public key file>"
Using the previous EDF, a container can be started as follows. Notice that the --pty option for the srun command is currently required in order for the hook to initialize properly:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=ubuntu-ssh --pty <command>
While the container is running, it's possible to connect to it from a remote host using a private key matching the public one authorized in the EDF annotation. For example, on a host where such a private key is the default identity file, the following command could be used:
ssh -p 15263 <host-of-container>
In order to establish connections through Visual Studio Code Remote - SSH extension, the |
On several Alps vClusters, NVIDIA GPUs by default operate in "Exclusive process" mode, that is, the CUDA driver is configured to allow only one process at a time to use a given GPU.
For example, on a node with 4 GPUs, a maximum of 4 CUDA processes can run at the same time:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ nvidia-smi -L
GPU 0: GH200 120GB (UUID: GPU-...)
GPU 1: GH200 120GB (UUID: GPU-...)
GPU 2: GH200 120GB (UUID: GPU-...)
GPU 3: GH200 120GB (UUID: GPU-...)

# This EDF uses the CUDA vector addition sample from NVIDIA's NGC catalog
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/vectoradd-cuda.toml
image = "nvcr.io#nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"

# 4 processes run successfully
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -t2 -N1 -n4 --environment=vectoradd-cuda /cuda-samples/vectorAdd | grep "Test PASSED"
Test PASSED
Test PASSED
Test PASSED
Test PASSED

# More than 4 concurrent processes result in oversubscription errors
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -t2 -N1 -n5 --environment=vectoradd-cuda /cuda-samples/vectorAdd | grep "Test PASSED"
Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
srun: error: [...]
[...]
In order to run multiple processes concurrently on the same GPU (one example could be running multiple MPI ranks on the same device), the NVIDIA CUDA Multi-Process Service (or MPS, for short) must be started on the compute node.
The Container Engine provides a hook to automatically manage the setup and removal of the NVIDIA CUDA MPS components within containers.
The hook can be activated by setting the com.hooks.nvidia_cuda_mps.enabled annotation to the string "true".
The container must be kept writable (true by default) to use the hook.
The following is an example of using the NVIDIA CUDA MPS hook:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/vectoradd-cuda-mps.toml
image = "nvcr.io#nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"

[annotations]
com.hooks.nvidia_cuda_mps.enabled = "true"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -t2 -N1 -n8 --environment=vectoradd-cuda-mps /cuda-samples/vectorAdd | grep "Test PASSED" | wc -l
8
When using the NVIDIA CUDA MPS hook it is not necessary to use other wrappers or scripts to manage the Multi-Process Service, as is documented for native jobs on some vClusters.
EDF files use the TOML format. For details about the data types used by the different parameters, please refer to the TOML spec webpage.
Parameter | Description
---|---
base_environment | Ordered list of EDFs that this file inherits from. Parameters from listed environments are evaluated sequentially. Supports up to 10 levels of recursion.
image | The container image to use. Can reference a remote Docker/OCI registry or a local Squashfs file as a filesystem path.
workdir | Initial working directory when the container starts.
entrypoint | If [...]
writable | If [...]
mounts | List of bind mounts in the format [...]
env | Environment variables to set in the container. Null-string values will unset the variable.
annotations | [...]
Environment variable expansion and relative paths expansion are only available on the Bristen vCluster as a technical preview.
Environment variable expansion allows for dynamic substitution of environment variable values within the EDF (Environment Definition File). This capability applies across all configuration parameters in the EDF, providing flexibility in defining container environments.
Use ${VAR} to reference an environment variable VAR. The variable's value is resolved from the combined environment, which includes variables defined in the host and the container image, the latter taking precedence.
For example, ${SCRATCH} can be used in mounts to reference a directory path.
To handle undefined variables safely, use the syntax ${VAR:-}, which evaluates to an empty string if VAR is undefined.
To prevent expansion, use double dollar signs $$. For example, $${VAR} will render as the literal string ${VAR}.
Variables defined within the [env] EDF table cannot reference other entries from [env] tables in the same or other EDF files (e.g. the ones entered as base environments). Therefore, only environment variables from the host can be referenced.
[...] [env].
Relative filesystem paths can be used within EDF parameters, and will be expanded by the CE at runtime. The paths are interpreted as relative to the working directory of the process calling the CE, not to the location of the EDF file.
Alpine Linux is incompatible with some hooks, causing errors when used with Slurm. For example,
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat alpine.toml
image = "alpine:3.19"
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -lN1 --environment=alpine.toml echo "abc"
0: slurmstepd: error: pyxis: container start failed with error code: 1
0: slurmstepd: error: pyxis: printing enroot log file:
0: slurmstepd: error: pyxis: [ERROR] Failed to refresh the dynamic linker cache
0: slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/87-slurm.sh exited with return code 1
0: slurmstepd: error: pyxis: couldn't start container
0: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
0: slurmstepd: error: Failed to invoke spank plugin stack
This is because some hooks (e.g., Slurm and CXI hooks) leverage ldconfig (from Glibc) when they bind-mount host libraries inside containers; since Alpine Linux provides an alternative ldconfig (from Musl Libc), it does not work as intended by hooks. As a workaround, users may disable problematic hooks. For example,
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat alpine_workaround.toml
image = "alpine:3.19"

[annotations]
com.hooks.slurm.enabled = "false"
com.hooks.cxi.enabled = "false"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -lN1 --environment=alpine_workaround.toml echo "abc"
abc
Notice the section [annotations] disabling Slurm and CXI hooks.
We're aware of an issue that prevents containers with open-source MPICH libraries from using shared memory when launching multiple ranks per node. We have identified the cause of the issue and are working to apply a fix.
The problem usually manifests with the following error message:
Abort(73000719) on node 5: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306)..........: MPI_Init(argc=0xffffeba3326c, argv=0xffffeba33260) failed
MPII_Init_thread(265).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(633).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
The issue should affect only intra-node communication, not inter-node communication. In other words, running 1 MPI rank per node should work correctly.
A temporary workaround consists of ensuring that the PMI_RANK environment variable inside the container corresponds to the actual MPI rank of the process within the job step.
This can be achieved by wrapping the original command for the container within a bash command as shown below:
srun <srun options> --environment=<name or path> bash -c 'export PMI_RANK=${SLURM_PROCID}; <regular container command>'
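When the same workaround is needed for several commands, it can be packaged as a small wrapper script. The following is a sketch under stated assumptions: the file name pmi_rank_wrapper.sh is hypothetical, the script is written to $SCRATCH (or to the current directory if SCRATCH is unset), and that location must also be mounted inside the container for the script to be reachable there. As with the command above, the container image needs to provide bash.

```shell
# Write a wrapper that sets PMI_RANK from SLURM_PROCID, then runs the real command.
# Assumptions: the file name, and a target directory visible inside the container.
wrapper_dir="${SCRATCH:-$PWD}"
cat > "$wrapper_dir/pmi_rank_wrapper.sh" <<'EOF'
#!/bin/bash
# Align the PMI rank with the Slurm task rank of this process.
export PMI_RANK="${SLURM_PROCID:?SLURM_PROCID is not set; run this under srun}"
# Replace the wrapper with the actual container command and its arguments.
exec "$@"
EOF
chmod +x "$wrapper_dir/pmi_rank_wrapper.sh"
```

The job step is then launched as srun <srun options> --environment=<name or path> $SCRATCH/pmi_rank_wrapper.sh <regular container command>.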