The Container Engine (CE) toolset is designed to enable computing jobs to seamlessly run inside Linux application containers, thus providing support for containerized user environments.
Concept
Container images effectively encapsulate a software stack; however, to become usable HPC computing environments they often require customization of bind mounts, environment variables, working directory, hooks, plugins, and so on.
To bridge this gap, the Container Engine (CE) toolset supports the specification of user environments through Environment Definition Files.
An Environment Definition File (EDF) is a text file in the TOML format which describes, in a declarative and prescriptive fashion, a computing environment based on a container image. Users can create their own custom environments and can share, edit, or build upon existing ones.
The Container Engine (CE) toolset leverages its tight integration with the Slurm workload manager to parse EDFs directly from the command line or batch script and to instantiate containerized user environments in a seamless and transparent way.
Through the EDF, container use cases can be abstracted to the point where end users perform their workflows as if they were operating natively on the computing system.
Key Benefits:
- Freedom: containers give users full control of the user space; they can decide what to install without involving a system administrator.
- Reproducibility: workloads consistently run in the same environment, ensuring uniformity across experimental runs.
- Portability: the self-contained nature of containers simplifies deployment across architecture-compatible HPC systems.
- Seamless access to HPC resources: the CE facilitates native access to specialized HPC resources like GPUs, interconnects, and other system-specific tools crucial for performance.
Quickstart
- Let's create a simple environment based on the official Docker Hub image for Ubuntu 24.04, using the user's scratch folder as the initial working directory.
An Environment Definition File (EDF) which describes this use case could have the following contents. Let's write such a file in the $HOME/.edf directory, which is the default location of EDF files for each user, calling it ubuntu.toml.
image = "library/ubuntu:24.04" mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"] workdir = "/capstor/scratch/cscs/<username>"
- The image parameter is a basic string representing an image reference on a registry, or a path to a local image file.
- The mounts parameter is a TOML array of strings representing bind mounts: every string must specify at least a source path (on the host) and a destination path (inside the container), separated by a colon.
- The workdir parameter is a basic string representing the path of the initial working directory in the container.
- We can now run within this environment by using the --environment option in the Slurm command line. Since the file is located in the EDF search path, the filename can be passed to the option without the file extension:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=ubuntu --pty bash
<username>@<node name>:/capstor/scratch/cscs/<username>$ pwd
/capstor/scratch/cscs/<username>
<username>@<node name>:/capstor/scratch/cscs/<username>$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
<username>@<node name>:/capstor/scratch/cscs/<username>$ exit
[<vcluster>][<username>@<vcluster>-ln001 ~]$
```
Notice that the image pull and the container start happened automatically.
Running containerized environments
A job is run in a containerized environment by passing the --environment option to the srun or salloc Slurm commands. The option takes one mandatory argument, which is the absolute path to the EDF describing the environment in which the job should be executed, for example:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=$SCRATCH/edf/debian.toml cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```
If a file is located in the EDF search path, the argument to the command line option can be just the environment name, that is, the name of the file without the .toml extension, for example:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=debian cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```
Using containerized environments in batch scripts
In principle, the --environment option can also be used within batch scripts as an #SBATCH option.
It is important to note that in such a case all the contents of the script are executed within the containerized environment: the CE toolset gives access to the Slurm workload manager within containers via the Slurm hook (controlled by the ENROOT_SLURM_HOOK environment variable, and activated by default on most vClusters). Only when this hook is active will calls to Slurm commands (for example srun or scontrol) within the batch script work.
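For illustration, a batch script along the following lines (a minimal sketch; it assumes the debian EDF from the examples above is in the search path and that the Slurm hook is active) would execute its entire body inside the containerized environment:

```
#!/bin/bash -l
#SBATCH --job-name=edf-sbatch-example
#SBATCH --time=0:01:00
#SBATCH --nodes=1
#SBATCH --environment=debian

# Everything below runs inside the "debian" containerized environment
cat /etc/os-release

# Launching job steps from here relies on the Slurm hook being active in the container
srun hostname
```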
For the time being, if the script needs to invoke Slurm commands, the recommended approach is to use --environment as part of the commands, for example when launching job steps:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat example.sbatch
#!/bin/bash -l
#SBATCH --job-name=edf-example
#SBATCH --time=0:01:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=<vcluster>
#SBATCH --output=slurm-%x.out

# Run job step
srun --environment=debian cat /etc/os-release
```
The EDF search path
By default, the EDFs for each user are looked up in $HOME/.edf. The search path for EDFs can be controlled through the EDF_PATH environment variable. EDF_PATH must be a colon-separated list of absolute paths to directories where the CE looks for TOML files, similarly to the PATH and LD_LIBRARY_PATH variables. If a file is located in the search path, its name can be used in --environment options without the .toml extension, for example:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ ls -l ~/.edf
total 8
-rw-r--r-- 1 <username> csstaff 27 Sep  6 15:19 debian.toml

[<vcluster>][<username>@<vcluster>-ln001 ~]$ ls -l ~/example-project/
total 4
-rw-r-----+ 1 <username> csstaff 28 Oct 26 17:44 fedora-env.toml

[<vcluster>][<username>@<vcluster>-ln001 ~]$ export EDF_PATH=$HOME/example-project/

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=fedora-env cat /etc/os-release
NAME="Fedora Linux"
VERSION="40 (Container Image)"
ID=fedora
VERSION_ID=40
VERSION_CODENAME=""
PLATFORM_ID="platform:f40"
PRETTY_NAME="Fedora Linux 40 (Container Image)"
[...]
```
Pulling images explicitly
To work with images from the NGC Catalog, please refer also to the next section "Using images from third party registries and private repositories".
Using registry reference strings (for example library/ubuntu:22.04) to indicate images in EDFs results in each job importing the container image at the start and discarding it upon termination.
The CE caches layers to minimize download times; however, when using large images, the time required to complete the import process can still be noticeable. In such cases, it is convenient to pull an image explicitly beforehand and refer to it in the EDF using a filesystem path. By doing so, the containerized environment can start quickly, reusing the image immediately without having to retrieve it again.
Images need to be pulled explicitly using the enroot utility and its import command. The format to pull container images from Docker and OCI registries is enroot import docker://[REGISTRY#]IMAGE[:TAG], for example:
```
enroot import docker://nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
```
After the import is complete, images are available in Squashfs format in the current directory and can be used in EDFs, for example:
```
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ ls -l *.sqsh
-rw-r--r-- 1 <username> csstaff 9720037376 Sep 11 14:46 nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh

[<vcluster>][<username>@<vcluster>-ln001 <username>]$ realpath nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh
/capstor/scratch/cscs/<username>/nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh

[<vcluster>][<username>@<vcluster>-ln001 <username>]$ cat $HOME/.edf/cudnn8.toml
image = "/capstor/scratch/cscs/<username>/nvidia+cuda+11.8.0-cudnn8-devel-ubuntu22.04.sqsh"
```
Using images from third party registries and private repositories
Docker Hub is the default registry from which remote images are imported.
To use an image from a different registry, the corresponding registry URL has to be prepended to the image reference, using a hash character (#) as separator. For example:
```
# Usage within an EDF
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ cat $HOME/.edf/nvhpc-23.7.toml
image = "nvcr.io#nvidia/nvhpc:23.7-runtime-cuda11.8-ubuntu22.04"

# Usage on the command line
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun enroot import docker://nvcr.io#nvidia/nvhpc:23.7-runtime-cuda11.8-ubuntu22.04
```
To import images from private repositories, access credentials should be configured by individual users in the $HOME/.config/enroot/.credentials file, following the netrc file format.
Using the enroot import documentation page as reference, some examples could be:
```
# NVIDIA NGC catalog (both endpoints are required)
machine nvcr.io login $oauthtoken password <token>
machine authn.nvidia.com login $oauthtoken password <token>

# DockerHub
machine auth.docker.io login <login> password <password>

# Google Container Registry with OAuth
machine gcr.io login oauth2accesstoken password $(gcloud auth print-access-token)
# Google Container Registry with JSON
machine gcr.io login _json_key password $(jq -c '.' $GOOGLE_APPLICATION_CREDENTIALS | sed 's/ /\\u0020/g')

# Amazon Elastic Container Registry
machine 12345.dkr.ecr.eu-west-2.amazonaws.com login AWS password $(aws ecr get-login-password --region eu-west-2)

# Azure Container Registry with ACR refresh token
machine myregistry.azurecr.io login 00000000-0000-0000-0000-000000000000 password $(az acr login --name myregistry --expose-token --query accessToken | tr -d '"')
# Azure Container Registry with ACR admin user
machine myregistry.azurecr.io login myregistry password $(az acr credential show --name myregistry --subscription mysub --query passwords[0].value | tr -d '"')

# Github.com Container Registry (GITHUB_TOKEN needs read:packages scope)
machine ghcr.io login <username> password <GITHUB_TOKEN>

# GitLab Container Registry (GITLAB_TOKEN needs a scope with read access to the container registry)
# GitLab instances often use different domains for the registry and the authentication service, respectively
# Two separate credential entries are required in such cases, for example:

# Gitlab.com
machine registry.gitlab.com login <username> password <GITLAB TOKEN>
machine gitlab.com login <username> password <GITLAB TOKEN>

# ETH Zurich GitLab registry
machine registry.ethz.ch login <username> password <GITLAB_TOKEN>
machine gitlab.ethz.ch login <username> password <GITLAB_TOKEN>
```
Annotations
Annotations define arbitrary metadata for containers in the form of key-value pairs. Within the EDF, annotations are designed to be similar in appearance and behavior to those defined by the OCI Runtime Specification. Annotation keys usually express a hierarchical namespace structure, with domains separated by "." (full stop) characters.
Annotations are often used to enable or control a hook, and are deeply nested. For example, to execute the SSH hook described below, the annotation com.hooks.ssh.enabled must be set to the string true.
EDF files support setting annotations through the annotations table. This can be done in multiple ways in TOML: for example, both
```
[annotations]
com.hooks.ssh.enabled = "true"
```
and
```
[annotations.com.hooks.ssh]
enabled = "true"
```
are valid syntax and have the same effect.
It is important to understand a few features of TOML to avoid mistakes:
- All assignments of properties belong to the section just before them (the statement in square brackets), which defines the table they refer to.
- Tables do not automatically belong to the tables declared before them; to nest tables, their name has to list their parents using the dot notation (so the previous example defines the table ssh inside hooks, which in turn is inside com, which is inside annotations).
- An assignment can implicitly define subtables if the key being assigned is a dotted list. As a reference, see the examples made earlier in this section, where assigning a string to the com.hooks.ssh.enabled attribute within the [annotations] table is exactly equivalent to assigning to the enabled attribute within the [annotations.com.hooks.ssh] subtable.
- Attributes can be added to a table only in one place in the TOML file, i.e. each table definition must take place in a single square bracket section. This means that while

```
[annotations]
com.hooks.ssh.authorize_ssh_key = "/capstor/scratch/cscs/<username>/tests/edf/authorized_keys"
com.hooks.ssh.enabled = "true"
```

and

```
[annotations.com.hooks.ssh]
authorize_ssh_key = "/capstor/scratch/cscs/<username>/tests/edf/authorized_keys"
enabled = "true"
```

are both fine,

```
[annotations]
com.hooks.ssh.authorize_ssh_key = "/capstor/scratch/cscs/<username>/tests/edf/authorized_keys"

[annotations.com.hooks.ssh]
enabled = "true"
```

is not, because the ssh table is defined (gets attributes set) both in the [annotations] and in the [annotations.com.hooks.ssh] sections.

See the TOML format spec for more details.
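As a further illustration (a hedged sketch; the port value is arbitrary), annotations for two different hooks can coexist within a single [annotations] table using dotted keys, since no sub-table is then defined in more than one place:

```
[annotations]
# Enable the CXI hook for Slingshot connectivity
com.hooks.cxi.enabled = "true"
# Enable the SSH hook and move its server to a non-default port
com.hooks.ssh.enabled = "true"
com.hooks.ssh.port = "15264"
```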
Accessing native resources
NVIDIA GPUs
The Container Engine leverages components from the NVIDIA Container Toolkit to expose NVIDIA GPU devices inside containers.
GPU device files are always mounted in containers, and the NVIDIA driver user space components are mounted unless the NVIDIA_VISIBLE_DEVICES environment variable is empty, unset, or set to "void". NVIDIA_VISIBLE_DEVICES is already set in container images officially provided by NVIDIA to enable all GPUs available on the host system. Such images are frequently used to containerize CUDA applications, either directly or as a base for custom images, thus in many cases no action is required to access GPUs.
For example, on a cluster with 4 GH200 devices per compute node:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat .edf/cuda12.5.1.toml
image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=cuda12.5.1 nvidia-smi
Thu Oct 26 17:59:36 2023
+------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03              Driver Version: 535.129.03    CUDA Version: 12.5 |
|--------------------------------------+----------------------+----------------------+
| GPU  Name          Persistence-M     | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf   Pwr:Usage/Cap     |         Memory-Usage | GPU-Util  Compute M. |
|                                      |                      |               MIG M. |
|======================================+======================+======================|
|   0  GH200 120GB              On     | 00000009:01:00.0 Off |                    0 |
| N/A  24C    P0         89W / 900W    |   37MiB / 97871MiB   |     0%    E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+
|   1  GH200 120GB              On     | 00000019:01:00.0 Off |                    0 |
| N/A  24C    P0         87W / 900W    |   37MiB / 97871MiB   |     0%    E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+
|   2  GH200 120GB              On     | 00000029:01:00.0 Off |                    0 |
| N/A  24C    P0         83W / 900W    |   37MiB / 97871MiB   |     0%    E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+
|   3  GH200 120GB              On     | 00000039:01:00.0 Off |                    0 |
| N/A  24C    P0         85W / 900W    |   37MiB / 97871MiB   |     0%    E. Process |
|                                      |                      |             Disabled |
+--------------------------------------+----------------------+----------------------+

+------------------------------------------------------------------------------------+
| Processes:                                                                          |
|  GPU   GI   CI        PID   Type   Process name                         GPU Memory |
|        ID   ID                                                          Usage      |
|====================================================================================|
|  No running processes found                                                        |
+------------------------------------------------------------------------------------+
```
It is possible to use environment variables to control which capabilities of the NVIDIA driver are enabled inside containers.
Additionally, the NVIDIA Container Toolkit can enforce specific constraints for the container, for example on versions of the CUDA runtime or driver, or on the architecture of the GPUs.
For the full details about using these features, please refer to the official documentation: Driver Capabilities, Constraints.
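As an illustrative sketch (the values below are examples; refer to the NVIDIA documentation above for the supported capability and constraint strings), such variables can be set through the env table of an EDF:

```
image = "nvidia/cuda:12.5.1-devel-ubuntu24.04"

[env]
# Expose only the compute and utility driver components inside the container
NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
# Example constraint: require a driver that supports at least CUDA 12.0
NVIDIA_REQUIRE_CUDA = "cuda>=12.0"
```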
HPE Slingshot interconnect
The Container Engine provides a hook to allow containers relying on libfabric to leverage the HPE Slingshot 11 high-speed interconnect. This component is commonly referred to as the "CXI hook", taking its name from the CXI libfabric provider required to interface with Slingshot 11.
The hook works by bind mounting into the container a custom libfabric library from the host, plus all the required dependency libraries and devices.
If a libfabric library is already present in the container filesystem (for example, it's provided by the image), it is replaced with its host counterpart, otherwise the host libfabric is just added to the container.
Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI), or could be dynamically available at runtime (as is the case with NCCL - see also this section for more details).
The hook is activated by setting the com.hooks.cxi.enabled annotation, which can be defined in the EDF, as shown in the following example:
```
# Without the CXI hook
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/osu-mb.toml
image = "quay.io#madeeks/osu-mb:6.2-mpich4.1-ubuntu22.04-arm64"

[annotations]
com.hooks.cxi.enabled = "false"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -N2 --mpi=pmi2 --environment=osu-mb ./osu_bw
# OSU MPI Bandwidth Test v6.2
# Size      Bandwidth (MB/s)
1                       0.22
2                       0.40
4                       0.90
8                       1.82
16                      3.41
32                      6.81
64                     13.18
128                    26.74
256                    11.95
512                    38.06
1024                   39.65
2048                   83.22
4096                  156.14
8192                  143.08
16384                  53.78
32768                 106.77
65536                  49.88
131072                871.86
262144                780.97
524288                694.58
1048576               831.02
2097152              1363.30
4194304              1279.54

# With the CXI hook enabling access to the Slingshot high-speed network
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat .edf/osu-mb-cxi.toml
image = "quay.io#madeeks/osu-mb:6.2-mpich4.1-ubuntu22.04"

[annotations]
com.hooks.cxi.enabled = "true"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -N2 --mpi=pmi2 --environment=osu-mb-cxi ./osu_bw
# OSU MPI Bandwidth Test v6.2
# Size      Bandwidth (MB/s)
1                       1.21
2                       2.32
4                       4.85
8                       8.38
16                     19.36
32                     38.47
64                     76.28
128                   151.76
256                   301.25
512                   604.17
1024                 1145.03
2048                 2367.25
4096                 4817.16
8192                 8633.36
16384               16971.18
32768               18740.55
65536               21978.65
131072              22962.31
262144              23436.78
524288              23672.92
1048576             23827.78
2097152             23890.95
4194304             23925.61
```
On several vClusters, the CXI hook for Slingshot connectivity is enabled implicitly by default or by other hooks.
Therefore, in many cases it is not necessary to enter the enabling annotation in the EDF.
Container Hooks
This section documents the complete set of hooks supported in production by the Container Engine.
The various Alps vClusters might deploy only a subset of these hooks or adopt differing configurations.
Please refer to individual platform documentation or contact CSCS support for more details about the features available on specific vClusters.
AWS OFI NCCL Hook
The AWS OFI NCCL plugin is a software extension which allows the NCCL and RCCL libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the com.hooks.aws_ofi_nccl.variant annotation is used to specify a plugin variant suitable for a given container image. At the time of writing, four plugin variants are configured: cuda11 and cuda12 (to be used on NVIDIA GPU nodes), and rocm5 and rocm6 (to be used on AMD GPU nodes alongside RCCL).
For example, the following EDF enables the hook and uses it to mount the plugin in a CUDA 11 image:
image = "nvcr.io/nvidia/pytorch:22.12-py3" mounts = ["/capstor/scratch/cscs/amadonna:/capstor/scratch/cscs/amadonna"] entrypoint = false [annotations] com.hooks.aws_ofi_nccl.enabled = "true" com.hooks.aws_ofi_nccl.variant = "cuda11"
The AWS OFI NCCL hook also takes care of the following aspects:
- Implicitly enable the CXI hook, therefore exposing the Slingshot interconnect to container applications. In other words, when enabling the AWS OFI NCCL hook it's not necessary to also enable the CXI hook in the EDF.
- Set environment variables to control the behavior of NCCL and the libfabric CXI provider for Slingshot. In particular, the NCCL_NET_PLUGIN variable is set to force NCCL to load the specific network plugin mounted by the hook. This is useful because certain container images (for example those from NGC repositories) might already ship with a default NCCL plugin. Other environment variables help prevent application stalls and improve performance when using GPUDirect for RDMA communication.
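As a quick way to verify that the plugin injected by the hook is actually picked up, the EDF can additionally set NCCL's standard NCCL_DEBUG variable (a sketch; the image and variant are taken from the example above):

```
image = "nvcr.io#nvidia/pytorch:22.12-py3"

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda11"

[env]
# NCCL_DEBUG=INFO makes NCCL log, at startup, which network plugin it loads
NCCL_DEBUG = "INFO"
```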
SSH Hook
The SSH hook runs a lightweight, statically-linked SSH server (a build of Dropbear) inside the container. It can be useful to add SSH connectivity to containers (enabling for example remote debugging) without bundling an SSH server into the container image, or creating ad-hoc image variants for such purpose.
The com.hooks.ssh.authorize_ssh_key annotation allows authorizing a custom public SSH key for remote connections. The annotation value must be the absolute path to a text file which contains the public key (just the public key, without any extra signature/certificate). After the container starts, it is possible to get a remote shell inside the container by connecting with SSH to the listening port.
By default, the server started by the SSH hook listens on port 15263, but this setting can be controlled through the com.hooks.ssh.port annotation in the EDF.
It is required to keep the container writable to be able to use the hook.
The following EDF file shows an example enabling the SSH hook and authorizing a user-provided public key:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/ubuntu-ssh.toml
image = "ubuntu:latest"
writable = true

[annotations.com.hooks.ssh]
enabled = "true"
authorize_ssh_key = "<public key file>"
```
Using the previous EDF, a container can be started as follows. Notice that the --pty option for the srun command is currently required in order for the hook to initialize properly:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun --environment=ubuntu-ssh --pty <command>
```
While the container is running, it's possible to connect to it from a remote host using a private key matching the public one authorized in the EDF annotation. For example, on a host where such a private key is the default identity file, the following command could be used:
```
ssh -p 15263 <host-of-container>
```
In order to establish connections through the Visual Studio Code Remote - SSH extension, the scp program must be available within the container. This is required to transfer and set up the VS Code Server in the remote container.
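As a convenience (a sketch with placeholder values: host name, user, and key path must be adapted), an entry along the following lines in the local ~/.ssh/config lets both plain ssh and the VS Code Remote - SSH extension reach the container through a single alias:

```
# Local ~/.ssh/config entry; all values are placeholders
Host my-container
    HostName <host-of-container>
    Port 15263
    User <username>
    IdentityFile ~/.ssh/<private key matching the authorized public key>
```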
NVIDIA CUDA MPS Hook
On several Alps vClusters, NVIDIA GPUs by default operate in "Exclusive process" mode, that is, the CUDA driver is configured to allow only one process at a time to use a given GPU.
For example, on a node with 4 GPUs, a maximum of 4 CUDA processes can run at the same time:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ nvidia-smi -L
GPU 0: GH200 120GB (UUID: GPU-...)
GPU 1: GH200 120GB (UUID: GPU-...)
GPU 2: GH200 120GB (UUID: GPU-...)
GPU 3: GH200 120GB (UUID: GPU-...)

# This EDF uses the CUDA vector addition sample from NVIDIA's NGC catalog
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/vectoradd-cuda.toml
image = "nvcr.io#nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"

# 4 processes run successfully
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -t2 -N1 -n4 --environment=vectoradd-cuda /cuda-samples/vectorAdd | grep "Test PASSED"
Test PASSED
Test PASSED
Test PASSED
Test PASSED

# More than 4 concurrent processes result in oversubscription errors
[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -t2 -N1 -n5 --environment=vectoradd-cuda /cuda-samples/vectorAdd | grep "Test PASSED"
Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
srun: error: [...]
[...]
```
In order to run multiple processes concurrently on the same GPU (one example could be running multiple MPI ranks on the same device), the NVIDIA CUDA Multi-Process Service (or MPS, for short) must be started on the compute node.
The Container Engine provides a hook to automatically manage the setup and removal of the NVIDIA CUDA MPS components within containers.
The hook can be activated by setting the com.hooks.nvidia_cuda_mps.enabled annotation to the string true.
It is required to keep the container writable to be able to use the hook.
The following is an example of using the NVIDIA CUDA MPS hook:
```
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/vectoradd-cuda-mps.toml
image = "nvcr.io#nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"
writable = true

[annotations]
com.hooks.nvidia_cuda_mps.enabled = "true"

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -t2 -N1 -n8 --environment=vectoradd-cuda-mps /cuda-samples/vectorAdd | grep "Test PASSED" | wc -l
8
```
When using the NVIDIA CUDA MPS hook it is not necessary to use other wrappers or scripts to manage the Multi-Process Service, as is documented for native jobs on some vClusters.
Full EDF reference
EDF files use the TOML format. For details about the data types used by the different parameters, please refer to the TOML spec webpage.
- base_environment (ARRAY or BASIC STRING): ordered list of EDFs which form the base for the current one. Parameters from the listed environments are evaluated sequentially, adding new entries or overwriting previous ones, before evaluating the parameters from the current EDF. In other words, the current EDF inherits the parameters from the EDFs listed in base_environment. When evaluating mounts or env parameters, values from downstream EDFs are not overwritten but appended to inherited values. The individual EDF entries in the array follow the same search rules as the arguments of the --environment CLI option for Slurm: they can be either absolute paths or filenames without extension if the file is located in the EDF search path. Base environments from the list can have base environments of their own, which are evaluated recursively; a maximum of 10 levels of recursion is currently supported. This parameter can be a string when there is only one base environment. A combined example covering this and other parameters is shown after this list.
- image (BASIC STRING): the container image to use for the containerized environment. Can be a reference to a remote Docker/OCI registry or an absolute filesystem path to a local Squashfs file. The full format for remote references is [USER@][REGISTRY#]IMAGE[:TAG]. The registry user can also be specified in the $HOME/.config/enroot/.credentials file. The default registry is Docker Hub (docker.io).
- workdir (BASIC STRING): the initial working directory in the container. Defaults to the working directory from the container image.
- entrypoint (BOOL): if true, run the entrypoint from the container image. Default: true.
- writable (BOOL): if false, the container filesystem is read-only. Default: false.
- mounts (ARRAY): list of bind mounts to be performed within the container. The format of each array element is SOURCE:DESTINATION[:FLAGS]. The source must be an existing path on the host, while the destination is the mount path inside the container. Optional mount flags are separated with a plus symbol, for example "ro+private".
- env (TABLE): environment variables to set within the container. Containers by default inherit environment variables from the container image and the host environment, with variables from the image taking precedence. The env table can be used to further customize the container environment by setting, modifying or unsetting variables. Values of the table entries must be strings. If an entry has a null value, the variable corresponding to the entry key is unset in the container.
- annotations (TABLE): OCI-like annotations to set for the container. For more details, refer to the Annotations section.
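To tie several of these parameters together, the following is an illustrative sketch of an EDF; the file name, the paths, and the base-gpu base environment are placeholders rather than existing objects:

```
# $HOME/.edf/my-project.toml (illustrative only)
# "base-gpu" is assumed to be another EDF in the search path providing the image and common mounts.
base_environment = "base-gpu"

workdir = "/capstor/scratch/cscs/<username>/my-project"
entrypoint = false
writable = true

# Appended to the mounts inherited from the base environment;
# the "ro" flag after the second colon makes the bind mount read-only.
mounts = ["/capstor/scratch/cscs/<username>/inputs:/data:ro"]

[env]
# Set or override an environment variable inside the container
OMP_NUM_THREADS = "8"

[annotations]
# Enable a hook through an annotation (see the Annotations and Container Hooks sections)
com.hooks.ssh.enabled = "true"
```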