Sarus is a tool to run Linux containers in High Performance Computing environments and is included in the CSCS Products Catalogue. Its development has been driven by the specific requirements of HPC systems, while leveraging open standards and technologies to encourage vendor and community involvement.

Key features:

  • Spawning of isolated software environments (containers), built by users to fit the deployment of a specific application

  • Security oriented to HPC systems

  • Extensible runtime by means of OCI hooks to allow current and future support of custom hardware while achieving native performance

  • Creation of container filesystems tailored for diskless nodes and parallel filesystems

  • Compatibility with workload managers

  • Compatibility with the Open Container Initiative (OCI) standards:

    • Can pull images from registries adopting the OCI Distribution Specification or the Docker Registry HTTP API V2 protocol (e.g. Docker Hub)
    • Can import and convert images adopting the OCI Image Format (e.g. Docker images)
    • Sets up a container bundle complying to the OCI Runtime Specification
    • Uses an OCI-compliant runtime to spawn the container process

Licensing Terms and Conditions

The source code for Sarus is freely distributed under a BSD 3-Clause license.

Setup

Sarus is available from /usr/bin on Eiger. On Piz Daint, load either daint-gpu or daint-mc before loading the sarus modulefile, as shown in the example below:

module load daint-gpu # or daint-mc
module load sarus

The previous set of commands loads the GPU (or multicore) enabled software stack on Piz Daint and then the environment of the default version of the program. You can either type these commands every time you intend to use the program within a new session, or load them automatically by including them in your shell configuration file. The following module commands will print the environment variables set by loading the program and a help message:

module show sarus
module help sarus
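
If you wish to load Sarus automatically in every new session, you can append the module commands to your shell configuration file. A minimal sketch, assuming a bash shell on Piz Daint (pick daint-gpu or daint-mc as appropriate):

echo 'module load daint-gpu' >> ~/.bashrc
echo 'module load sarus' >> ~/.bashrc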

Note for Eiger users

The version of Sarus on Eiger has been configured to work with the default CPE. It is recommended that you use a container image with MPICH 3.3.x. Please note that, due to a limitation of cray-mpich when using the PID namespace, the native MPI hook (--mpi flag) only allows one rank per node. To overcome this limitation, set MPICH_NOLOCAL=1 when submitting a job with multiple ranks per node, e.g.:

MPICH_NOLOCAL=1 srun -N1 --ntasks-per-node=2 sarus run --mpi ethcscs/osu-mb:5.3.2-mpich3.1.4 ./osu_latency

For more instructions, please refer to the section How to run on Piz Daint below.

How to run on Piz Daint

Here we provide essential instructions and system-specific information about Sarus on Piz Daint. For the full details about Sarus commands, options and features, please refer to the official User Guide on Read the Docs.

Managing images

Use the sarus pull command followed by the image descriptor to import Docker images from Docker Hub:

srun -C gpu sarus pull debian:latest

We strongly recommend running the sarus pull command on the compute nodes through Slurm, so that Sarus can take advantage of their large RAM filesystem, which greatly reduces the pull time and allows pulling larger images. Should you run into problems because the pulled image does not fit in the default filesystem, you can specify an alternative temporary directory with the --temp-dir option. When pulling images from an interactive terminal, we advise using the --pty option of the srun command to improve the quality and clarity of the terminal output.
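
For example, the following sketch pulls an image interactively while using a custom temporary directory (the directory below is only illustrative; any location with sufficient free space works):

mkdir -p $SCRATCH/tmp
srun -C gpu --pty sarus pull --temp-dir=$SCRATCH/tmp ubuntu:22.04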

You can use the sarus images command to list the images available in your local repository:

sarus images
 REPOSITORY       TAG          DIGEST         CREATED               SIZE         SERVER
 alpine           latest       65e50dd72f89   2019-10-10T12:47:19   2.62MB       index.docker.io
 ubuntu           latest       d46a1a2fcda4   2019-10-10T12:47:28   24.69MB      index.docker.io

IMPORTANT: Please be aware that the local Sarus repository is individual for each user, and on Piz Daint is located inside the $SCRATCH/.sarus directory.

Sarus tries to closely follow Docker's command-line interface. To remove images you no longer need and recover disk space, use the sarus rmi command:

sarus rmi ubuntu:latest
 removed index.docker.io/library/ubuntu/latest

Running containers

Containers are run using Slurm and the sarus run command, specifying the desired image as the first positional argument of the command. The arguments entered after the image will be interpreted as the command to be executed inside the container:

srun -C gpu sarus run python:3-slim python --version
 Python 3.7.4

srun -C gpu sarus run debian cat /etc/os-release
 PRETTY_NAME="Debian GNU/Linux 10 (buster)"
 NAME="Debian GNU/Linux"
 VERSION_ID="10"
 VERSION="10 (buster)"
 VERSION_CODENAME=buster
 ID=debian
 HOME_URL="https://www.debian.org/"
 SUPPORT_URL="https://www.debian.org/support"
 BUG_REPORT_URL="https://bugs.debian.org/"

Accessing host directories from the container

On Piz Daint, host filesystems are not automatically mounted inside containers.

If you need to work with the contents of your host directories, such as $SCRATCH or $HOME, you can use the --mount option of sarus run as in the following examples:

# Mount $SCRATCH
srun -C gpu sarus run --mount=type=bind,source=$SCRATCH,destination=$SCRATCH debian ls -l $SCRATCH

# Mount $HOME
srun -C gpu sarus run --mount=type=bind,source=$HOME,destination=$HOME debian ls -l $HOME

Warning: Please be aware that mounting $HOME will make your SSH keys available inside the container, which represents a security risk.
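
Host directories can also be mounted at a different path inside the container. A sketch (the source directory is a placeholder; adapt it to your data):

srun -C gpu sarus run --mount=type=bind,source=$SCRATCH/input-data,destination=/data debian ls -l /data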

For details about the --mount option, please refer to the official User Guide.

Environmental transfer

All the environment variables defined in the host process environment are transferred into the container; however, any environment variable re-defined in the container image (e.g. through Docker ENV instructions) will override the value set on the host.
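
As an illustration, a variable exported on the host (the name below is arbitrary) is visible inside the container unless the image redefines it:

export MY_VARIABLE=example
srun -C gpu sarus run debian env | grep MY_VARIABLE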

Additional features from OCI hooks

The following OCI hooks are enabled on Piz Daint:

  • Native MPI support (MPICH-based)

    Containers with native MPI performance can be launched by passing the --mpi option to the sarus run command, e.g.:

     srun -N16 -n16 -C gpu sarus run --mpi <repo name>/<image name>:<image tag> <mpi_application> 

    In order to access the high-speed Cray Aries interconnect, the container application must be dynamically linked to an MPI implementation that is ABI-compatible with the compute node's MPI on Piz Daint, i.e. an MPICH-based implementation adhering to the MPICH ABI compatibility initiative (for example MPICH itself, MVAPICH2, or Intel MPI).

  • NVIDIA GPU support

    Container images for GPU-accelerated applications must feature an installation of the CUDA Toolkit: please note that Piz Daint currently supports CUDA Toolkit versions from 8.0 up to 11.2. No direct user interaction is required to make GPUs available inside the container (see the example after this list).

  • SSH connection within containers

  • Glibc replacement
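
As a quick check of the GPU support, a CUDA container can query the devices with nvidia-smi (the image tag below is only an example of an image featuring a supported CUDA Toolkit):

srun -C gpu sarus run nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi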

Log messages from OCI hooks

Due to security hardening measures introduced in Sarus 1.6.3 and runc 1.1.12, log messages from OCI hooks within a Slurm job are no longer captured and redirected to the terminal or files by Slurm commands.

A partial workaround is to use the srun --pty option to print hook outputs only for the first task in the job step (a.k.a. task zero).

Log messages from OCI hooks are mostly useful in conjunction with the sarus --verbose and --debug verbosity options to analyse or diagnose the container creation process.
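
For example, a minimal way to inspect hook messages of the first task interactively (the image and command are placeholders):

srun -C gpu --pty sarus --debug run debian true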

In case of failure of an OCI hook, its logging outputs are printed by runc (the low-level OCI runtime currently used by Sarus), although in an unformatted fashion.

Work is ongoing to solve this issue in future Sarus installations.

Running MPI applications without native MPI support

The MPI replacement mechanism controlled by the --mpi option is not mandatory to run distributed applications in Sarus containers. It is possible to run containers using the MPI implementation embedded in the image, forgoing the performance of the host's high-performance interconnect hardware. Containers are still launched through the host system process manager, hence this execution model is also known as the "hybrid approach".

This can be useful in a number of scenarios:

  • the software stack in the container should not be altered in any way
  • non-performance-critical testing
  • ABI compatibility for native hardware acceleration cannot be satisfied

On Piz Daint, the default process manager interface used by Slurm is the Cray PMI. It is possible to select the PMI-2 interface with the --mpi=pmi2 option of srun. PMI-2 is adopted by MPICH and MPICH-derived implementations, while Open MPI has to be configured explicitly at build time to support it.

The following example shows how to run the OSU point-to-point latency test from the Sarus cookbook on Piz Daint without native interconnect support:

srun -C gpu -N2 -t2 --mpi=pmi2 sarus run ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency
 ###MPI-3.0
 # OSU MPI Latency Test v5.6.1
 # Size          Latency (us)
 0                       6.66
 1                       6.81
 2                       6.88
 4                       6.88
 8                       6.85
 16                      6.79
 32                      6.88
 64                      6.86
 128                     6.85
 256                     6.84
 512                     6.66
 1024                    9.14
 2048                   10.03
 4096                   10.49
 8192                   11.21
 16384                  12.85
 32768                  16.11
 65536                  26.95
 131072                 51.95
 262144                 77.97
 524288                128.89
 1048576               229.30
 2097152               432.25
 4194304               839.49

Notice that the --mpi option was not passed to sarus run. To give a measure of the performance penalty, below is example output for the same test with native MPI support enabled (lower numbers indicate better performance).

srun -C gpu -p debug -N2 -t2 sarus run --mpi ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency
 ###MPI-3.1
 # OSU MPI Latency Test v5.6.1
 # Size          Latency (us)
 0                       1.15
 1                       1.12
 2                       1.10
 4                       1.09
 8                       1.10
 16                      1.10
 32                      1.10
 64                      1.10
 128                     1.11
 256                     1.12
 512                     1.15
 1024                    1.39
 2048                    1.67
 4096                    2.27
 8192                    4.16
 16384                   5.03
 32768                   6.65
 65536                   9.98
 131072                 16.64
 262144                 29.94
 524288                 56.40
 1048576               109.25
 2097152               216.19
 4194304               430.82

For more information about running MPI applications in Sarus containers without native MPI support, please refer to the official User Guide.

Known issues

  • When running on a Piz Daint compute node, using the --workdir option to set the initial container directory to a location bind mounted from a subdirectory of the user's home directory results in an error. For example, the following command will encounter the aforementioned issue:

    srun -C gpu -A csstaff sarus run --mount=type=bind,source=$HOME/subdir,destination=/cwd --workdir /cwd alpine pwd

Further documentation