Sarus is a tool to run Linux containers in High Performance Computing environments and is included in the CSCS Products Catalogue. Its development has been driven by the specific requirements of HPC systems, while leveraging open standards and technologies to encourage vendor and community involvement.

Key features:

  • Spawning of isolated software environments (containers), built by users to fit the deployment of a specific application

  • Security oriented to HPC systems

  • Extensible runtime by means of OCI hooks to allow current and future support of custom hardware while achieving native performance

  • Creation of container filesystems tailored for diskless nodes and parallel filesystems

  • Compatibility with the presence of a workload manager

  • Compatibility with the Open Container Initiative (OCI) standards:

    • Can pull images from registries adopting the OCI Distribution Specification or the Docker Registry HTTP API V2 protocol (e.g. Docker Hub)
    • Can import and convert images adopting the OCI Image Format (e.g. Docker images)
    • Sets up a container bundle complying to the OCI Runtime Specification
    • Uses an OCI-compliant runtime to spawn the container process

Licensing Terms and Conditions

The source code for Sarus is freely distributed under a BSD 3-Clause license.

Setup

Sarus is available from /usr/bin on Eiger and has been configured to work with the default CPE. It is recommended that you use a container image with MPICH 3.3.x. Please note that, due to a limitation of cray-mpich when using a PID namespace, the native MPI hook (--mpi flag) only supports one rank per node. To overcome this limitation, set MPICH_NOLOCAL=1 when submitting a job with multiple ranks per node, e.g.:

MPICH_NOLOCAL=1 srun -N1 --ntasks-per-node=2 sarus run --mpi ethcscs/osu-mb:5.3.2-mpich3.1.4 ./osu_latency
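
For batch jobs, the same workaround can be applied by exporting the variable in the job script. The following is a minimal sketch (job name, resources and walltime are placeholders to adapt to your case):

#!/bin/bash -l
#SBATCH --job-name=sarus-osu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:05:00

# Work around the cray-mpich limitation when running multiple ranks per node
# with the native MPI hook
export MPICH_NOLOCAL=1

srun sarus run --mpi ethcscs/osu-mb:5.3.2-mpich3.1.4 ./osu_latency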

For further instructions, refer to the section How to run below.

How to run 

Here we provide essential instructions and system-specific information about Sarus. For the full details about Sarus commands, options and features, please refer to the official User Guide on Read the Docs.

Managing images

Use the sarus pull command followed by the image descriptor to import Docker images from Docker Hub:

srun -C mc sarus pull debian:latest

We strongly recommend running the sarus pull command on the compute nodes through Slurm, so that Sarus can take advantage of their large RAM filesystem, which greatly reduces the pull time and allows pulling larger images. Should you run into problems because the pulled image does not fit in the default filesystem, you can specify an alternative temporary directory with the --temp-dir option. When pulling images from an interactive terminal, we advise using the --pty option of the srun command to improve the quality and clarity of the terminal output.
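
For example, a pull on a compute node that keeps temporary data on scratch might look like the following sketch (the temporary directory path is only a placeholder):

# Pull interactively from a compute node, using an alternative temporary directory
srun -C mc --pty sarus pull --temp-dir=$SCRATCH/sarus-tmp debian:latest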

You can use the sarus images command to list the images available in your local repository:

sarus images
 REPOSITORY       TAG          DIGEST         CREATED               SIZE         SERVER
 alpine           latest       65e50dd72f89   2019-10-10T12:47:19   2.62MB       index.docker.io
 ubuntu           latest       d46a1a2fcda4   2019-10-10T12:47:28   24.69MB      index.docker.io

IMPORTANT: Please be aware that the local Sarus repository is personal and is located inside the $SCRATCH/.sarus directory.
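
Since the local repository is an ordinary directory, standard tools can be used to check how much space it occupies, e.g.:

du -sh $SCRATCH/.sarus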

Sarus tries to closely follow Docker's command-line interface. To remove images you no longer need and recover disk space, use the sarus rmi command:

sarus rmi ubuntu:latest
 removed index.docker.io/library/ubuntu/latest

Running containers

Containers are run using Slurm and the sarus run command, specifying the desired image as the first positional argument of the command. The arguments entered after the image will be interpreted as the command to be executed inside the container:

srun -C mc sarus run python:3-slim python --version
 Python 3.7.4

srun -C gpu sarus run debian cat /etc/os-release
 PRETTY_NAME="Debian GNU/Linux 10 (buster)"
 NAME="Debian GNU/Linux"
 VERSION_ID="10"
 VERSION="10 (buster)"
 VERSION_CODENAME=buster
 ID=debian
 HOME_URL="https://www.debian.org/"
 SUPPORT_URL="https://www.debian.org/support"
 BUG_REPORT_URL="https://bugs.debian.org/"
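
Containers can also be used interactively. As a sketch, assuming the image provides bash, an interactive shell can be obtained by combining the --pty option of srun with the --tty option of sarus run:

srun -C mc --pty sarus run --tty debian bash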

Accessing host directories from the container

Host filesystems are not automatically mounted inside containers.

If you need to work with the contents of your host directories, such as $SCRATCH or $HOME, you can use the --mount option of sarus run as in the following examples:

# Mount $SCRATCH
srun -C mc sarus run --mount=type=bind,source=$SCRATCH,destination=$SCRATCH debian ls -l $SCRATCH

# Mount $HOME
srun -C mc sarus run --mount=type=bind,source=$HOME,destination=$HOME debian ls -l $HOME
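
The source and destination do not need to be the same path. For instance, a data directory on scratch (the path below is only an example) can be made available under /data inside the container:

# Mount a subdirectory of $SCRATCH to /data inside the container
srun -C mc sarus run --mount=type=bind,source=$SCRATCH/mydata,destination=/data debian ls -l /data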

Warning: Please be aware that mounting $HOME will make your SSH keys available inside the container, which represents a security risk.

For details about the --mount option, please refer to the official User Guide.

Environmental transfer

All the environment variables defined in the host process environment are transferred into the container; however, any environment variable re-defined in the container image, e.g. Docker ENV-defined variables, will take precedence and override the values set on the host.
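
As an illustration (MY_VARIABLE is a hypothetical variable used only for this example), a value exported on the host is visible inside the container unless the image redefines it:

# Export a variable on the host and read it back from inside the container
export MY_VARIABLE=from_host
srun -C mc sarus run debian bash -c 'echo $MY_VARIABLE'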

Additional features from OCI hooks

The following OCI hooks are enabled:

  • Native MPI support (MPICH-based)

    Containers with native MPI performance can be launched by passing the --mpi option to the sarus run command, e.g.:

     srun -N16 -n16 -C mc sarus run --mpi <repo name>/<image name>:<image tag> <mpi_application> 

    In order to access the high-speed Cray Aries interconnect, the container application must be dynamically linked to an MPI implementation that is ABI-compatible with the compute node's MPI. As noted in the Setup section above, we recommend building images against MPICH 3.3.x or an ABI-compatible MPICH-based implementation.
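
    One way to check whether an application in the image is dynamically linked against an MPI library is to inspect it from inside the container (image name and application path are placeholders, and ldd must be available in the image):

     srun -C mc sarus run <repo name>/<image name>:<image tag> ldd <path_to_mpi_application>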

Log messages from OCI hooks

Due to security hardening measures introduced in Sarus 1.6.3 and runc 1.1.12, log messages from OCI hooks within a Slurm job are no longer captured and redirected to the terminal or files by Slurm commands.

A partial workaround is to use the srun --pty option to print hook outputs only for the first task in the job step (a.k.a. task zero).

Log messages from OCI hooks are mostly useful in conjunction with the sarus --verbose and sarus --debug options to analyse or diagnose the container creation process.
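
For example, to inspect hook messages for task zero while the container is created (image and command are placeholders):

srun -C mc -N1 --pty sarus --debug run <repo name>/<image name>:<image tag> <command>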

In case of failure of an OCI hook, its logging outputs are printed by runc (the low-level OCI runtime currently used by Sarus), although in an unformatted fashion.

Work is ongoing to solve this issue in future Sarus installations.

Running MPI applications without native MPI support

The MPI replacement mechanism controlled by the --mpi option is not mandatory for running distributed applications in Sarus containers. It is possible to run containers using the MPI implementation embedded in the image, forgoing the performance of custom high-performance hardware. Containers are still launched through the host system process manager, hence this execution model is also known as the "hybrid approach".

This can be useful in a number of scenarios:

  • the software stack in the container should not be altered in any way
  • non-performance-critical testing
  • inability to satisfy ABI compatibility for native hardware acceleration

The default process manager interface used by Slurm is the Cray PMI. It is possible to select the PMI-2 interface with the --mpi=pmi2 option to srun. PMI-2 is adopted by MPICH and MPICH-derived implementations, while OpenMPI has to be configured explicitly at build-time to support it.

The following example shows how to run the OSU point-to-point latency test from the Sarus cookbook without native interconnect support:

srun -C mc -N2 -t2 --mpi=pmi2 sarus run ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency
 ###MPI-3.0
 # OSU MPI Latency Test v5.6.1
 # Size          Latency (us)
 0                       6.66
 1                       6.81
 2                       6.88
 4                       6.88
 8                       6.85
 16                      6.79
 32                      6.88
 64                      6.86
 128                     6.85
 256                     6.84
 512                     6.66
 1024                    9.14
 2048                   10.03
 4096                   10.49
 8192                   11.21
 16384                  12.85
 32768                  16.11
 65536                  26.95
 131072                 51.95
 262144                 77.97
 524288                128.89
 1048576               229.30
 2097152               432.25
 4194304               839.49

Notice that the --mpi option was not passed to sarus run. To have a measure of the performance penalty, below is an example output for the same test with native MPI support enabled (lower numbers indicate better performance).

srun -C mc -p debug -N2 -t2 sarus run --mpi ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency
 ###MPI-3.1
 # OSU MPI Latency Test v5.6.1
 # Size          Latency (us)
 0                       1.15
 1                       1.12
 2                       1.10
 4                       1.09
 8                       1.10
 16                      1.10
 32                      1.10
 64                      1.10
 128                     1.11
 256                     1.12
 512                     1.15
 1024                    1.39
 2048                    1.67
 4096                    2.27
 8192                    4.16
 16384                   5.03
 32768                   6.65
 65536                   9.98
 131072                 16.64
 262144                 29.94
 524288                 56.40
 1048576               109.25
 2097152               216.19
 4194304               430.82

For more information about running MPI applications in Sarus containers without native MPI support, please refer to the official User Guide.

Known issues

  • When running on a compute node, using the --workdir option to set the initial container directory to a location bind mounted from a subdirectory of the user's home directory results in an error; a possible alternative is sketched below. For example, the following command will encounter the aforementioned issue: srun -C mc sarus run --mount=type=bind,source=$HOME/subdir,destination=/cwd --workdir /cwd alpine pwd
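
    As a sketch of a possible alternative (not an official workaround), the working directory can be changed by the command executed inside the container instead of using --workdir:

    srun -C mc sarus run --mount=type=bind,source=$HOME/subdir,destination=/cwd alpine sh -c 'cd /cwd && pwd'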

Further documentation