Sarus is a tool to run Linux containers on High Performance Computing environments and is included in the CSCS Products Catalogue. Its development has been driven by the specific requirements of HPC systems, while leveraging open standards and technologies to encourage vendor and community involvement.
Key features:
- Spawning of isolated software environments (containers), built by users to fit the deployment of a specific application
- Security oriented to HPC systems
- Extensible runtime by means of OCI hooks to allow current and future support of custom hardware while achieving native performance
- Creation of container filesystems tailored for diskless nodes and parallel filesystems
- Compatibility with the presence of a workload manager
- Compatibility with the Open Container Initiative (OCI) standards:
  - Can pull images from registries adopting the OCI Distribution Specification or the Docker Registry HTTP API V2 protocol (e.g. Docker Hub)
  - Can import and convert images adopting the OCI Image Format (e.g. Docker images)
  - Sets up a container bundle complying with the OCI Runtime Specification
  - Uses an OCI-compliant runtime to spawn the container process
Licensing Terms and Conditions
The source code for Sarus is freely distributed under a BSD 3-Clause license.
Setup
Sarus is available from /usr/bin on Eiger, while on Piz Daint you should load daint-gpu or daint-mc before loading the sarus modulefile, as shown in the example below:
module load daint-gpu # or daint-mc
module load sarus
The previous set of commands will load the GPU (or multicore) enabled software stack on Piz Daint and then load the environment of the default version of Sarus. You can either type these commands every time you intend to use the program within a new session, or have them loaded automatically by adding them to your shell configuration file (see the sketch below). The following module commands will print the environment variables set by loading the program and a help message:
module show sarus
module help sarus
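To load Sarus automatically in every new session, you can append the module commands to your shell startup file. The snippet below is a minimal sketch assuming a bash shell and the GPU software stack; adapt it to your own shell and to the stack (daint-gpu or daint-mc) you actually use:
# Append the module commands to ~/.bashrc so they run in every new shell
echo 'module load daint-gpu' >> ~/.bashrc
echo 'module load sarus' >> ~/.bashrc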
Note for Eiger users
The version of Sarus on Eiger has been configured to work with the default CPE. It is recommended that you use a container image with MPICH 3.3.x. Please note that, due to a limitation of cray-mpich when using PID namespace, if you use the native MPI hook (--mpi flag) you will only be able to use one rank per node. To overcome this limitation, please set MPICH_NOLOCAL=1 when submitting a job with multiple ranks per node, e.g.:
MPICH_NOLOCAL=1 srun -N1 --ntasks-per-node=2 sarus run --mpi ethcscs/osu-mb:5.3.2-mpich3.1.4 ./osu_latency
For more instructions, please refer to the section How to run on Piz Daint below.
How to run on Piz Daint
Here we provide essential instructions and system-specific information about Sarus on Piz Daint. For the full details about Sarus commands, options and features, please refer to the official User Guide on Read the Docs.
Managing images
Use the sarus pull command followed by the image descriptor to import Docker images from Docker Hub:
srun -C gpu sarus pull debian:latest
We strongly recommend running the sarus pull command on the compute nodes through Slurm, so that Sarus can take advantage of their large RAM filesystem, which greatly reduces the pull time and allows pulling larger images. Should you run into problems because the pulled image does not fit in the default filesystem, you can specify an alternative temporary directory with the --temp-dir option. When pulling images from an interactive terminal, we advise using the --pty option of the srun command to improve the quality and clarity of the terminal output.
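For example, the two options can be combined as follows; the image below is only an illustration of a larger pull, and the --temp-dir path assumes you want to stage the pull on your scratch space:
# Pull interactively; --pty gives cleaner progress output, --temp-dir avoids filling the default temporary filesystem
srun -C gpu --pty sarus pull --temp-dir=$SCRATCH tensorflow/tensorflow:latest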
You can use the sarus images command to list the images available in your local repository:
sarus images
REPOSITORY   TAG      DIGEST         CREATED               SIZE      SERVER
alpine       latest   65e50dd72f89   2019-10-10T12:47:19   2.62MB    index.docker.io
ubuntu       latest   d46a1a2fcda4   2019-10-10T12:47:28   24.69MB   index.docker.io
IMPORTANT: Please be aware that the local Sarus repository is individual for each user, and on Piz Daint it is located inside the $SCRATCH/.sarus directory.
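Because the repository lives under $SCRATCH, pulled images consume space there; you can check how much with standard tools, for example:
# Show the total size of your local Sarus image repository
du -sh $SCRATCH/.sarus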
Sarus tries to closely follow Docker's command-line interface. To remove images you no longer need and recover disk space, use the sarus rmi command:
sarus rmi ubuntu:latest
removed index.docker.io/library/ubuntu/latest
Running containers
Containers are run using Slurm and the sarus run command, specifying the desired image as the first positional argument of the command. The arguments entered after the image will be interpreted as the command to be executed inside the container:
srun -C gpu sarus run python:3-slim python --version
Python 3.7.4

srun -C gpu sarus run debian cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Accessing host directories from the container
On Piz Daint, host filesystems are not automatically mounted inside containers. If you need to work with the contents of your host directories, such as $SCRATCH or $HOME, you can use the --mount option of sarus run as in the following examples:
# Mount $SCRATCH
srun -C gpu sarus run --mount=type=bind,source=$SCRATCH,destination=$SCRATCH debian ls -l $SCRATCH

# Mount $HOME
srun -C gpu sarus run --mount=type=bind,source=$HOME,destination=$HOME debian ls -l $HOME
Warning: Please be aware that mounting $HOME will make your SSH keys available inside the container, which represents a security risk.
For details about the --mount option, please refer to the official User Guide.
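If you only need part of your home directory, a safer alternative is to bind mount just the relevant subdirectory; the subdirectory and destination names below are purely illustrative:
# Expose only a project subdirectory instead of the whole $HOME
srun -C gpu sarus run --mount=type=bind,source=$HOME/my-project,destination=/data debian ls -l /data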
Environmental transfer
All the environment variables defined in the host process environment will be transferred into the container; however, any environment variable re-defined in the container image, e.g. Docker ENV-defined variables, will be sourced and will override the corresponding values set on the host.
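As a minimal illustration of this behaviour (MY_VARIABLE is a hypothetical variable and the output shown is indicative):
# A variable exported on the host is visible inside the container
export MY_VARIABLE=from_the_host
srun -C gpu sarus run debian env | grep MY_VARIABLE
MY_VARIABLE=from_the_host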
Additional features from OCI hooks
The following OCI hooks are enabled on Piz Daint:
Native MPI support (MPICH-based)
Containers with native MPI performance can be launched by passing the --mpi option to the sarus run command, e.g.:
srun -N16 -n16 -C gpu sarus run --mpi <repo name>/<image name>:<image tag> <mpi_application>
In order to access the high-speed Cray Aries interconnect, the container application must be dynamically linked to an MPI implementation that is ABI-compatible with the MPI on the Piz Daint compute nodes (a quick way to check the linking is sketched after the list below). We recommend one of the following MPI implementations:
- MPICH v3.1.4 (February 2015)
- MVAPICH2 2.2 (September 2016)
- Intel MPI Library 2017 Update 1
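To verify which MPI library your application is dynamically linked against, you can inspect it from inside the container; the image name and application path below are placeholders, and this assumes ldd is available in the image:
# Look for the MPI shared library among the application's dependencies
srun -C gpu sarus run <repo name>/<image name>:<image tag> ldd /usr/local/bin/<mpi_application> | grep -i libmpi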
Container images for GPU-accelerated applications must feature an installation of the CUDA Toolkit: please note that Piz Daint currently supports CUDA Toolkit versions from 8.0 up to 11.2. No direct user interaction is required to make GPUs available inside the container.
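For instance, a CUDA-enabled image should be able to query the device directly; the image name below is a placeholder for any image with a suitable CUDA Toolkit installation, and this assumes the NVIDIA hook exposes the driver utilities (such as nvidia-smi) inside the container:
# The GPU and driver utilities are made available in the container automatically
srun -C gpu sarus run <cuda-enabled image> nvidia-smi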
Log messages from OCI hooks
Due to security hardening measures introduced in Sarus 1.6.3 and runc 1.1.12, log messages from OCI hooks within a Slurm job are no longer captured and redirected to the terminal or files by Slurm commands.
A partial workaround is to use the srun --pty option to print hook outputs only for the first task in the job step (a.k.a. task zero).
Logging messages from OCI hooks are mostly useful in conjunction with the sarus --verbose and sarus --debug verbosity options to analyse or diagnose the container creation process.
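For example, to inspect hook output interactively for task zero (the image and command below are placeholders):
# --pty prints hook messages for the first task only; --verbose increases Sarus logging
srun -C gpu -N1 --pty sarus --verbose run --mpi <repo name>/<image name>:<image tag> <mpi_application>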
In case of failure of an OCI hook, its logging outputs are printed by runc (the low-level OCI runtime currently used by Sarus), although in an unformatted fashion.
Work is ongoing to solve this issue in future Sarus installations.
Running MPI applications without native MPI support
The MPI replacement mechanism controlled by the --mpi option is not mandatory to run distributed applications in Sarus containers. It is possible to run containers using the MPI implementation embedded in the image, forgoing the performance of the custom high-performance hardware. Containers are still launched through the host system process manager, hence this execution model is also known as the "hybrid approach".
This can be useful in a number of scenarios:
- the software stack in the container should not be altered in any way
- non-performance-critical testing
- inability to satisfy ABI compatibility for native hardware acceleration
On Piz Daint, the default process manager interface used by Slurm is the Cray PMI. It is possible to select the PMI-2 interface with the --mpi=pmi2 option to srun. PMI-2 is adopted by MPICH and MPICH-derived implementations, while OpenMPI has to be configured explicitly at build time to support it.
The following example shows how to run the OSU point-to-point latency test from the Sarus cookbook on Piz Daint without native interconnect support:
srun -C gpu -N2 -t2 --mpi=pmi2 sarus run ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency
###MPI-3.0
# OSU MPI Latency Test v5.6.1
# Size          Latency (us)
0                       6.66
1                       6.81
2                       6.88
4                       6.88
8                       6.85
16                      6.79
32                      6.88
64                      6.86
128                     6.85
256                     6.84
512                     6.66
1024                    9.14
2048                   10.03
4096                   10.49
8192                   11.21
16384                  12.85
32768                  16.11
65536                  26.95
131072                 51.95
262144                 77.97
524288                128.89
1048576               229.30
2097152               432.25
4194304               839.49
Notice that the --mpi option was not passed to sarus run. To have a measure of the performance penalty, below is an example output for the same test with native MPI support enabled (lower numbers indicate better performance).
srun -C gpu -p debug -N2 -t2 sarus run --mpi ethcscs/mpich:ub1804_cuda92_mpi314_osu ./osu_latency
###MPI-3.1
# OSU MPI Latency Test v5.6.1
# Size          Latency (us)
0                       1.15
1                       1.12
2                       1.10
4                       1.09
8                       1.10
16                      1.10
32                      1.10
64                      1.10
128                     1.11
256                     1.12
512                     1.15
1024                    1.39
2048                    1.67
4096                    2.27
8192                    4.16
16384                   5.03
32768                   6.65
65536                   9.98
131072                 16.64
262144                 29.94
524288                 56.40
1048576               109.25
2097152               216.19
4194304               430.82
For more information about running MPI applications in Sarus containers without native MPI support, please refer to the official User Guide.
Known issues
- When running on a Piz Daint compute node, using the --workdir option to set the initial container directory to a location bind mounted from a subdirectory of the user's home directory results in an error. For example, the following command will encounter the aforementioned issue:
srun -C gpu -A csstaff sarus run --mount=type=bind,source=$HOME/subdir,destination=/cwd --workdir /cwd alpine pwd