This page documents the interfaces for the preconfigured CI/CD runners provided by CSCS. A fixed interface allows CSCS to change technologies without breaking users' CI pipelines. For example, the container-builder
interface takes as input a Dockerfile, and pushes as output a container image to the container registry. The implementation details of transforming the Dockerfile to a container image are hidden, and from a user perspective it is irrelevant whether the image is built using docker, podman, buildah, etc.
Each runner and its interface are described below. To use a runner provided by CSCS, you must include the runners' configuration yaml file in your pipeline like this:
```yaml
include:
  - remote: 'https://gitlab.com/cscs-ci/recipes/-/raw/master/templates/v2/.ci-ext.yml'
```
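To illustrate how the include and the runner templates fit together, here is a hedged sketch of a minimal pipeline file; the stage name, job name, Dockerfile path, and image path are placeholders, not values prescribed by CSCS.

```yaml
include:
  - remote: 'https://gitlab.com/cscs-ci/recipes/-/raw/master/templates/v2/.ci-ext.yml'

stages:
  - build

build-image:                 # placeholder job name
  stage: build
  extends: .container-builder-cscs-zen2
  variables:
    DOCKERFILE: ci/docker/Dockerfile                                      # placeholder path in your repository
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/my_image:$CI_COMMIT_SHORT_SHA
```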
container-builder
We provide the container-builder
runner for every available CPU architecture at CSCS. This runner takes as input a Dockerfile, builds a container image based on the recipe in the Dockerfile and publishes the image to an OCI registry.
The naming for the runner is container-builder-cscs-<MICROARCHITECTURE>.
The following runners are available:
- .container-builder-cscs-zen2
- .container-builder-cscs-gh200
The supported variables are:
Variable | Description |
---|---|
DOCKERFILE | Relative path in your repository to the Dockerfile recipe. This variable is mandatory and has no default value. EXAMPLE: |
PERSIST_IMAGE_NAME | The path where the container image is stored. CSCS provides a registry through the variable CSCS_REGISTRY_PATH. See also dependency management for common naming and third-party registry usage. This variable is mandatory and has no default value. EXAMPLE: |
CSCS_BUILD_IN_MEMORY | Instructs the runner to perform the whole build process in memory. This variable is optional. EXAMPLE: |
DOCKER_BUILD_ARGS | Allows passing build arguments to the image build, for use with the ARG keyword in the Dockerfile. This variable is optional and the default value is an empty string. EXAMPLE: |
CSCS_REBUILD_POLICY | This variable has two possible values, controlling whether the image is rebuilt when it already exists in the registry. This variable is optional. EXAMPLE: |
SECONDARY_REGISTRY | Allows pushing the image additionally to a secondary registry. This variable is optional and is empty by default. EXAMPLE: |
SECONDARY_REGISTRY_USERNAME | The username used to push to the secondary registry. This variable is optional and is empty by default. EXAMPLE: |
SECONDARY_REGISTRY_PASSWORD | The password/token used to push to the secondary registry. For security you should store it as a secret variable on the CI setup page and forward it in the job yaml. This variable is optional and is empty by default. EXAMPLE: |
CUSTOM_REGISTRY_USERNAME | The username to use when PERSIST_IMAGE_NAME points to a custom (non-CSCS) registry. This variable is optional and is empty by default. EXAMPLE: |
CUSTOM_REGISTRY_PASSWORD | The password/token to use when PERSIST_IMAGE_NAME points to a custom (non-CSCS) registry. For security you should store it as a secret variable on the CI setup page and forward it in the job yaml. EXAMPLE: |
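To illustrate how several of these variables combine, the hedged sketch below also pushes the built image to a third-party registry, forwarding a secret stored on the CI setup page; the mirror target, the username, and the secret name MY_DOCKERHUB_TOKEN are placeholders, not values defined by CSCS.

```yaml
build-and-mirror:
  extends: .container-builder-cscs-zen2
  variables:
    DOCKERFILE: ci/docker/Dockerfile                                      # placeholder path
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/x86_64/my_image:$CI_COMMIT_SHORT_SHA
    SECONDARY_REGISTRY: docker.io/my_org/my_image:$CI_COMMIT_SHORT_SHA    # hypothetical mirror target
    SECONDARY_REGISTRY_USERNAME: my_dockerhub_user                        # hypothetical username
    SECONDARY_REGISTRY_PASSWORD: $MY_DOCKERHUB_TOKEN                      # secret variable forwarded from the CI setup page
```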
Two container build arguments are always provided during the image build process:
CSCS_REGISTRY_PATH
During the build process this is exposed as a build argument, so that you can write
```dockerfile
ARG CSCS_REGISTRY_PATH
FROM $CSCS_REGISTRY_PATH/some_subdirectory/my_image:latest
```
NUM_PROCS
This is an integer value with the number of cores allocated for your build job. It allows you to write something like this in your Dockerfile:
```dockerfile
ARG NUM_PROCS
RUN cd build && make -j$NUM_PROCS
```
During the image build the source code is both the build context and bind-mounted at /sourcecode.
This means that you can copy source files into the image with either of the following instructions in the Dockerfile recipe:
```dockerfile
COPY . /tmp/my_sourcecode
RUN cp -a /sourcecode /tmp/my_sourcecode
```
If you do not need the source code at all in the final container, you can also work directly with the source code in /sourcecode
. The path is mounted read-only, therefore your software MUST support out-of-source builds.
The following two examples demonstrate CI jobs for building a container image:

```yaml
job1:
  extends: .container-builder-cscs-zen2
  variables:
    DOCKERFILE: ci/docker/Dockerfile
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/x86_64/my_image:$CI_COMMIT_SHORT_SHA

job2:
  extends: .container-builder-cscs-gh200
  variables:
    DOCKERFILE: ci/docker/Dockerfile
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/aarch64/my_image:$CI_COMMIT_SHORT_SHA
```
container-runner
This runner submits SLURM jobs through FirecREST. See the comments in the FirecREST section below and the FirecREST documentation for additional information.
The naming for the runner is container-runner-<CLUSTERNAME>-<MICROARCHITECTURE>.
The following runners are available:
- .container-runner-eiger-zen2
- .container-runner-daint-gh200
The container image is specified with the image tag in the job yaml. This tag is mandatory. Supported variables are:
Variable | Additional information |
---|---|
GIT_STRATEGY | This is a standard GitLab variable, mentioned here explicitly because very often you do not need to clone the repository source code when you run your containerized application. EXAMPLE VALUE: |
CRAY_CUDA_MPS | Enable running with nvidia-mps-server, which allows multiple ranks to share the same GPU. This variable is optional and the default value is NO. EXAMPLE VALUE: |
USE_MPI | Enable running with the MPI hook, which injects the host MPI library into the container at runtime for native MPI performance. This variable is optional. EXAMPLE VALUE: |
CSCS_ADDITIONAL_MOUNTS | This allows mounting user-defined host directories inside the container. The value must be a valid JSON array, where each entry is of the form "<host_path>:<container_path>" (see the example below). This variable is optional and by default empty. EXAMPLE VALUE: |
The following two examples demonstrate CI jobs for running with a container image:

```yaml
job1:
  extends: .container-runner-daint-gh200
  image: $CSCS_REGISTRY_PATH/aarch64/my_image:$CI_COMMIT_SHORT_SHA
  script:
    - /usr/bin/my_application /data/some_input.xml
  variables:
    CSCS_ADDITIONAL_MOUNTS: '["/capstor/scratch/cscs/<my_username>/data:/data"]'

job2:
  extends: .container-runner-eiger-zen2
  image: $CSCS_REGISTRY_PATH/x86_64/my_image:$CI_COMMIT_SHORT_SHA
  script:
    - /usr/bin/my_application ./data_in_repository.txt
  variables:
    GIT_STRATEGY: fetch
```
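In addition, here is a hedged sketch of a job that shares one GPU between several ranks: the image path is a placeholder, 'YES' is assumed to be the value that enables CRAY_CUDA_MPS, and the SLURM_* variables are borrowed from the uenv-runner examples further down this page.

```yaml
job3:
  extends: .container-runner-daint-gh200
  image: $CSCS_REGISTRY_PATH/aarch64/my_image:$CI_COMMIT_SHORT_SHA   # placeholder image path
  script:
    - /usr/bin/my_gpu_application
  variables:
    CRAY_CUDA_MPS: 'YES'        # assumed enabling value; the documented default is NO
    SLURM_JOB_NUM_NODES: 1      # SLURM_* variables as used in the uenv-runner examples below
    SLURM_NTASKS: 4
```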
container-runner-lightweight
This runner allows lightweight jobs that do not need many resources. The advantage is that the job does not run via SLURM and can therefore start faster. The maximum timeout for this runner is 60 minutes, and you can request at most 4 CPUs and 4GB of memory. If your job does not fit these requirements, then you must use the default container-runner.
Typical examples of when this runner is the right choice (though not limited to these use cases):
- Upload code coverage artifacts
- Create a dynamic pipeline yaml file
Currently only a runner on Zen2 microarchitecture exists.
The naming for the runner is container-runner-lightweight-<MICROARCHITECTURE>.
The following runners are available:
- .container-runner-lightweight-zen2
This runner is restricted to public images. It is not possible to run an image that cannot be pulled anonymously. If you have built a container image in a previous stage and stored it in $CSCS_REGISTRY_PATH
, then you must ensure that it is in a subdirectory with the name public
, i.e., the image path must match the wildcard $CSCS_REGISTRY_PATH/**/public/**
.
You can set the CPU and memory requests/limits with variables. A request specifies the minimum amount of resources that your job requires. Your job will not be scheduled until the requested resources are available. A limit is the maximum that your job might be able to use if available, but the job is not guaranteed to be allocated that limit.
Variable | Additional information |
---|---|
KUBERNETES_CPU_REQUEST | Number of CPUs minimally needed to schedule this job. This variable is optional and the default value is 1. EXAMPLE VALUE: |
KUBERNETES_CPU_LIMIT | Limit the job to use at most that many CPUs. This variable is optional and the default value is 1. EXAMPLE VALUE: |
KUBERNETES_MEMORY_REQUEST | The amount of memory minimally needed to schedule the job. This variable is optional and the default value is 1Gi. EXAMPLE VALUE: |
KUBERNETES_MEMORY_LIMIT | Limit the job to use at most this much memory. You will get an OOM (out-of-memory) error if you exceed the limit. This variable is optional and the default value is 1Gi. EXAMPLE VALUE: |
The following example demonstrates running a lightweight containerized job:
```yaml
job:
  extends: .container-runner-lightweight-zen2
  image: docker.io/python:3.11
  script:
    - ci/pipeline/generate_pipeline.py > dynamic_pipeline.yaml
  artifacts:
    paths:
      - dynamic_pipeline.yaml
```
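If the defaults are too small, the requests and limits from the table above can be raised within the 4 CPU / 4GB ceiling. In this hedged sketch the resource values and the helper script ci/upload_coverage.sh are illustrative placeholders.

```yaml
coverage-upload:
  extends: .container-runner-lightweight-zen2
  image: docker.io/python:3.11
  script:
    - ./ci/upload_coverage.sh       # hypothetical helper script in your repository
  variables:
    KUBERNETES_CPU_REQUEST: 2       # schedule only when 2 CPUs are available
    KUBERNETES_CPU_LIMIT: 2         # never use more than 2 CPUs
    KUBERNETES_MEMORY_REQUEST: 2Gi
    KUBERNETES_MEMORY_LIMIT: 3Gi    # exceeding this limit results in an OOM error
```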
uenv-builder
This runner submits SLURM jobs through FirecREST. See the comments in the FirecREST section below and the FirecREST documentation for additional information.
The naming for the runner is uenv-builder-<CLUSTERNAME>-<MICROARCHITECTURE>.
The following runners are available:
- .uenv-builder-eiger-zen2
- .uenv-builder-daint-gh200
uenv-builder
is very similar to the container-builder; the main difference is that you build a uenv from a recipe directory instead of a Dockerfile.
The uenv will be registered under the name $UENV_NAME/$UENV_VERSION:$UENV_TAG.
A uenv will only be rebuilt if there is no uenv already registered under that name.
The tag's default value is computed as a hash of the contents of your uenv recipe yaml files, which ensures that the uenv is rebuilt every time the content of the recipe's yaml files changes. In addition to the computed hash value, the uenv image will also be registered under the name $UENV_NAME/$UENV_VERSION:$CI_PIPELINE_ID, which allows referring to the image in subsequent uenv-runner jobs.
Variable | Additional information |
---|---|
UENV_NAME | The name of the uenv. The default is empty, and you must explicitly set it to a name. Use alphanumeric characters. EXAMPLE VALUE: |
UENV_VERSION | The version of the uenv. The default is empty, and you must explicitly set it to a version. Use alphanumeric characters. EXAMPLE VALUE: |
UENV_RECIPE | The path to the directory containing the recipe yaml files. The default is empty and you must explicitly set it to the relative directory in your repository. EXAMPLE VALUE: |
UENV_TAG | Optional parameter. By default the tag is a hash computed from the contents of the recipe directory. You can set it to an explicit (changing) tag if you want to opt out of the feature that a uenv is only rebuilt when the contents of the recipe yaml files change. |
The following two examples demonstrate CI jobs for building a uenv:

```yaml
job1:
  extends: .uenv-builder-eiger-zen2
  variables:
    UENV_NAME: prgenv-gnu
    UENV_VERSION: '24.10'
    UENV_RECIPE: ci/uenv-recipes/prgenv-gnu/eiger-zen2

job2:
  extends: .uenv-builder-daint-gh200
  variables:
    UENV_NAME: prgenv-gnu
    UENV_VERSION: '24.10'
    UENV_RECIPE: ci/uenv-recipes/prgenv-gnu/daint-gh200
```
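To make the pipeline-ID tagging described above concrete, the hedged sketch below builds a uenv and mounts it in a later job by referring to the $CI_PIPELINE_ID tag; the recipe path and the view name 'default' are placeholders that must match your recipe.

```yaml
build-uenv:
  extends: .uenv-builder-eiger-zen2
  variables:
    UENV_NAME: prgenv-gnu
    UENV_VERSION: '24.10'
    UENV_RECIPE: ci/uenv-recipes/prgenv-gnu/eiger-zen2   # placeholder recipe directory

test-uenv:
  extends: .uenv-runner-eiger-zen2
  # refer to the uenv built above via the pipeline-ID tag registered by uenv-builder
  image: prgenv-gnu/24.10:$CI_PIPELINE_ID
  script:
    - gcc --version
  variables:
    UENV_VIEW: 'default'                                 # assumed view name; must exist in the recipe
```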
uenv-runner
This runner submits SLURM jobs through FirecREST. See the comments in the FirecREST section below and the FirecREST documentation for additional information.
The naming for the runner is uenv-runner-<CLUSTERNAME>-<MICROARCHITECTURE>.
The following runners are available:
- .uenv-runner-eiger-zen2
- .uenv-runner-daint-gh200
uenv-runner
is very similar to the container-runner; the main difference is that you run with a uenv image mounted instead of inside a container.
The uenv image is specified with the image tag in the job yaml. This tag is mandatory. Supported variables are:
Variable | Additional information |
---|---|
UENV_VIEW | This is the view to load together with the uenv. The default is empty, and you must explicitly set it to a valid view corresponding to the uenv. EXAMPLE VALUE: |
CRAY_CUDA_MPS | Enable running with nvidia-mps-server, which allows multiple ranks to share the same GPU. This variable is optional and the default value is NO. EXAMPLE VALUE: |
The following two examples demonstrate CI jobs for running with a uenv image:

```yaml
job1:
  extends: .uenv-runner-eiger-zen2
  image: prgenv-gnu/24.7:v3
  script:
    - gcc --version
  variables:
    UENV_VIEW: 'default'

job2:
  extends: .uenv-runner-daint-gh200
  image: gromacs/2024:v1
  script:
    - gmx_mpi --version
  variables:
    UENV_VIEW: 'gromacs'
    SLURM_JOB_NUM_NODES: 1
    SLURM_NTASKS: 4
```
baremetal-runner
This runner submits SLURM jobs through FirecREST. See the comments in the FirecREST section below and the FirecREST documentation for additional information.
The naming for the runner is baremetal-runner-<CLUSTERNAME>-<MICROARCHITECTURE>.
The following runners are available:
- .baremetal-runner-eiger-zen2
- .baremetal-runner-daint-gh200
This runner mode is almost equivalent to writing an sbatch script. Instead of #SBATCH directives, you use the SLURM_* variables to specify your SLURM requirements. Otherwise, as is typical for sbatch, all commands are executed only on the first node, unless you launch them explicitly via srun with multiple ranks.
The following example demonstrates running a bare-metal job:
```yaml
job:
  extends: .baremetal-runner-daint-gh200
  script:
    - hostname
    - srun --uenv prgenv-gnu/24.7:v3 --view=default gcc --version
  variables:
    SLURM_JOB_NUM_NODES: 1
```
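As a hedged variation of the example above (the node and task counts are purely illustrative), the SLURM_* variables take the place of what would otherwise be #SBATCH directives:

```yaml
job:
  extends: .baremetal-runner-daint-gh200
  script:
    - hostname                                                      # executed on the first node only
    - srun --uenv prgenv-gnu/24.7:v3 --view=default gcc --version   # executed on every allocated rank
  variables:
    SLURM_JOB_NUM_NODES: 2    # stands in for '#SBATCH --nodes=2'
    SLURM_NTASKS: 8           # stands in for '#SBATCH --ntasks=8'
```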
f7t-controller
This runner allows submitting jobs to clusters using FirecREST. With this runner, all the dependencies for submitting jobs with FirecREST are already available in the environment. You can either use the client tool firecrest
, or a python script that uses the pyfirecrest library. When the job starts, the runner will set four environment variables, which are needed to allow submitting jobs through FirecREST. These variables are:
- AUTH_TOKEN_URL: the same value as the variable F7T_TOKEN_URL in the job description
- FIRECREST_URL: the same value as the variable F7T_URL in the job description
- FIRECREST_CLIENT_ID: the value that is set on the CI setup page in the admin section
- FIRECREST_CLIENT_SECRET: the value that is set on the CI setup page in the admin section
A job can be submitted with the client, e.g. via
```bash
firecrest submit --system eiger --account $CSCS_CI_DEFAULT_SLURM_ACCOUNT my_script.sh
```
The following example demonstrates submitting a CI job and waiting for it to finish:
```yaml
job:
  extends: .f7t-controller
  script:
    - CLUSTER=eiger
    - SUBMISSION="$(firecrest submit --system $CLUSTER --account $CSCS_CI_DEFAULT_SLURM_ACCOUNT script.sh)"
    - JOBID=$(echo "$SUBMISSION" | grep "jobid" | sed -e 's/.*jobid[^0-9]*\([0-9]\+\),/\1/')
    - |
      while firecrest poll-active --raw --system $CLUSTER | grep $JOBID ; do
        echo "job is still in queue/running"
        sleep 30
      done
```
reframe-runner
This runner runs ReFrame; the ReFrame test jobs are submitted through FirecREST. The runner is a thin wrapper over the f7t-controller. The machine where ReFrame itself runs does not have to be powerful, so it does not make sense to start the ReFrame process on a compute node; instead, ReFrame is started on a smaller machine and the compute jobs are submitted through FirecREST to the actual cluster.
Variable | Additional information |
---|---|
RFM_VERSION | This ReFrame version will be available on the machine where the job is running. This variable is optional and defaults to a recent version of ReFrame. EXAMPLE VALUE: |
RFM_CONFIG | The path to the configuration file that is passed to ReFrame through the -C option. This variable is mandatory, unless you override the default script tag. EXAMPLE VALUE: |
RFM_CHECKPATH | The path to the checks that is passed to ReFrame through the -c option. This variable is mandatory, unless you override the default script tag. EXAMPLE VALUE: |
The default command that is executed is

```bash
reframe -C $RFM_CONFIG -c $RFM_CHECKPATH -Sbuild_locally=0 --report-junit=report.xml -r
```

This default can be overwritten by providing a user-defined script tag in the job.
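For instance, here is a hedged sketch of such an override: the .reframe-runner template name is assumed to follow the naming of the other runners, and the extra --performance-report flag and the paths are purely illustrative.

```yaml
job:
  extends: .reframe-runner                  # assumed template name
  script:
    - reframe -C $RFM_CONFIG -c $RFM_CHECKPATH -Sbuild_locally=0 --report-junit=report.xml --performance-report -r
  variables:
    RFM_CONFIG: config/my_config.py         # placeholder paths
    RFM_CHECKPATH: checks/
```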
The following example demonstrates submitting a CI job to run ReFrame.
```yaml
job:
  before_script:
    - git clone -b alps https://github.com/eth-cscs/cscs-reframe-tests
    - pip install -r cscs-reframe-tests/config/utilities/requirements.txt
    - sed -i -e "s/account=csstaff/account=$CSCS_CI_DEFAULT_SLURM_ACCOUNT/" cscs-reframe-tests/config/systems-firecrest/eiger.py
  variables:
    FIRECREST_SYSTEM: 'eiger'
    FIRECREST_BASEDIR: /capstor/scratch/cscs/jenkssl/reframe-runner
    RFM_FIRECREST: '1'
    RFM_CONFIG: cscs-reframe-tests/config/cscs.py
    RFM_CHECKPATH: cscs-reframe-tests/checks/microbenchmarks/mpi/halo_exchange
```
The easiest way to use the FirecREST scheduler of ReFrame is to use the configuration files provided in the alps branch of this repository: https://github.com/eth-cscs/cscs-reframe-tests. In case you want to run ReFrame for a system that is not already available there, please open a ticket with the Service Desk and we will add it or help you update one of the existing ones.
Something you should be aware of when running with this scheduler is that ReFrame does not have direct access to the filesystem of the cluster, so the stage directory needs to be kept in sync through FirecREST. It is recommended to clean the stage directory whenever possible with the postrun_cmds and postbuild_cmds, and to avoid autodetection of the processor in each run. Normally ReFrame stores these files in ~/.reframe/topology/{system}-{part}/processor.json, but you get a "clean" runner every time. You could either add the processor description to the configuration files, or store the files in the first run and copy them to the right directory before ReFrame runs.
Finally, you can find some more information here: https://github.com/eth-cscs/cscs-reframe-tests/blob/alps/config/systems-firecrest/README.md
FirecREST
This is not a runner per se, but since most runners are built on top of FirecREST, here are some relevant notes on how CI interacts with FirecREST.
CI will submit jobs with the FirecREST client id/secret that have been stored at the CI setup page at https://cicd-ext-mw.cscs.ch (in the Admin section). Storing the client id/secret is mandatory, because most runners will not work without these credentials.
The credentials are tied to a CSCS username, hence the pipeline will run within the context of this user. It is possible, and encouraged, to request a CI service account via a Service Desk ticket; the FirecREST credentials can then be tied to the CI service account.
You will always need 4 pieces of information to interact with FirecREST:
- Token dispenser URL
- API endpoint URL
- Client ID
- Client Secret
In the CI context the token dispenser URL is passed with the variable F7T_TOKEN_URL, and the API endpoint is passed with the variable F7T_URL. The client ID/secret are stored on the CI setup page.
In a nutshell, the client ID and client secret are used to request an access token from the token dispenser URL. The token dispenser replies with an access token if and only if the client ID/secret pair is valid. This access token is then used to authenticate the API requests that are sent to the FirecREST API endpoint.
The runners documented above set the correct F7T_TOKEN_URL and F7T_URL for the respective clusters. When you are running on the f7t-controller runner, you might have to modify these default variables, because this runner does not target a specific cluster and can target different clusters in the same job. Targeting different clusters in the same job can require providing a different F7T_URL; the F7T_TOKEN_URL is currently the same for any cluster.
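As a hedged sketch of such an override (the endpoint URL and the cluster name are placeholders, not actual CSCS values), an f7t-controller job targeting a specific cluster could set:

```yaml
job:
  extends: .f7t-controller
  variables:
    F7T_URL: https://firecrest.example.cscs.ch   # placeholder API endpoint for the cluster you target
  script:
    - firecrest submit --system <clustername> --account $CSCS_CI_DEFAULT_SLURM_ACCOUNT my_script.sh
```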