This page documents the interfaces for the preconfigured CI/CD runners provided by CSCS. A fixed interface allows CSCS to change technologies without breaking users' CI pipelines. For example, the container-builder interface takes a Dockerfile as input and pushes a container image to the container registry as output. The implementation details of transforming the Dockerfile into a container image are hidden, and from a user perspective it is irrelevant whether the image is built using docker, podman, buildah, etc.

Each runner and its interface are described below. To use a runner provided by CSCS, you must include the runner configuration yaml file in your pipeline like this:

include:
  - remote: 'https://gitlab.com/cscs-ci/recipes/-/raw/master/templates/v2/.ci-ext.yml'

container-builder

We provide the container-builder runner for every CPU architecture available at CSCS. This runner takes a Dockerfile as input, builds a container image based on the recipe in the Dockerfile, and publishes the image to an OCI registry.

The naming for the runner is .container-builder-cscs-<MICROARCHITECTURE>.

The following runners are available:

  • .container-builder-cscs-zen2 
  • .container-builder-cscs-gh200 

The supported variables are:

DOCKERFILE

Relative path in your repository to the Dockerfile recipe.

This variable is mandatory and has no default value.


EXAMPLE: ci/docker/Dockerfile

PERSIST_IMAGE_NAME 

The path where the container image will be stored. CSCS provides a registry through the variable CSCS_REGISTRY_PATH. Images stored in the CSCS-provided registry can only be accessed from the CSCS network. A pipeline has read and write access to any path inside $CSCS_REGISTRY_PATH.

See also dependency management for common naming and third-party registry usage.

This variable is mandatory and has no default value.


EXAMPLE: $CSCS_REGISTRY_PATH/some_subdirectory/my_image:$CI_COMMIT_SHORT_SHA

CSCS_BUILD_IN_MEMORY

Instructs the runner to perform the whole build process in memory. The default value is TRUE, and you should only set it to FALSE if you see your job failing due to out-of-memory errors.

This variable is optional and is TRUE by default.


EXAMPLE: TRUE

DOCKER_BUILD_ARGS

This allows the use of the keyword ARG in your Dockerfile. The value must be a valid JSON array, and it is almost always correct to wrap the full value in single quotes.

This variable is optional and the default value is an empty string.


EXAMPLE: '["MY_BUILD_ARG1=SOME_VALUE", "ANOTHER_BUILD_ARG=123"]'

CSCS_REBUILD_POLICY

This variable has two possible values:

  • if-not-exists 
  • always 

When the value is always, then a new container image will always be built.

When the value is if-not-exists, then the runner will first check in the registry whether the image $PERSIST_IMAGE_NAME already exists. If it exists, then the runner will not rebuild the image. This is useful to avoid rebuilding base containers.

This variable is optional and the default value is if-not-exists .


EXAMPLE: if-not-exists

SECONDARY_REGISTRY 

Allows pushing the image to $SECONDARY_REGISTRY in addition to $PERSIST_IMAGE_NAME. The resulting image will be pushed to both registries.

This variable is optional and is empty by default.


EXAMPLE: docker.io/my_username/my_image:1.0

SECONDARY_REGISTRY_USERNAME 

The username used to push to $SECONDARY_REGISTRY.

This variable is optional and is empty by default.


EXAMPLE: my_username

SECONDARY_REGISTRY_PASSWORD 

The password/token used to push to $SECONDARY_REGISTRY.

For security, you should store the secret as a variable on the CI setup page and forward it in the job yaml.

This variable is optional and is empty by default.


EXAMPLE: $DOCKERHUB_TOKEN (assuming you stored DOCKERHUB_TOKEN as a variable on the CI setup page)

CUSTOM_REGISTRY_USERNAME 

If $PERSIST_IMAGE_NAME  is not inside the CSCS default registry, then you have to provide the credentials for pushing to the registry.

This variable is optional and is empty by default.


EXAMPLE: my_username

CUSTOM_REGISTRY_PASSWORD 

The password/token used to push to the custom registry.

For security, you should store the secret as a variable on the CI setup page and forward it in the job yaml.

This variable is optional and is empty by default.


EXAMPLE: $DOCKERHUB_TOKEN (assuming you stored DOCKERHUB_TOKEN as a variable on the CI setup page)


Two container build arguments are always provided during the image build process:

  • CSCS_REGISTRY_PATH: During the build process this is exposed as a build argument, so that you can write
ARG CSCS_REGISTRY_PATH
FROM $CSCS_REGISTRY_PATH/some_subdirectory/my_image:latest
  • NUM_PROCS: An integer value with the number of cores allocated for your build job. This allows you to write something like this in your Dockerfile:
ARG NUM_PROCS
RUN cd build && make -j$NUM_PROCS


During the image build, the source code is both the build context and also bind-mounted at /sourcecode. This means that you can copy source files into the image with either of the following instructions in the Dockerfile recipe:

  • COPY . /tmp/my_sourcecode
  • RUN cp -a /sourcecode /tmp/my_sourcecode

If you do not need the source code at all in the final container, you can also work directly with the source code in /sourcecode. The path is mounted read-only, therefore your software MUST support out-of-source builds.

The following two examples demonstrate CI jobs for building a container image:

job1:
  extends: .container-builder-cscs-zen2
  variables:
    DOCKERFILE: ci/docker/Dockerfile
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/x86_64/my_image:$CI_COMMIT_SHORT_SHA

job2:
  extends: .container-builder-cscs-gh200
  variables:
    DOCKERFILE: ci/docker/Dockerfile
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/aarch64/my_image:$CI_COMMIT_SHORT_SHA
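
The optional variables are set in the same way. For example, a job that forwards Docker build arguments and always rebuilds the image could look like this (the job name and values are illustrative):

job3:
  extends: .container-builder-cscs-zen2
  variables:
    DOCKERFILE: ci/docker/Dockerfile
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/x86_64/my_image:$CI_COMMIT_SHORT_SHA
    DOCKER_BUILD_ARGS: '["MY_BUILD_ARG1=SOME_VALUE", "ANOTHER_BUILD_ARG=123"]'
    CSCS_REBUILD_POLICY: always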

container-runner

This runner submits SLURM jobs through FirecREST. See the FirecREST section below and the FirecREST documentation for more information.

The naming for the runner is .container-runner-<CLUSTERNAME>-<MICROARCHITECTURE>.

The following runners are available:

  • .container-runner-eiger-zen2
  • .container-runner-daint-gh200

The container image is specified in the tag image in the job yaml. This tag is mandatory. Supported variables are:

GIT_STRATEGY 

This is a default GitLab variable, but it is mentioned here explicitly because very often you do not need to clone the repository source code when you run your containerized application.

The default is none; you must explicitly set it to fetch or clone if you want the runner to fetch the source code.


EXAMPLE VALUE: none

CRAY_CUDA_MPS

Enable running with nvidia-mps-server, which allows multiple ranks to share the same GPU.

This variable is optional and the default value is NO.


EXAMPLE VALUE: NO

USE_MPI

Enable running with the MPI hook, which injects the host MPI library into the container at runtime for native MPI performance.

This variable is optional and the default value is AUTO, which resolves to YES if you run with more than one rank and NO otherwise.


EXAMPLE VALUE: YES 

CSCS_ADDITIONAL_MOUNTS 

This allows mounting user-defined host directories inside the container. The value must be a valid JSON array, where each entry is of the form:

<host-path>:<container-path>

This variable is optional and by default empty.


EXAMPLE VALUE: '["/path/to/data/on/host/machine:/path/inside/container", "/path/to/something/else:/another/path/inside/the/container"]'

The following two examples demonstrate CI jobs for running with a container image:

job1:
  extends: .container-runner-daint-gh200
  image: $CSCS_REGISTRY_PATH/aarch64/my_image:$CI_COMMIT_SHORT_SHA
  script:
    - /usr/bin/my_application /data/some_input.xml
  variables:
    CSCS_ADDITIONAL_MOUNTS: '["/capstor/scratch/cscs/<my_username>/data:/data"]'

job2:
  extends: .container-runner-eiger-zen2
  image: $CSCS_REGISTRY_PATH/x86_64/my_image:$CI_COMMIT_SHORT_SHA 
  script:
    - /usr/bin/my_application ./data_in_repository.txt
  variables:
    GIT_STRATEGY: fetch
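
These jobs are submitted to SLURM, so multi-rank runs are possible as well. A sketch, assuming that the SLURM_* variables shown in the uenv-runner examples further below are also honoured by this runner (the application path and values are illustrative):

job3:
  extends: .container-runner-daint-gh200
  image: $CSCS_REGISTRY_PATH/aarch64/my_image:$CI_COMMIT_SHORT_SHA
  script:
    - /usr/bin/my_mpi_application
  variables:
    USE_MPI: 'YES'
    SLURM_JOB_NUM_NODES: 2
    SLURM_NTASKS: 8

USE_MPI is set explicitly here, although the default AUTO would already enable it when running with more than one rank.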

container-runner-lightweight

This runner is intended for lightweight jobs that do not need many resources. The advantage is that the job does not run via SLURM and can therefore start faster. The maximum timeout for this runner is 60 minutes, and you can request at most 4 CPUs and 4 GB of memory. If your job does not fit within these limits, then you must use the default container-runner.

A typical example of when this runner is the right choice is a small helper job, such as generating a dynamic child pipeline (see the example below), although it is not limited to this use case.

Currently only a runner on Zen2 microarchitecture exists. 

The naming for the runner is .container-runner-lightweight-<MICROARCHITECTURE>.

The following runners are available:

  • .container-runner-lightweight-zen2 

This runner is restricted to public images. It is not possible to run an image that cannot be pulled anonymously. If you have built a container image in a previous stage and stored it in $CSCS_REGISTRY_PATH, then you must ensure that it is in a subdirectory with the name public, i.e., the image path must match the wildcard $CSCS_REGISTRY_PATH/**/public/**.
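
A minimal sketch of this pattern, with a build job in an earlier stage storing the image under a public path and the lightweight job consuming it (job names and paths are illustrative):

build_image:
  extends: .container-builder-cscs-zen2
  variables:
    DOCKERFILE: ci/docker/Dockerfile
    PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/some_subdirectory/public/my_image:$CI_COMMIT_SHORT_SHA

lightweight_job:
  extends: .container-runner-lightweight-zen2
  image: $CSCS_REGISTRY_PATH/some_subdirectory/public/my_image:$CI_COMMIT_SHORT_SHA
  script:
    - echo "this image can be pulled anonymously"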

You can set the CPU and memory requests/limits with variables. A request specifies the minimum amount of resources that your job requires. Your job will not be scheduled until the requested resources are available. A limit is the maximum that your job might be able to use if available, but the job is not guaranteed to be allocated that limit.

The supported variables are:

KUBERNETES_CPU_REQUEST

The minimum number of CPUs needed to schedule this job.

This variable is optional and the default value is 1.


EXAMPLE VALUE: 1

KUBERNETES_CPU_LIMIT 

Limit the job to use at most that many CPUs.

This variable is optional and the default value is 1.


EXAMPLE VALUE: 1

KUBERNETES_MEMORY_REQUEST 

The minimum amount of memory needed to schedule the job.

This variable is optional and the default value is 1Gi.


EXAMPLE VALUE: 1Gi

KUBERNETES_MEMORY_LIMIT 

Limit the job to use at most this much memory. You will get an OOM (out-of-memory) error if you exceed the limit.

This variable is optional and the default value is 1Gi.


EXAMPLE VALUE: 1Gi

The following example demonstrates running a lightweight containerized job:

job:
  extends: .container-runner-lightweight-zen2
  image: docker.io/python:3.11
  script:
    - ci/pipeline/generate_pipeline.py > dynamic_pipeline.yaml
  artifacts:
    paths:
      - dynamic_pipeline.yaml
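
If the defaults are too small for your job, the requests and limits can be raised up to the stated maximum, for example (the values are illustrative):

job:
  extends: .container-runner-lightweight-zen2
  image: docker.io/python:3.11
  script:
    - ci/pipeline/generate_pipeline.py > dynamic_pipeline.yaml
  variables:
    KUBERNETES_CPU_REQUEST: 2
    KUBERNETES_CPU_LIMIT: 4
    KUBERNETES_MEMORY_REQUEST: 2Gi
    KUBERNETES_MEMORY_LIMIT: 4Gi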

uenv-builder

This runner submits SLURM jobs through FirecREST. See the FirecREST section below and the FirecREST documentation for more information.

The naming for the runner is .uenv-builder-<CLUSTERNAME>-<MICROARCHITECTURE>.

The following runners are available:

  • .uenv-builder-eiger-zen2
  • .uenv-builder-daint-gh200

uenv-builder is very similar to container-builder; the main difference is that you are building a uenv from a recipe directory instead of a Dockerfile.

The uenv will be registered under the name $UENV_NAME/$UENV_VERSION:$UENV_TAG.

A uenv will only be rebuilt if there is no uenv already registered under that name.

The tag's default value is calculated as a hash of the contents of your uenv recipe yaml files, which ensures that a uenv is rebuilt every time the content of the recipe's yaml files changes. In addition to the computed hash value, the uenv image will also be registered under the name $UENV_NAME/$UENV_VERSION:$CI_PIPELINE_ID, which allows you to refer to the image in subsequent uenv-runner jobs.

The supported variables are:

UENV_NAME

The name of the uenv.

The default is empty, and you must explicitly set it to a name. Use alphanumeric characters, dashes, underscores, and dots.


EXAMPLE VALUE: prgenv-gnu

UENV_VERSION

The version of the uenv.

The default is empty, and you must explicitly set it to a version. Use alphanumeric characters, dashes, underscores, and dots.


EXAMPLE VALUE: 24.10

UENV_RECIPE

The path to the directory containing the recipe yaml files.

The default is empty and you must explicitly set it to the relative directory in your repository.


EXAMPLE VALUE: ci/uenv-recipes/prgenv-gnu/gh200

UENV_TAG

Optional parameter. By default the tag will be a computed hash of the contents in $UENV_RECIPE.

You can set it to an explicit (changing) tag if you want to opt out of the feature that a uenv is only rebuilt when the contents of the recipe yaml files change.

The following two examples demonstrate CI jobs for building a uenv:

job1:
  extends: .uenv-builder-eiger-zen2
  variables:
    UENV_NAME: prgenv-gnu
    UENV_VERSION: 24.10
    UENV_RECIPE: ci/uenv-recipes/prgenv-gnu/eiger-zen2

job2:
  extends: .uenv-builder-daint-gh200
  variables:
    UENV_NAME: prgenv-gnu
    UENV_VERSION: 24.10
    UENV_RECIPE: ci/uenv-recipes/prgenv-gnu/daint-gh200
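
Because the image is also registered under the tag $CI_PIPELINE_ID, a job in a later stage can reference the freshly built uenv without knowing the computed hash. A sketch, assuming job1 above ran in an earlier stage of the same pipeline and that the uenv provides a default view (see the uenv-runner section below):

test_job:
  extends: .uenv-runner-eiger-zen2
  image: prgenv-gnu/24.10:$CI_PIPELINE_ID
  script:
    - gcc --version
  variables:
    UENV_VIEW: 'default'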

uenv-runner

This runner submits SLURM jobs through FirecREST. See the FirecREST section below and the FirecREST documentation for more information.

The naming for the runner is .uenv-runner-<CLUSTERNAME>-<MICROARCHITECTURE>.

The following runners are available:

  • .uenv-runner-eiger-zen2
  • .uenv-runner-daint-gh200

uenv-runner is very similar to container-runner; the main difference is that you are running with a uenv image mounted instead of inside a container.

The uenv image is specified in the tag image in the job yaml. This tag is mandatory. Supported variables are:

UENV_VIEW

This is the view to load together with the uenv.

The default is empty, and you must explicitly set it to a valid view corresponding to the uenv.


EXAMPLE VALUE: modules

CRAY_CUDA_MPS

Enable running with nvidia-mps-server, which allows multiple ranks to share the same GPU.

This variable is optional and the default value is NO.


EXAMPLE VALUE: NO

The following two examples demonstrate CI jobs for running with a uenv image:

job1:
  extends: .uenv-runner-eiger-zen2
  image: prgenv-gnu/24.7:v3
  script:
    - gcc --version
  variables:
    UENV_VIEW: 'default'

job2:
  extends: .uenv-runner-daint-gh200
  image:  gromacs/2024:v1
  script:
    - gmx_mpi --version
  variables:
    UENV_VIEW: 'gromacs'
    SLURM_JOB_NUM_NODES: 1
    SLURM_NTASKS: 4

baremetal-runner

This runner submits SLURM jobs through FirecREST. See the FirecREST section below and the FirecREST documentation for more information.

The naming for the runner is .baremetal-runner-<CLUSTERNAME>-<MICROARCHITECTURE>.

The following runners are available:

  • .baremetal-runner-eiger-zen2
  • .baremetal-runner-daint-gh200

This runner mode is almost equivalent to writing an sbatch script. Instead of #SBATCH directives, you use the SLURM_* variables to specify your SLURM requirements. As in an sbatch script, all commands are executed only on the first node, unless you explicitly run them via srun with multiple ranks.

The following example demonstrates running a baremetal job:

job:
  extends: .baremetal-runner-daint-gh200
  script:
    - hostname
    - srun --uenv prgenv-gnu/24.7:v3 --view=default gcc --version
  variables:
    SLURM_JOB_NUM_NODES: 1
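
A multi-rank variant might request more tasks and launch the command through srun, for example (the values are illustrative):

job2:
  extends: .baremetal-runner-eiger-zen2
  script:
    - srun hostname
  variables:
    SLURM_JOB_NUM_NODES: 2
    SLURM_NTASKS: 8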

f7t-controller

This runner allows submitting jobs to clusters using FirecREST. With this runner, all the dependencies for submitting jobs with FirecREST are already available in the environment. You can either use the firecrest client tool or a Python script that uses the pyfirecrest library. When the job starts, the runner will set four environment variables, which are needed to submit jobs through FirecREST. These variables are:

  • AUTH_TOKEN_URL : This is the same value as the variable F7T_TOKEN_URL in the job description
  • FIRECREST_URL : This is the same value as the variable F7T_URL  in the job description
  • FIRECREST_CLIENT_ID : The value that is set in the CI setup page in the admin section
  • FIRECREST_CLIENT_SECRET : The value that is set in the CI setup page in the admin section

A job can be submitted with the client, e.g. via

firecrest submit --system eiger --account $CSCS_CI_DEFAULT_SLURM_ACCOUNT my_script.sh

The following example demonstrates submitting a CI job and waiting for it to finish:

job:
  extends: .f7t-controller
  script:
    - CLUSTER=eiger
    - SUBMISSION="$(firecrest submit --system $CLUSTER --account $CSCS_CI_DEFAULT_SLURM_ACCOUNT script.sh)"
    - JOBID=$(echo "$SUBMISSION" | grep "jobid" | sed -e 's/.*jobid[^0-9]*\([0-9]\+\),/\1/')
    - |
      while firecrest poll-active --raw --system $CLUSTER | grep $JOBID ; do
        echo "job is still in queue/running"
        sleep 30
      done

reframe-runner

This runner runs ReFrame, and the ReFrame jobs are submitted with FirecREST. The runner is a thin wrapper over the f7t-controller. The machine where ReFrame runs does not have to be powerful, so it does not make sense to start the ReFrame process on a compute node; instead, the ReFrame process starts on a lightweight machine and submits the compute jobs through FirecREST to the actual cluster.

The supported variables are:

RFM_VERSION

This ReFrame version will be available on the machine where the job is running.

This variable is optional and defaults to a recent version of ReFrame.


EXAMPLE VALUE: 4.6.2

RFM_CONFIG 

The path to the config that is passed to reframe through -C.

This variable is mandatory unless you override the script tag.


EXAMPLE VALUE: relative/path/to/config.py

RFM_CHECKPATH

The path to the checks that is passed to reframe through -c.

This variable is mandatory unless you override the script tag.


EXAMPLE VALUE: relative/path/to/checks

The default command that is executed is

 reframe -C $RFM_CONFIG -c $RFM_CHECKPATH -Sbuild_locally=0 --report-junit=report.xml -r

This default can be overridden by providing a user-defined script tag in the job.
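
For instance, to pass extra options to ReFrame you could provide your own script tag; the following sketch adds a purely illustrative --tag filter to the default command:

job:
  script:
    - reframe -C $RFM_CONFIG -c $RFM_CHECKPATH -Sbuild_locally=0 --report-junit=report.xml --tag=ci -r
  variables:
    RFM_CONFIG: relative/path/to/config.py
    RFM_CHECKPATH: relative/path/to/checks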

The following example demonstrates submitting a CI job to run ReFrame.

job:
  before_script:
    - git clone -b alps https://github.com/eth-cscs/cscs-reframe-tests
    - pip install -r cscs-reframe-tests/config/utilities/requirements.txt
    - sed -i -e "s/account=csstaff/account=$CSCS_CI_DEFAULT_SLURM_ACCOUNT/" cscs-reframe-tests/config/systems-firecrest/eiger.py
  variables:
    FIRECREST_SYSTEM: 'eiger'
    FIRECREST_BASEDIR: /capstor/scratch/cscs/jenkssl/reframe-runner
    RFM_FIRECREST: '1'
    RFM_CONFIG: cscs-reframe-tests/config/cscs.py
    RFM_CHECKPATH: cscs-reframe-tests/checks/microbenchmarks/mpi/halo_exchange

The easiest way to use the FirecREST scheduler of ReFrame is to use the configuration files that are provided in the alps branch of this repository: https://github.com/eth-cscs/cscs-reframe-tests. In case you want to run ReFrame for a system that is not already available there, please open a ticket with the Service Desk and we will add it or help you update one of the existing configurations.

Be aware that, when running with this scheduler, ReFrame does not have direct access to the filesystem of the cluster, so the stage directory needs to be kept in sync through FirecREST. It is recommended to clean the stage directory whenever possible with the postrun_cmds and postbuild_cmds, and to avoid autodetection of the processor in each run. Normally ReFrame stores the autodetected topology in ~/.reframe/topology/{system}-{part}/processor.json, but you get a "clean" runner every time. You could either add the topology to the configuration files, or store the files in the first run and copy them to the right directory before ReFrame runs.

Finally, you can find some more information here: https://github.com/eth-cscs/cscs-reframe-tests/blob/alps/config/systems-firecrest/README.md

FirecREST

This is not a runner per se, but since most runners are built on top of FirecREST, here are some relevant notes on how CI interacts with FirecREST.

CI will submit jobs with the FirecREST client ID/secret that have been stored on the CI setup page at https://cicd-ext-mw.cscs.ch (in the Admin section). Storing the client ID/secret is mandatory, because most runners will not work without these credentials.

The credentials are tied to a CSCS username, hence the pipeline will run within the context of this user. It is possible, and encouraged, to request a CI service account via a Service Desk ticket; the FirecREST credentials can then be tied to the CI service account.

You will always need 4 pieces of information to interact with FirecREST:

  • Token dispenser URL
  • API endpoint URL
  • Client ID
  • Client Secret

In the CI context, the token dispenser URL is passed with the variable F7T_TOKEN_URL, and the API endpoint is passed with the variable F7T_URL. The client ID/secret are stored on the CI setup page.

In a nutshell, the client ID and client secret are used to request an access token from the token dispenser URL. The token dispenser will reply with an access token if and only if the client ID/secret pair is valid. This access token is then used to authenticate the API requests that are sent to the FirecREST API endpoint.

The runners documented above set the correct F7T_TOKEN_URL and F7T_URL for the respective clusters. When you are running on the f7t-controller runner, you might have to modify the default variables, because this runner does not target a specific cluster and can target different clusters in the same job. Targeting different clusters in the same job can require providing a different F7T_URL for each; the F7T_TOKEN_URL is currently the same for all clusters.
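
For example, an f7t-controller job that targets another cluster could override the endpoint in its variables (the URL and cluster name below are placeholders; use the FirecREST endpoint of the cluster you want to target):

job:
  extends: .f7t-controller
  variables:
    F7T_URL: https://<firecrest-endpoint-of-target-cluster>
  script:
    - firecrest submit --system <target-cluster> --account $CSCS_CI_DEFAULT_SLURM_ACCOUNT my_script.sh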
