All CSCS systems use the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:

  • sbatch -A <project> script.sh - submit the batch script script.sh using the budget of <project>
  • squeue -u $USER - check the status of $USER jobs on the system
  • scancel jobID - remove your job with Slurm ID jobID from the queue

Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs saved in the Slurm database can be inspected with the sacct command: please have a look at man sacct for more information.
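
For example, a typical workflow could look like the following minimal sketch, where 123456 stands for a Slurm job ID and <project> for your project ID:

sbatch -A <project> script.sh     # submit the batch script
squeue -u $USER                   # check the status of your jobs
sacct -j 123456                   # inspect the accounting data of job 123456
scancel 123456                    # remove job 123456 from the queue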

Please do not run squeue without options, as listing all jobs in the queue impacts the performance of the scheduler. You can select a subset of the jobs using squeue options such as -A, --account=<account_list> or -u, --user=<user_list>. Please have a look at man squeue for more options, for instance --me, which is equivalent to --user=<my username>.
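
For instance, the following commands list only the jobs that concern you:

squeue --me                       # equivalent to --user=<my username>
squeue -u $USER -A <project>      # your jobs charged to a given project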


An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning: these directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS).
A suitable batch script is then submitted to the batch system using the sbatch command: let us remind you that a valid account must be provided using the Slurm option --account=<project> or -A <project>, where the string <project> is the ID of the active project that will be charged for the allocation. 

On Daint.Alps, the user's primary account will be charged by default if the Slurm option -A / --account is not set.


Please note that on most systems the selected project will be charged per node usage, whether you submit batch jobs with sbatch or allocate interactive sessions on compute nodes with salloc or with the srun command directly: please have a look at the corresponding man pages for more details. Interactive allocations (sessions started with salloc or srun) are meant for debugging purposes only and have a limited wall time duration.
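
As a sketch, and assuming a hypothetical executable myprogram.x, a short interactive debugging session could be requested as follows:

salloc -A <project> --nodes=1 --time=00:30:00   # request one node for 30 minutes
srun ./myprogram.x                              # run on the allocated compute node
exit                                            # release the allocation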

A basic Slurm batch script can be written by adding just the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
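
A minimal sketch of such a script, again with a placeholder executable, could be:

#!/bin/bash -l
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --account=<project>

srun ./myprogram.x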

Output and Error

The output of your script will be put by default into the file slurm-<SLURM_JOB_ID>.out where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err: both files will be found in the directory from which you launched the job.

Note that with Slurm the output file is created as soon as your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore do not delete this file while your job is running, or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: this means that the output of your job will not appear in the output files immediately. If you want to override this behaviour, you should pass the option -u or --unbuffered to the srun command: the output will then appear in the file as soon as it is produced.
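
For example, with a placeholder executable:

srun --unbuffered ./myprogram.x   # output appears in the file as soon as it is produced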

If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below:

#!/bin/bash -l
#
#SBATCH --job-name="hello_world_mpi"
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=18
#SBATCH --output=hello_world_mpi.%j.o
#SBATCH --error=hello_world_mpi.%j.e
#SBATCH --account=<project>

srun ./hello_world_mpi.x

The string <project> is the ID of the active project that will be charged for the allocation.
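
Assuming the script above is saved as hello_world_mpi.sbatch (the file name is arbitrary), it can be submitted with:

sbatch hello_world_mpi.sbatch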

Slurm best practices on CSCS Cray systems

Cray systems at CSCS are equipped with Slurm: the Cray command aprun has been replaced by the Slurm command srun. The migration from Cray ALPS (Application Level Placement Scheduler) to native Slurm is supported by the simple examples below, covering the most common usage with MPI and hybrid MPI/OpenMP jobs. The Slurm man pages (e.g. man sbatch) will give useful information and more details on specific options, along with the Slurm documentation.

Please note that the Slurm scheduling system is a shared resource that can handle only a limited number of batch jobs and interactive commands simultaneously. Therefore users are not supposed to submit large numbers of Slurm jobs and commands at the same time, as doing so would infringe our fair share usage policy. If your production workflow requires you to submit a large number of jobs to the queueing system concurrently, please contact us beforehand. You might consider using the GREASY meta-scheduler approach to manage the execution of several independent parallel simulations.

Advanced users might also be interested in consulting the presentations available online from the Slurm User Group meeting, covering the features of the latest Slurm release.

#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=gpu
#SBATCH --time=00:30:00
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK
srun --cpu_bind=rank ./myprogram.x

The example above shows an MPI job allocated on four nodes using hyperthreading on Piz Daint. You need the Slurm option --constraint=gpu to run on the XC50 nodes, which feature a GPU accelerator on each node. If you select the XC40 instead with the option --constraint=mc, the same srun options apply, but you will need to adjust the task count to the Broadwell compute nodes, which feature two sockets with eighteen cores each. The flag -l at the beginning of the script allows you to call the module command within the script, in case you need it.

#!/bin/bash -l
#
#SBATCH --time=00:30:00
#SBATCH --nodes=4
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=18
#SBATCH --constraint=mc
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK
srun --hint=nomultithread ./test.mpi

The Slurm script above shows how to run a hybrid job with two MPI tasks per node, spawning eighteen threads per socket on a two-socket Broadwell compute node. The srun option --hint=nomultithread avoids using extra threads with in-core multi-threading, a configuration that can benefit communication-intensive applications (see man srun for further details).
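
Depending on the application, you may also want to set the number of OpenMP threads explicitly inside the batch script; a common pattern (not mandated by Slurm itself) is:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # one OpenMP thread per allocated CPU
srun --hint=nomultithread ./test.mpi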

Slurm release 22 and newer

The number of CPUs per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.
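
For example, inside a batch script that requests #SBATCH --cpus-per-task=18, either of the following would propagate the value to srun (myprogram.x is a placeholder):

srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./myprogram.x
# or, equivalently:
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./myprogram.x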

Large jobs

The example below shows the commands required to run large MPI jobs:

  • PMI_MMAP_SYNC_WAIT_TIME and the srun option --wait prevent Slurm from killing tasks that take a long time to run
  • The srun option --bcast copies the binary to /tmp on all compute nodes before launching the tasks, which improves task startup time
export PMI_MMAP_SYNC_WAIT_TIME=300
srun --wait 200 --bcast=/tmp/hello-world $HOME/jobs/bin/daint/hello-world

Synoptic table

The table below reports the corresponding options of the former Cray command aprun and the current Slurm command srun:

Option                 aprun        srun
MPI tasks              -n           -n, --ntasks
MPI tasks per node     -N           --ntasks-per-node
CPUs per task          -d           -c, --cpus-per-task
Thread/task affinity   -cc cpu      --cpu_bind=rank
Large memory nodes     -q bigmem    --mem=120GB
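
As an illustration of the mapping above, a former ALPS launch line such as aprun -n 48 -N 12 -d 2 ./myprogram.x (with a placeholder executable) would translate to:

srun --ntasks=48 --ntasks-per-node=12 --cpus-per-task=2 ./myprogram.x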