All CSCS systems use the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:
- sbatch -A <project> script.sh : submit the batch script script.sh using the budget of <project>
- squeue -u $USER : check the status of the jobs of $USER on the system
- scancel jobID : remove your job with Slurm ID jobID from the queue
Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs saved in the Slurm database can be inspected with the sacct command: please have a look at man sacct for more information.
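For instance, a quick overview of the partitions and of your recent jobs could look like the following sketch (the selection of sacct fields is illustrative):

# list the available partitions and their state
sinfo
# summarise your jobs of the current day stored in the Slurm database
sacct --format=JobID,JobName,Partition,Elapsed,State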
Please don't run squeue without options, as listing all jobs in the queue impacts the performance of the scheduler. You can select a subset of the jobs using squeue options like -A, --account=<account_list> or -u, --user=<user_list>. Please have a look at man squeue for more options, for instance --me, which is equivalent to --user=<my username>.
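For example, the following sketch restricts the listing to your own jobs and, optionally, to a given project and job state (the project ID is a placeholder):

# list only your own jobs
squeue --me
# list only your running jobs charged to a given project
squeue -u $USER -A <project> -t RUNNING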
An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning: these directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS).
A suitable batch script is then submitted to the batch system using the sbatch command: let us remind you that a valid account must be provided using the Slurm option --account=<project> or -A <project>, where the string <project> is the ID of the active project that will be charged for the allocation.
On Daint.Alps, the user's primary account will be charged by default if the Slurm option -A / --account is not set.
Please note that the selected project will be charged per node usage on most systems, whether you submit batch jobs with sbatch, allocate interactive sessions on compute nodes with salloc, or use the srun command directly: please have a look at the corresponding man pages for more details. Interactive allocations (sessions started with salloc or srun) are meant for debugging purposes only and have a limited wall time duration.
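As an illustration, a short interactive debugging session could be requested as in the following sketch (node count, wall time and constraint are placeholders and must respect the limits of the target system):

# request one GPU node interactively for 30 minutes, charged to the given project
salloc -A <project> -N 1 -t 00:30:00 -C gpu
# once the allocation is granted, launch the application on the compute node
srun ./my_app.x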
A basic Slurm batch script can be written by just adding the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
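A minimal sketch of such a script could look as follows (the executable name is a placeholder; the account directive is included since a valid project must be provided, as explained above):

#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --account=<project>

srun ./my_app.x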
Output and Error
The output of your script will be put by default into the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err: both files will be found in the directory from which you launched the job.
Note that with Slurm the output file is created as soon as your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore do not delete this file while your job is running, or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: this means that the output of your job will not appear in the output files immediately. If you want to override this behaviour, you should pass the option -u or --unbuffered to the srun command: the output will then appear in the file as soon as it is produced.
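For example, the launch line inside your batch script could become the following (the executable name is a placeholder):

srun --unbuffered ./my_app.x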
If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below:
#!/bin/bash -l
#
#SBATCH --job-name="hello_world_mpi"
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=18
#SBATCH --output=hello_world_mpi.%j.o
#SBATCH --error=hello_world_mpi.%j.e
#SBATCH --account=<project>

srun ./hello_world_mpi.x
The string <project> is the ID of the active project that will be charged for the allocation.
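In the --output and --error file names above, %j is replaced by the job ID at submission time. Submitting the script and following its progress could then look like the following sketch (the script file name and the job ID in the output file name are placeholders):

sbatch hello_world_mpi.sbatch
squeue -u $USER
# once the job has started, follow the standard output as it is written
tail -f hello_world_mpi.<jobid>.o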
Slurm best practices on CSCS Cray systems
Cray systems at CSCS are equipped with Slurm: the Cray command aprun has been replaced by the Slurm command srun. The migration from Cray ALPS (Application Level Placement Scheduler) to native Slurm is supported by the simple examples available below for the most common usage with MPI and hybrid MPI/OpenMP jobs. Slurm man pages (e.g. man sbatch) will give useful information and more details on specific options, along with the Slurm documentation.
Please note that the Slurm scheduling system is a shared resource that can handle a limited number of batch jobs and interactive commands simultaneously. Therefore users are not supposed to submit arbitrary numbers of Slurm jobs and commands at the same time, as doing so would infringe our policy of fair share usage of resources. If your production workflow requires you to submit a large number of jobs to the queueing system concurrently, please contact us beforehand. You might consider using the GREASY meta-scheduler approach to manage the execution of several independent parallel simulations.
Advanced users might also be interested in consulting the presentations available online from the Slurm User Group meeting, covering the features of the latest Slurm release.
#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=gpu
#SBATCH --time=00:30:00
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK

srun --cpu_bind=rank ./myprogram.x
The example above shows an MPI job allocated on four nodes using hyperthreading on Piz Daint. You need to use the Slurm option --constraint=gpu to run on the XC50 with the GPU accelerator on each node. If you select the XC40 with the alternative option --constraint=mc, the same srun options apply, but you will need to adjust the task layout to the number of cores of the Broadwell compute node, which features two sockets with eighteen cores each. The flag -l at the beginning allows you to call the module command within the script, in case you need it.
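If you opt for the multicore nodes with a pure MPI job, the resource directives of the example above might be adapted as in the following sketch, filling the 36 cores per node with hyperthreading (the task counts are illustrative); the complete hybrid MPI/OpenMP example on the same nodes follows below:

#SBATCH --nodes=4
#SBATCH --ntasks=144
#SBATCH --ntasks-per-node=36
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=mc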
#!/bin/bash -l
#
#SBATCH --time=00:30:00
#SBATCH --nodes=4
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=18
#SBATCH --constraint=mc
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK

srun --hint=nomultithread ./test.mpi
The Slurm script above shows how to run a hybrid job with two MPI tasks per node, spawning eighteen threads per socket on a two-socket Broadwell compute node. The srun option --hint=nomultithread avoids using extra threads with in-core multi-threading, a configuration that can benefit communication-intensive applications (see man srun for further details).
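For an OpenMP application, you would typically also set the number of threads explicitly in the batch script before the srun line, for example as follows (a common convention, assuming the application honours OMP_NUM_THREADS):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK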
Slurm release 22 and newer
The number of CPUs per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.
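The two equivalent approaches could be sketched inside a batch script as follows (the executable name is a placeholder):

#SBATCH --cpus-per-task=18

# option 1: repeat the request on the srun command line
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_app.x

# option 2: export the environment variable read by srun (Slurm 22.05 and newer)
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./my_app.x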
Large jobs
The example below shows the commands required to run large MPI jobs:
- The environment variable PMI_MMAP_SYNC_WAIT_TIME and the srun option --wait prevent Slurm from killing tasks that take a long time to run
- The srun option --bcast copies the binary to /tmp on all nodes before launching the tasks, which helps the task startup time
export PMI_MMAP_SYNC_WAIT_TIME=300
srun --wait 200 --bcast=/tmp/hello-world $HOME/jobs/bin/daint/hello-world
Synoptic table
The table below reports the corresponding options of the former Cray command aprun and the current Slurm command srun:
| Option | aprun | srun |
|---|---|---|
| MPI tasks | -n | -n, --ntasks |
| MPI tasks per node | -N | --ntasks-per-node |
| CPUs per task | -d | -c, --cpus-per-task |
| Thread/task affinity | -cc cpu | --cpu_bind=rank |
| Large memory nodes | -q bigmem | --mem=120GB |
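As an illustrative translation based on the table above, an ALPS launch line and its native Slurm counterpart could look as follows (task counts and executable name are placeholders):

# former ALPS syntax
aprun -n 48 -N 12 -d 2 ./myprogram.x
# equivalent native Slurm syntax
srun --ntasks=48 --ntasks-per-node=12 --cpus-per-task=2 ./myprogram.x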