All CSCS systems use the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:

  • sbatch - submit a batch script
  • squeue - check the status of jobs on the system
  • scancel - delete one of your jobs from the queue

Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs stored in the Slurm database can be inspected with the sacct command: please have a look at man sacct for more information.
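
For example, the following commands list the available partitions and inspect your recent jobs stored in the Slurm database (the sacct format fields shown here are just one possible selection):

sinfo
scontrol show partition
sacct --format=JobID,JobName,Partition,Elapsed,State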

A Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning: these directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS). The batch script is then submitted to the batch system with the sbatch command. Please remember that a valid account must be provided using the Slurm option --account=<project> or -A <project>, where the string <project> is the ID of the active project that will be charged for the allocation. Please note that the selected project is charged per node usage on most systems, whether you submit batch jobs with sbatch or allocate interactive sessions on compute nodes with salloc or by using the srun command directly: please have a look at the corresponding man pages for more details. Interactive allocations (sessions started with salloc or srun) are meant for debugging purposes only and have a limited wall time duration.

A basic Slurm batch script can be written by adding just the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
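
For instance, a minimal batch script could look as follows (the executable name is a placeholder):

#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --account=<project>

srun ./myprogram.x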

Output and Error

By default, the output of your script is written to the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job, and the standard error to a file called slurm-<SLURM_JOB_ID>.err: both files are created in the directory from which you submitted the job.

Note that with Slurm the output file is created as soon as your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore do not delete this file while your job is running, or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: this means that the output of your job will not appear in the output files immediately. If you want to override this behaviour, you should pass the option -u or --unbuffered to the srun command: the output will then appear in the file as soon as it is produced.
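
For example, within a batch script (the executable name is a placeholder):

# disable output buffering, so that results appear in the output file immediately
srun --unbuffered ./myprogram.x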

If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below:

#!/bin/bash -l
#
#SBATCH --job-name="hello_world_mpi"
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=18
#SBATCH --output=hello_world_mpi.%j.o
#SBATCH --error=hello_world_mpi.%j.e
#SBATCH --account=<project>

srun ./hello_world_mpi.x

The string <project> is the ID of the active project that will be charged for the allocation.
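
Assuming the script above is saved as hello_world_mpi.sbatch (the file name is arbitrary), a typical workflow to submit, monitor and, if needed, cancel the job is:

sbatch hello_world_mpi.sbatch   # submit the job; the job ID is printed on submission
squeue -u $USER                 # check the status of your jobs
scancel <jobid>                 # remove the job from the queue, if needed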

Slurm best practices on CSCS Cray systems

Cray systems at CSCS are equipped with native Slurm: the Cray command aprun has been replaced by the Slurm command srun. The migration from Cray ALPS (Application Level Placement Scheduler) to native Slurm is supported by the simple examples below, covering the most common usage with MPI and hybrid MPI/OpenMP jobs. The Slurm man pages (e.g. man sbatch) and the Slurm documentation give useful information and more details on specific options.

Please note that the Slurm scheduling system is a shared resource that can handle only a limited number of batch jobs and interactive commands simultaneously. Users are therefore not supposed to submit large numbers of Slurm jobs and commands at the same time, as doing so would infringe our fair-share usage policy. If your production workflow requires you to submit a large number of jobs to the queueing system concurrently, please contact us beforehand. You might consider using the GREASY meta-scheduler approach to manage the execution of several independent parallel simulations.

Advanced users might also be interested in the presentations available online from the Slurm User Group meeting, covering the features of the latest Slurm release.

#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=gpu
#SBATCH --time=00:30:00
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK
srun --cpu_bind=rank ./myprogram.x

The example above shows an MPI job allocated on four nodes using hyperthreading on Piz Daint. You need the Slurm option --constraint=gpu to run on the XC50 nodes, each featuring a GPU accelerator. If you select the XC40 instead with the option --constraint=mc, the same srun options apply, but you will need to adjust the number of tasks to the cores of the Broadwell compute node, which features two sockets with eighteen cores each (a possible adjustment is sketched below). The flag -l at the beginning allows you to call the module command within the script, in case you need it.
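
The following sketch adjusts the example above to the XC40 multicore nodes, keeping two hyperthreads per core (36 physical cores per node, hence 72 virtual cores with hyperthreading; the executable name is a placeholder):

#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks=144
#SBATCH --ntasks-per-node=36
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=mc
#SBATCH --time=00:30:00
#SBATCH --account=<project>

srun --cpu_bind=rank ./myprogram.x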

#!/bin/bash -l
#
#SBATCH --time=00:30:00
#SBATCH --nodes=4
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=18
#SBATCH --constraint=mc
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK

# make the OpenMP runtime spawn one thread per CPU allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --hint=nomultithread ./test.mpi

The Slurm script above shows how to run a hybrid job with two MPI tasks per node, each spawning eighteen OpenMP threads (one socket per task) on a two-socket Broadwell compute node. The srun option --hint=nomultithread avoids using the extra hardware threads provided by in-core multi-threading, a configuration that can benefit communication-intensive applications (see man srun for further details).

Slurm release 22 and newer (not affecting Piz Daint)

The number of CPUs per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.
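
On systems running Slurm 22 or newer, either of the following lines in the batch script achieves this (a minimal sketch; the executable name is a placeholder):

# option 1: pass the value explicitly to srun
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./myprogram.x

# option 2: export the environment variable read by srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./myprogram.x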

Electrical groups

The Dragonfly network is constructed from two-cabinet electrical groups with 384 nodes per group on Piz Daint: applications that can run on a single group might experience a slowdown if the Slurm scheduler does not allocate the nodes on the same electrical group.

Users can prevent this situation by adding the Slurm directive #SBATCH --switches=1 to their batch script, which defines the maximum count of switches desired for the job allocation. We strongly advise also setting a maximum waiting time for the selected number of switches: e.g. #SBATCH --switches=1@01:00:00 will make the scheduler wait for at most one hour before ignoring the switches request (see the sketch below). In fact, if Slurm can only find an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired switch count or the time limit expires. Please note that the default maximum waiting time is currently 24 hours, see scontrol show config | grep max_switch_wait. Please have a look at man sbatch for more information.
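
A minimal sketch of the relevant lines in a batch script (the one-hour waiting time is just an example):

# request that all nodes are placed within a single electrical group,
# waiting at most one hour before the request is dropped
#SBATCH --switches=1@01:00:00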

Large jobs

The example below shows the commands required to run large MPI jobs:

  • the environment variable PMI_MMAP_SYNC_WAIT_TIME and the srun option --wait prevent Slurm from killing tasks that take a long time to run
  • the srun option --bcast copies the binary to /tmp on all nodes before launching the tasks, which improves task startup time

export PMI_MMAP_SYNC_WAIT_TIME=300
srun --wait 200 --bcast=/tmp/hello-world $HOME/jobs/bin/daint/hello-world

Batch Job Summary Report

A summary report is appended to the Slurm output file at the end of any batch job (interactive jobs do not create Slurm output files, so no report is produced for them). The information displayed in the summary report is explained in detail below.

Header

The report header displays the Job Name and the JobId, together with the name of the computing system:

Batch Job Summary Report (version 21.01.1) for Job "rfm_Cp2kGpuCheck_large_prod_job" (30212936) on daint

Timeline

  • Submit : Job submission time
  • Eligible : Job eligible to run (after dependencies or other constraints are resolved)
  • Start : Job start time
  • End : Job end time
  • Elapsed : Wall time duration
  • TimeLimit : Job time limit as requested at submission

Job information (1/3)
-----------------------------------------------------------------------------------------------------
             Submit            Eligible               Start                 End    Elapsed Time limit
------------------- ------------------- ------------------- ------------------- ---------- ----------
2021-03-30T16:58:52 2021-03-30T16:58:52 2021-03-30T17:01:25 2021-03-30T17:04:20   00:02:55   00:10:00
-----------------------------------------------------------------------------------------------------
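
The same timeline can also be retrieved after the job has finished with the sacct command, for instance (using the job ID shown in the report header):

sacct -j 30212936 --format=Submit,Eligible,Start,End,Elapsed,Timelimit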

Job details

  • Username : Username of the job owner
  • Account : Account with resources for the job
  • Partition : batch system partition used
  • NNodes : Number of nodes allocated
  • Energy : Estimated energy consumption

Job information (2/3)
-------------------------------------------------------------
    Username      Account    Partition   NNodes        Energy
------------ ------------ ------------ -------- -------------
     lucamar      csstaff       normal       16    436.243 kJ

GPU statistics

For jobs using GPUs, a report on the GPU usage is also included.

Job information (3/3) - GPU utilization data
----------------------------------------------------
   Node name       Usage      Max mem Execution time
------------ ----------- ------------ --------------
    nid05765        33 %     1245 MiB       00:02:33
    nid06059        34 %     1245 MiB       00:02:33
    nid05766        33 %     1249 MiB       00:02:33
    nid05769        33 %     1243 MiB       00:02:33
    nid06055        34 %     1245 MiB       00:02:33
    nid06062        34 %     1245 MiB       00:02:33
    nid05770        34 %     1245 MiB       00:02:33
    nid06057        34 %     1251 MiB       00:02:33
    nid06056        33 %     1245 MiB       00:02:33
    nid05768        33 %     1247 MiB       00:02:33
    nid06061        34 %     1251 MiB       00:02:33
    nid06058        33 %     1245 MiB       00:02:33
    nid06060        34 %     1245 MiB       00:02:33
    nid05771        33 %     1245 MiB       00:02:33
    nid05767        33 %     1243 MiB       00:02:33
    nid05764        33 %     1247 MiB       00:02:33

  • Usage : Percentage of time over the process's lifetime during which one or more kernels were executing on the GPU
  • Max mem : Maximum total memory (in MiB) that was ever allocated by the process
  • Execution time : Amount of time during which the compute context was active

Please note that the batch job summary report may not contain the GPU utilisation data for all nodes.

Synoptic table

The table below maps the options of the former Cray command aprun to the corresponding options of the Slurm command srun:

Option                   aprun        srun
---------------------    ---------    --------------------
MPI tasks                -n           -n, --ntasks
MPI tasks per node       -N           --ntasks-per-node
CPUs per task            -d           -c, --cpus-per-task
Thread/task affinity     -cc cpu      --cpu_bind=rank
Large memory nodes       -q bigmem    --mem=120GB
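
As an illustration, a typical ALPS launch line and a possible native Slurm equivalent (the executable name is a placeholder):

# former ALPS syntax
aprun -n 48 -N 12 -d 2 ./myprogram.x

# corresponding native Slurm syntax
srun --ntasks=48 --ntasks-per-node=12 --cpus-per-task=2 ./myprogram.x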