All CSCS systems use the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:
- sbatch : submit a batch script
- squeue : check the status of jobs on the system
- scancel : delete one of your jobs from the queue
Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs stored in the Slurm database can be inspected with the sacct command: please have a look at man sacct for more information.
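For instance, a quick look at the partitions and at your own recent jobs might use the commands sketched below; the sacct field selection and the one-day time window are only an illustration, not a CSCS recommendation:

# list the partitions and show the full configuration of the queues
sinfo
scontrol show partition

# inspect your jobs of the last 24 hours stored in the Slurm database
sacct --starttime=$(date -d '1 day ago' +%Y-%m-%d) \
      --format=JobID,JobName,Partition,State,Elapsed,ExitCode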
An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning: these directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS). A suitable batch script is then submitted to the batch system using the sbatch command. Please remember that a valid account must be provided using the Slurm option --account=<project> or -A <project>, where the string <project> is the ID of the active project that will be charged for the allocation. Please note that the selected project will be charged per node usage on most systems, whether you submit batch jobs with sbatch, allocate interactive sessions on compute nodes with salloc, or use the srun command directly: please have a look at the corresponding man pages for more details. Interactive allocations (sessions started with salloc or srun) are meant for debugging purposes only and have a limited wall time duration.
A basic Slurm batch script can be written by just adding the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
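For instance, a minimal batch script might look like the sketch below, where my_app.x and the project ID are placeholders to be replaced with your own executable and account:

#!/bin/bash -l
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --account=<project>

srun ./my_app.x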
Output and Error
The output of your script will be put by default into the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err: both files will be found in the directory from which you launched the job.
Note that with Slurm the output file is created as soon as your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore do not delete this file while your job is running, or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: this means that the output of your job will not appear in the output files immediately. If you want to override this behaviour, you should pass the option -u or --unbuffered to the srun command: the output will then appear in the file as soon as it is produced.
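Inside a batch script this simply amounts to a launch line like the following (my_app.x is a placeholder):

srun --unbuffered ./my_app.x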
If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below:
#!/bin/bash -l
#
#SBATCH --job-name="hello_world_mpi"
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=18
#SBATCH --output=hello_world_mpi.%j.o
#SBATCH --error=hello_world_mpi.%j.e
#SBATCH --account=<project>

srun ./hello_world_mpi.x
The string <project> is the ID of the active project that will be charged for the allocation; the pattern %j in the --output and --error file names is replaced by the job ID.
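Assuming the script above is saved as hello_world_mpi.sbatch (the file name here is just a placeholder), it can be submitted and monitored with the basic commands listed at the top of this page:

sbatch hello_world_mpi.sbatch   # submit the batch script
squeue -u $USER                 # check the status of your jobs
scancel <jobid>                 # remove the job from the queue, if needed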
Slurm best practices on CSCS Cray systems
Cray systems at CSCS are equipped with Slurm: the Cray command aprun has been replaced by the Slurm command srun. The migration from Cray ALPS (Application Level Placement Scheduler) to native Slurm is supported by the simple examples below, covering the most common usage with MPI and hybrid MPI/OpenMP jobs. The Slurm man pages (e.g. man sbatch) will give useful information and more details on specific options, along with the Slurm documentation.
Please note that the Slurm scheduling system is a shared resource that can handle only a limited number of batch jobs and interactive commands simultaneously. Therefore users are not supposed to submit arbitrary numbers of Slurm jobs and commands at the same time, as doing so would infringe our fair share usage policy. If your production workflow requires you to submit a large number of jobs to the queueing system concurrently, please contact us beforehand. You might consider using the GREASY meta-scheduler approach to manage the execution of several independent parallel simulations.
Advanced users might also be interested in consulting the presentations available online from the Slurm User Group meeting, covering the features of the latest Slurm release.
#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=gpu
#SBATCH --time=00:30:00
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK

srun --cpu_bind=rank ./myprogram.x
The example above shows an MPI job allocated on four nodes using hyperthreading on Piz Daint. You need the Slurm option --constraint=gpu to run on the XC50, which has a GPU accelerator on each node. If you select the XC40 with the alternative option --constraint=mc, the same srun options apply, but you will need to adjust the task counts to the cores of the Broadwell compute node, which features two sockets with eighteen cores each. The flag -l on the first line allows you to call the module command within the script, in case you need it.
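As an illustration only (not part of the original documentation), an adaptation of the example above to the multicore nodes might look like the following sketch, where the task counts simply fill the 36 physical cores per node with two hyperthreads each and myprogram.x remains a placeholder:

#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks=144
#SBATCH --ntasks-per-node=36
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=2
#SBATCH --constraint=mc
#SBATCH --time=00:30:00
#SBATCH --account=<project>

srun --cpu_bind=rank ./myprogram.x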
#!/bin/bash -l
#
#SBATCH --time=00:30:00
#SBATCH --nodes=4
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=18
#SBATCH --constraint=mc
#SBATCH --account=<project>

# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK

srun --hint=nomultithread ./test.mpi
The Slurm script above shows how to run a hybrid job with two MPI tasks per node, spawning eighteen threads per socket on a two-socket Broadwell compute node. The srun option --hint=nomultithread avoids using extra threads with in-core multi-threading, a configuration that can benefit communication-intensive applications (see man srun for further details).
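Note, as an addition to the example above and not part of the original script, that an OpenMP application will usually also need the thread count to be set explicitly before the srun line, for instance:

# assumption: the application uses OpenMP and honours OMP_NUM_THREADS
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --hint=nomultithread ./test.mpi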
Slurm release 22 and newer (not affecting Piz Daint)
The number of CPUs per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.
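Inside a batch script both approaches might look like the sketch below (my_app.x is a placeholder):

# option 1: repeat the request on the srun command line
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_app.x

# option 2: export the srun input environment variable once, then call srun as usual
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./my_app.x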
Electrical groups
The Dragonfly network is constructed from two-cabinet electrical groups with 384 nodes per group on Piz Daint: applications that can run on a single group might experience a slowdown if the Slurm scheduler does not allocate the nodes on the same electrical group.
Users might prevent this situation by adding the Slurm flag #SBATCH --switches=1 to their batch script, which defines the maximum count of switches desired for the job allocation. We strongly advise you to also set a maximum waiting time for the selected number of switches, e.g. #SBATCH --switches=1@01:00:00 will make the scheduler wait for at most one hour before ignoring the switches request. In fact, if Slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired switch count or the time limit expires. Please note that the default maximum waiting time is currently 24 hours, see scontrol show config | grep max_switch_wait. Please have a look at man sbatch for more information.
Large jobs
The example below shows the commands required to run large MPI jobs:
- PMI_MMAP_SYNC_WAIT_TIME and the srun option --wait prevent Slurm from killing tasks that take a long time to run
- the srun option --bcast copies the binary to /tmp on all nodes before launching them, which helps task startup time
export PMI_MMAP_SYNC_WAIT_TIME=300
srun --wait 200 --bcast=/tmp/hello-world $HOME/jobs/bin/daint/hello-world
Batch Job Summary Report
A summary report will be appended at the end of the output file of any batch job that creates output files (interactive jobs do not create Slurm output files). We explain below in detail the information displayed in the summary report.
Header
The report header displays the Job Name and the JobId, together with the name of the computing system:
Batch Job Summary Report (version 21.01.1) for Job "rfm_Cp2kGpuCheck_large_prod_job" (30212936) on daint
Timeline
- Submit : Job submission time
- Eligible : Job eligible to run (after dependencies or other constraints are resolved)
- Start : Job start time
- End : Job end time
- Elapsed : Wall time duration
- TimeLimit : Job time limit as requested at submission
Job information (1/3)
-----------------------------------------------------------------------------------------------------
             Submit            Eligible               Start                 End    Elapsed Time limit
------------------- ------------------- ------------------- ------------------- ---------- ----------
2021-03-30T16:58:52 2021-03-30T16:58:52 2021-03-30T17:01:25 2021-03-30T17:04:20   00:02:55   00:10:00
-----------------------------------------------------------------------------------------------------
Job details
- Username : Username of the job owner
- Account : Account with resources for the job
- Partition : Batch system partition used
- NNodes : Number of nodes allocated
- Energy : Estimated energy consumption
Job information (2/3)
-------------------------------------------------------------
    Username      Account    Partition   NNodes        Energy
------------ ------------ ------------ -------- -------------
     lucamar      csstaff       normal       16    436.243 kJ
-------------------------------------------------------------
GPU statistics
For jobs using GPUs, a report on the GPU usage is enclosed.
Job information (3/3) - GPU utilization data
----------------------------------------------------
   Node name       Usage      Max mem Execution time
------------ ----------- ------------ --------------
    nid05765        33 %     1245 MiB       00:02:33
    nid06059        34 %     1245 MiB       00:02:33
    nid05766        33 %     1249 MiB       00:02:33
    nid05769        33 %     1243 MiB       00:02:33
    nid06055        34 %     1245 MiB       00:02:33
    nid06062        34 %     1245 MiB       00:02:33
    nid05770        34 %     1245 MiB       00:02:33
    nid06057        34 %     1251 MiB       00:02:33
    nid06056        33 %     1245 MiB       00:02:33
    nid05768        33 %     1247 MiB       00:02:33
    nid06061        34 %     1251 MiB       00:02:33
    nid06058        33 %     1245 MiB       00:02:33
    nid06060        34 %     1245 MiB       00:02:33
    nid05771        33 %     1245 MiB       00:02:33
    nid05767        33 %     1243 MiB       00:02:33
    nid05764        33 %     1247 MiB       00:02:33
- Usage : Percent of time over the process's lifetime during which one or more kernels was executing on the GPU
- Max mem : Maximum total memory (MiB) that was ever allocated by the process
- Execution time : Amount of time during which the compute context was active
Synoptic table
The table below reports the corresponding options of the former Cray command aprun and the current Slurm command srun:
| Option | aprun | srun |
| --- | --- | --- |
| MPI tasks | -n | -n, --ntasks |
| MPI tasks per node | -N | --ntasks-per-node |
| CPUs per task | -d | -c, --cpus-per-task |
| Thread/task affinity | -cc cpu | --cpu_bind=rank |
| Large memory nodes | -q bigmem | --mem=120GB |
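As a worked illustration, reusing the task counts of the pure MPI example earlier on this page (myprogram.x is a placeholder), a former ALPS launch line and its native Slurm counterpart might look like this:

# former ALPS syntax
aprun -n 48 -N 12 -d 2 -cc cpu ./myprogram.x

# native Slurm equivalent
srun --ntasks=48 --ntasks-per-node=12 --cpus-per-task=2 --cpu_bind=rank ./myprogram.x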