CSCS provides the GREASY meta scheduler to manage high-throughput simulations on Piz Daint. GREASY was developed by the Barcelona Supercomputing Center to simplify the execution of embarrassingly parallel simulations in any environment.

It is primarily designed to run serial applications, but the custom version installed on Piz Daint can handle serial, MPI, MPI+OpenMP and OpenMP-only applications. Please note that this is a modified version of GREASY and contains features that are not present in the original. Although every functionality of the original version is also available in this one, there are differences in the way the greasy command is run. For instance, on Piz Daint there is an additional environment variable (GREASY_NWORKERS_PER_NODE) that controls the number of workers per node, which in turn determines the number of MPI ranks assigned to each worker.

GREASY on Eiger

GREASY might not work properly on the Cray EX supercomputer due to an issue with the Slurm workload manager. The issue has been reported to HPE/Cray support, but unfortunately there is no workaround yet.

Licensing Terms and Conditions

GREASY is free software released under version 3 of the GNU General Public License (GPL-3).

Setup

You can see the list of available versions of the program installed on the machine after loading the gpu (daint-gpu) or multicore (daint-mc) modulefile. In the examples below we use the daint-gpu modulefile:

module load daint-gpu 
module avail GREASY

The following module command will load the environment of the default version of the program:

module load GREASY

You can either type this command every time you intend to use the program within a new session, or you can load it automatically by adding it to your shell configuration file, as sketched after the following commands. The module commands below print the environment variables set by loading the program and a help message:

module show GREASY 
module help GREASY
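
A minimal sketch of the shell-configuration approach mentioned above, assuming a bash login shell that reads $HOME/.bashrc (adapt this to your own shell and configuration file):

# append the module commands to the shell configuration file (bash assumed)
echo "module load daint-gpu" >> $HOME/.bashrc
echo "module load GREASY"    >> $HOME/.bashrc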

GREASY at CSCS requires password-less logins: please follow the steps described in password-less access and compute nodes access.

How to Run on Piz Daint

To submit a GREASY job, you need a Slurm submission script and a tasks file for GREASY. To run 4 serial tasks in parallel with GREASY, for instance, we write the tasks file with one task per line, like this:

[@ /path/to/folder_task1/ @] executable1.x
[@ /path/to/folder_task2/ @] executable2.x
[@ /path/to/folder_task3/ @] executable3.x
[@ /path/to/folder_task4/ @] executable4.x

In the custom version of GREASY on Piz Daint, [@ /path/to/folder_task/ @] is used to specify the folder in which the task will be executed. Note that if the path is not provided, all tasks will be executed in the submission folder.
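
As an illustration, a tasks file like the one above can be generated with a short shell loop; this is only a sketch, assuming the folder and executable names used in the example:

# sketch: write one GREASY task per line into tasks.txt
rm -f tasks.txt
for i in 1 2 3 4; do
    echo "[@ /path/to/folder_task${i}/ @] executable${i}.x" >> tasks.txt
done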

Serial tasks

On Piz Daint, GREASY runs 1 process per task by default. The 4 serial tasks above can be run in parallel with a Slurm submission script like the following:

#!/bin/bash -l
#SBATCH --job-name=greasy_serial_tasks
#SBATCH --account="<project>"
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=normal
#SBATCH --constraint=gpu
#SBATCH --gres=gpu:0,craynetwork:4

module load daint-gpu
module load GREASY

export CRAY_CUDA_MPS=1
export CUDA_VISIBLE_DEVICES=0
export GPU_DEVICE_ORDINAL=0

greasy tasks.txt

Please replace the <project> string with the ID of the active project that will be charged for the allocation. In this example, setting --ntasks=4 in the submission script is equivalent to --ntasks-per-node=4. Notice that there is no srun before the greasy command: srun is called internally by GREASY.

The additional #SBATCH --gres=gpu:0,craynetwork:4 option in the job script is currently needed for a workaround to allow the GREASY workers in the same node to share the GPU. This is not needed for jobs running in the mc partition. Moreover, the environment variables CRAY_CUDA_MPS, CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL should be set to the values shown in the example.
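
Assuming the script above is saved, for example, as greasy_serial.sbatch (the name is arbitrary) and the tasks file as tasks.txt in the submission folder, the job is submitted as usual with sbatch:

sbatch greasy_serial.sbatch
squeue --user=$USER    # check the status of the job in the queue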

With this Slurm submission script we request 1 node and 4 CPUs, one per task. When the job is submitted, GREASY checks the requested resources and runs one process per task. If the requested resources are not enough to fit all tasks, GREASY puts the tasks that do not fit on hold and schedules them as soon as resources become free again: for instance, with the same 4-worker allocation and a tasks file containing 8 entries, tasks 5 to 8 would start one by one as the first four complete. Conversely, if we request more resources than the number of serial tasks, some CPUs will remain unused. We can make sure that we are properly distributing the tasks among the resources by checking the GREASY log file.

The GREASY log file is named greasy-${SLURM_JOB_ID}.log and contains useful information about the execution of the tasks. A line such as [2017-10-30 10:22:00] INFO: Resource Utilization: 50% refers to GREASY's scheduling overhead: as soon as each task lasts longer than a few seconds, this figure should be very close to 100%. It does not refer to the percentage of the allocated resources actually used, which would be 33% in the example above, since only 4 out of 12 CPUs were used.

If we look for the string Allocating in the log file, we can see the time stamps when each task started. In this example, we can check that indeed all tasks started at the same time:

[2017-10-30 10:21:48] INFO: Allocating task 1
[2017-10-30 10:21:48] INFO: Allocating task 2
[2017-10-30 10:21:48] INFO: Allocating task 3
[2017-10-30 10:21:48] INFO: Allocating task 4
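
For instance, these lines can be extracted from the log with grep (a sketch, with <jobid> standing for the Slurm job ID):

grep "Allocating"           greasy-<jobid>.log
grep "Resource Utilization" greasy-<jobid>.log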

One of the main advantages here is that many independent tasks can be run within a single job, i.e. with only one call to sbatch.

Multithreading with OpenMP

Each node in the GPU partition of Piz Daint has 12 CPUs. If the 4 tasks defined above benefit from multithreading, we could run them on one node using 3 CPUs each instead of only 1. To do that, we only need to set --cpus-per-task=3 in the submission script above and define the environment variables OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK and GREASY_NWORKERS_PER_NODE=$SLURM_NTASKS_PER_NODE.

The following submission script accomplishes that:

#!/bin/bash -l
#SBATCH --job-name=greasy_multithreading
#SBATCH --account="<project>"
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --partition=normal
#SBATCH --constraint=gpu
#SBATCH --gres=gpu:0,craynetwork:4

module load daint-gpu
module load GREASY

export CRAY_CUDA_MPS=1
export CUDA_VISIBLE_DEVICES=0
export GPU_DEVICE_ORDINAL=0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export GREASY_NWORKERS_PER_NODE=$SLURM_NTASKS_PER_NODE

greasy tasks.txt

Again, it is important to pay attention to the correspondence between the number of GREASY tasks and the allocated resources: nodes and tasks can be combined in different ways according to the requirements of the jobs. In this case, the number of GREASY workers per node is the same as the number of tasks per node, meaning that each GREASY task has a single MPI rank available, which can spawn 3 OpenMP threads (--cpus-per-task=3). We could also use, for instance, --nodes=8, --ntasks-per-node=2 and --cpus-per-task=6 if we wanted to run 2 tasks on each node with 6 OpenMP threads each, as sketched below.
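
A sketch of that alternative layout, showing only the lines that would differ from the submission script above (the remaining #SBATCH options and exports stay the same):

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6

# 2 GREASY workers per node, each running one task with 6 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export GREASY_NWORKERS_PER_NODE=$SLURM_NTASKS_PER_NODE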

MPI and MPI+OpenMP

Each node in the GPU partition of Piz Daint has 12 CPUs. If the 4 tasks defined above benefit from both MPI and multithreading, we could run them on one node using 2 MPI ranks with 3 CPUs each per task. To do that, we only need to set --cpus-per-task=3 in the submission script above and define the environment variables OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK and GREASY_NWORKERS_PER_NODE=2.

The following submission script accomplishes that:

#!/bin/bash -l
#SBATCH --job-name=greasy_mpi
#SBATCH --account="<project>"
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --partition=normal
#SBATCH --constraint=gpu
#SBATCH --gres=gpu:0,craynetwork:4

module load daint-gpu
module load GREASY

export CRAY_CUDA_MPS=1
export CUDA_VISIBLE_DEVICES=0
export GPU_DEVICE_ORDINAL=0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export GREASY_NWORKERS_PER_NODE=2

greasy tasks.txt

Again, it is important to pay attention to the correspondence between the number of GREASY tasks and the allocated resources: nodes and tasks can be combined in different ways according to the requirements of the jobs. In this case, the number of GREASY workers per node is half the number of tasks per node, which means that each GREASY task has two MPI ranks available, each able to spawn 3 OpenMP threads (--cpus-per-task=3). We could also use, for instance, --nodes=8, --ntasks-per-node=2, --cpus-per-task=6 and export GREASY_NWORKERS_PER_NODE=1 if we wanted to run 1 task on each node with 2 MPI ranks and 6 OpenMP threads each, as sketched below. Please note, however, that GREASY does not support tasks spanning more than one node, i.e. all MPI ranks of a given GREASY task must be on the same node.
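
A sketch of that alternative layout, showing only the lines that would differ from the submission script above (the remaining #SBATCH options and exports stay the same):

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# 1 worker per node: each GREASY task gets the 2 MPI ranks of its node,
# each rank spawning 6 OpenMP threads
export GREASY_NWORKERS_PER_NODE=1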

Further Documentation

GREASY on GitHub

GREASY user guide