A batch job summary report is often requested in project proposals at CSCS to demonstrate the effective use of GPUs. On previous systems, such as Piz Daint, a report was automatically included in the Slurm output file. This mechanism is not currently in place on ALPS, so we provide a separate tool, jobreport, to generate the report.
jobreport is used in two stages. The first stage monitors an application and records the GPU usage statistics; this monitoring stage must be executed within a Slurm environment. The information is recorded as .csv data within a directory jobreport_${SLURM_JOB_ID}, or a directory supplied on the command line. The second stage prints this information in tabular form that can be inserted into a project proposal.
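A minimal sketch of the two stages (the path to the jobreport binary, ./my_app, and <account> are placeholders; complete examples are given below):

$ srun -A <account> ./jobreport -- ./my_app
$ ./jobreport print jobreport_<SLURM_JOB_ID>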
Downloading the Executable
A precompiled binary of the jobreport utility can be obtained directly from the GitHub releases page
https://github.com/eth-cscs/alps-jobreport/releases
or via the command line:
$ wget https://github.com/eth-cscs/alps-jobreport/releases/download/v0.1/jobreport
$ chmod +x ./jobreport
Command Line Options
A full list of command line options with explanations can be found using jobreport --help.
As of v0.1 the command line options are:
Usage: jobreport [-v -h] [subcommand] -- COMMAND

Options:
    -h, --help     Show this help message
    -v, --version  Show version information

Subcommands:
    monitor        Monitor the performance metrics for a job. (Default)
        -h, --help                     Shows help message
        -o, --output <path>            Specify output directory (default: ./jobreport_<SLURM_JOB_ID>)
        -u, --sampling_time <seconds>  Set the time between samples (default: automatically determined)
        -t, --max_time <time>          Set the maximum monitoring time (format: DD-HH:MM:SS, default: 24:00:00)
    print          Print a job report
        -h, --help           Shows help message
        -o, --output <path>  Output path for the report file
    container-hook Write enroot hook for jobreport
        -h, --help           Shows help message
        -o, --output <path>  Output path for the enroot hook file (default: $HOME/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh)

Arguments:
    COMMAND        The command to run as the workload
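For example, the monitor options can be combined to choose a custom output directory and sampling interval; here ./my_app, <account>, and the 2-second sampling interval are placeholders chosen for illustration:

$ srun -A <account> -t 5:00 --nodes=1 ./jobreport -o my_report -u 2 -- ./my_app
$ ./jobreport print my_report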
Reported Information
The final output from jobreport is a table summarizing the most important details of how your application used the compute resources during its execution. The report is divided into two parts: a general summary and GPU specific values.
Summary of Job Statistics
- Job Id - The Slurm job id.
- Step Id - The Slurm step id. A job step in Slurm is a subdivision of a job started with srun.
- User - The user account that submitted the job.
- SLURM Account - The project account that will be billed.
- Start Time, End Time, Elapsed Time - The time the job started and ended, and how long it ran.
- Number of Nodes - The number of nodes allocated to the job.
- Number of GPUs - The number of GPUs allocated to the job.
- Total Energy Consumed - The total energy consumed, computed from the average power usage (below) over the elapsed time (a short worked example follows this list).
- Average Power Usage - The average power draw over the elapsed time in Watts (W), summed over all GPUs.
- Average SM Utilization - The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel, averaged over all GPUs.
- Average Memory Utilization - The percentage of a process's lifetime during which global (device) memory was being read or written, averaged over all GPUs.
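As a short worked example, the sleep job shown later on this page reports an average power usage of 348.8 W over an elapsed time of 5 s, giving roughly 348.8 W × 5 s / 3600 s/h ≈ 0.48 Wh, which appears in the report as a Total Energy Consumed of 0.5 Wh after rounding.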
GPU Specific Values
- Host - The compute node executing a job step.
- GPU - The GPU id on a node.
- Elapsed - The elapsed time.
- SM Utilization % - The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel.
- Memory Utilization % - The percentage of the process's lifetime during which global (device) memory was being read or written.
Examples
In the examples we use the following definitions:
- ${JOBREPORT} is the path of the jobreport executable, which you are free to choose.
- ${SRUN} is set to srun -A <account>.
- ${SBATCH} is set to sbatch -A <account>.
- ${COMMAND} is set to sleep 5, which won't use the GPU but will show the basic functionality of the tool.
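For example, these definitions could be set in the shell as follows; replace <account> with your project account and adjust the path to wherever you placed the jobreport binary:

$ export JOBREPORT=$HOME/jobreport
$ export SRUN="srun -A <account>"
$ export SBATCH="sbatch -A <account>"
$ export COMMAND="sleep 5"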
The simplest example to test jobreport is to run it with the sleep command. It is important to separate ${JOBREPORT} (and its options) from ${COMMAND} with --.
$ ${SRUN} -t 5:00 --nodes=1 ${JOBREPORT} -- ${COMMAND}
$ ls
jobreport_16133
$ ${JOBREPORT} print jobreport_16133

Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| Job Id                                  | 16133                                   |
+-----------------------------------------+-----------------------------------------+
| Step Id                                 | 0                                       |
+-----------------------------------------+-----------------------------------------+
| User                                    | jpcoles                                 |
+-----------------------------------------+-----------------------------------------+
| SLURM Account                           | unknown_account                         |
+-----------------------------------------+-----------------------------------------+
| Start Time                              | 03-07-2024 15:32:24                     |
+-----------------------------------------+-----------------------------------------+
| End Time                                | 03-07-2024 15:32:29                     |
+-----------------------------------------+-----------------------------------------+
| Elapsed Time                            | 5s                                      |
+-----------------------------------------+-----------------------------------------+
| Number of Nodes                         | 1                                       |
+-----------------------------------------+-----------------------------------------+
| Number of GPUs                          | 4                                       |
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | 0.5 Wh                                  |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | 348.8 W                                 |
+-----------------------------------------+-----------------------------------------+
| Average SM Utilization                  | 0%                                      |
+-----------------------------------------+-----------------------------------------+
| Average Memory Utilization              | 0%                                      |
+-----------------------------------------+-----------------------------------------+

GPU Specific Values
+---------------+------+------------------+------------------+----------------------+
| Host          | GPU  | Elapsed          | SM Utilization % | Memory Utilization % |
|               |      |                  | (avg/min/max)    | (avg/min/max)        |
+---------------+------+------------------+------------------+----------------------+
| nid006212     | 0    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 1    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 2    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 3    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
+---------------+------+------------------+------------------+----------------------+
slurm sbatch
The simplest example of running jobreport in a Slurm job submission script is shown below. The report printing can also be included directly in the script and does not need the srun command.
$ cat submit.sh
#!/bin/bash
#SBATCH -t 5:00
#SBATCH --nodes=2

srun ${JOBREPORT} -o report -- ${COMMAND}
${JOBREPORT} print report
$ ${SBATCH} submit.sh
When used within a job script, jobreport will work across multiple calls to srun. Each time srun is called, Slurm creates a new job step, and jobreport records data for each one. Multiple job steps running simultaneously are also allowed. The generated job report contains a section for each Slurm job step.
$ cat submit.sh
#!/bin/bash
#SBATCH -t 5:00
#SBATCH --nodes=2

srun ${JOBREPORT} -o report -- ${COMMAND}
srun ${JOBREPORT} -o report -- ${COMMAND}
srun --nodes=1 ${JOBREPORT} -o report -- ${COMMAND} &
srun --nodes=1 ${JOBREPORT} -o report -- ${COMMAND} &
wait
$ ${SBATCH} submit.sh
UENV
jobreport is agnostic to the loaded uenv.

The following example runs a program called burn that computes repeated matrix multiplications to stress the GPUs. It was built with, and must be run under, the uenv image prgenv-gnu/24.2:v1.
$ ${SRUN} --uenv=prgenv-gnu/24.2:v1 -t 5:00 --nodes=1 --ntasks-per-node=4 --gpus-per-task=1 ${JOBREPORT} -o report -- ./burn --gpu=gemm -d 30
$ ${JOBREPORT} print report

Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| Job Id                                  | 15923                                   |
+-----------------------------------------+-----------------------------------------+
| Step Id                                 | 0                                       |
+-----------------------------------------+-----------------------------------------+
| User                                    | jpcoles                                 |
+-----------------------------------------+-----------------------------------------+
| SLURM Account                           | unknown_account                         |
+-----------------------------------------+-----------------------------------------+
| Start Time                              | 03-07-2024 14:54:48                     |
+-----------------------------------------+-----------------------------------------+
| End Time                                | 03-07-2024 14:55:25                     |
+-----------------------------------------+-----------------------------------------+
| Elapsed Time                            | 36s                                     |
+-----------------------------------------+-----------------------------------------+
| Number of Nodes                         | 1                                       |
+-----------------------------------------+-----------------------------------------+
| Number of GPUs                          | 4                                       |
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | 18.7 Wh                                 |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | 1.8 kW                                  |
+-----------------------------------------+-----------------------------------------+
| Average SM Utilization                  | 88%                                     |
+-----------------------------------------+-----------------------------------------+
| Average Memory Utilization              | 43%                                     |
+-----------------------------------------+-----------------------------------------+

GPU Specific Values
+---------------+------+------------------+------------------+----------------------+
| Host          | GPU  | Elapsed          | SM Utilization % | Memory Utilization % |
|               |      |                  | (avg/min/max)    | (avg/min/max)        |
+---------------+------+------------------+------------------+----------------------+
| nid007044     | 0    | 36s              | 83 / 0 / 100     | 39 / 0 / 50          |
| nid007044     | 0    | 36s              | 90 / 0 / 100     | 43 / 0 / 50          |
| nid007044     | 0    | 36s              | 90 / 0 / 100     | 43 / 0 / 48          |
| nid007044     | 0    | 36s              | 90 / 0 / 100     | 47 / 0 / 54          |
+---------------+------+------------------+------------------+----------------------+
Container Engine
Running jobreport within the Container Engine (CE) requires a little more setup to allow the CE to mount the required GPU library paths inside the container. First, a script that sets up the mount points, described below, needs to be placed in ${HOME}/.config/enroot/hooks.d/. Second, the following configuration needs to be added to the .toml file corresponding to the container in ${HOME}/.edf:
[annotations]
com.hooks.dcgm.enabled = "true"
jobreport itself can generate the necessary script for mounting the GPU libraries in the container. By default, the script will be placed in ${HOME}/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh:
$ ${JOBREPORT} container-hook
Writing enroot hook to "/users/jpcoles/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh"

Add the following to your container .toml file:

[annotations]
com.hooks.dcgm.enabled = "true"
Once the CE is configured, only the new .toml file (here ubuntu-jobreport) needs to be referenced, along with a call to jobreport:
$ http_proxy=http://proxy.cscs.ch:8080 https_proxy=http://proxy.cscs.ch:8080 ${SRUN} --environment=ubuntu-jobreport ${JOBREPORT} -- sleep 5
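For reference, a minimal EDF .toml for this example might look like the sketch below; the file name ubuntu-jobreport.toml and the image are illustrative assumptions, and only the [annotations] section is required for jobreport:

# ${HOME}/.edf/ubuntu-jobreport.toml (illustrative sketch)
image = "ubuntu:24.04"

[annotations]
com.hooks.dcgm.enabled = "true"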
Frequently Asked Questions
Why is my command ignored when using jobreport?
Most likely the call to jobreport and your command were not separated with --. Without this separator, jobreport will interpret (and ignore) the command as one of its own options.
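For example, with a placeholder application ./my_app:

$ srun ./jobreport -o report ./my_app        # wrong: ./my_app is treated as an option and ignored
$ srun ./jobreport -o report -- ./my_app     # correct: ./my_app is run and monitored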
Why is there no output from jobreport?
If the Slurm maximum time is reached before the application can complete gracefully (or crash on its own), jobreport will not be able to write any output. Please ensure that the application can finish before Slurm terminates the allocation.
Known Issues
Occasionally, on some compute nodes, the power will not be read properly, resulting in the following output in the report:
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | *Failed to measure*                     |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | *Failed to measure*                     |
+-----------------------------------------+-----------------------------------------+
Currently, there is an issue when generating the report file via jobreport print on the macOS terminal:
what(): locale::facet::_S_create_c_locale name not valid
/var/spool/slurmd/job32394/slurm_script: line 21: 199992 Aborted (core dumped) ./jobreport print report
To fix this, follow these steps:
- Open the terminal application
- In the top-left corner menu select Terminal -> Settings
- Select your default profile
- Uncheck "Set locale environment variables on startup"
- Quit and reopen the terminal and try again. This should fix the issue.