Relevant for GH200 nodes

Overview

To share the GPU(s) on a node between multiple MPI ranks, you currently need to manually start the NVIDIA daemon that provides the CUDA Multi-Process Service (MPS). On previous machines this was enabled by setting the environment variable CRAY_CUDA_MPS, but this is no longer supported.
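
To check whether an MPS control daemon is already running on a node, you can query it through nvidia-cuda-mps-control. The following is a minimal sketch, assuming CUDA_MPS_PIPE_DIRECTORY points at the same pipe directory used by the wrapper script below; if no daemon is running, the query reports an error rather than a server list.

MPS status check
# Ask the MPS control daemon for its list of active servers.
# Assumes the pipe directory matches the one set in mps-wrapper.sh below.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
echo get_server_list | nvidia-cuda-mps-control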

Wrapper Script

Use the following wrapper script to start the CUDA Multi-Process Service (MPS) on each node and to bind each rank to the GPU and memory of its local NUMA node:

MPS wrapper script
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun --cpu-bind=socket [srun args] mps-wrapper.sh [cmd] [cmd args]

# only this path is supported by MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$(id -un)
# Launch MPS from a single rank per node
if [[ $SLURM_LOCALID -eq 0 ]]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi

# Select the GPU(s) attached to this rank's NUMA node(s): taskset reports
# the rank's CPU affinity mask, and hwloc-calc converts it into physical
# NUMA node indices, which the script assumes match the GPU indices on GH200.
numa_nodes=$(hwloc-calc --physical --intersect NUMAnode $(taskset -p $$ | awk '{print "0x"$6}'))
export CUDA_VISIBLE_DEVICES=$numa_nodes

# Wait for MPS to start
sleep 1
# Run the command
numactl --membind=$numa_nodes "$@"
result=$?
# Quit MPS control daemon before exiting
if [[ $SLURM_LOCALID -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi
exit $result
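
The batch example in the next section launches the wrapper directly with srun, so the script must be executable and reachable from the job's working directory. Assuming it was saved as mps-wrapper.sh in the current directory, a one-off setup step is:

Make the wrapper executable
chmod +x ./mps-wrapper.sh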

Example Usage

An example of using the wrapper script above is provided in the following sample Slurm batch script:

Batch submission script
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00 #HH:MM:SS
#SBATCH --nodes=2
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32 #32 MPI ranks per node
#SBATCH --cpus-per-task=8 #8 OMP threads per rank
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPICH_MALLOC_FALLBACK=1

ulimit -s unlimited

srun --cpu-bind=socket ./mps-wrapper.sh <code> <args>
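
Before running a full application, a short check can confirm that ranks are spread across the node's four GPUs as expected. The following sketch (an illustrative single-node, 32-rank job step; adjust the srun options to your allocation) prints the GPU binding the wrapper assigns to each rank:

Rank-to-GPU mapping check
srun -N 1 -n 32 --cpu-bind=socket ./mps-wrapper.sh \
    bash -c 'echo "rank ${SLURM_PROCID}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'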

Further Documentation

Full documentation on CUDA MPS is available in the NVIDIA Multi-Process Service documentation at https://docs.nvidia.com/deploy/mps/index.html.