Relevant for GH200 nodes
Overview
To share a GPU (or multiple GPUs) between several MPI ranks on the same node, you currently need to manually start the NVIDIA multi-process daemon, called the CUDA Multi-Process Service (MPS). On previous machines this was enabled by setting the environment variable CRAY_CUDA_MPS, but this is no longer supported.
Wrapper Script
Use the following simple wrapper script to start the CUDA Multi-Process Service (MPS):
MPS wrapper script
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun --cpu-bind=socket [srun args] mps-wrapper.sh [cmd] [cmd args]

# only this path is supported by MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$(id -un)

# Launch MPS from a single rank per node
if [[ $SLURM_LOCALID -eq 0 ]]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi

# Set this rank's visible GPU to the one matching its NUMA node (CPU socket)
numa_nodes=$(hwloc-calc --physical --intersect NUMAnode $(taskset -p $$ | awk '{print "0x"$6}'))
export CUDA_VISIBLE_DEVICES=$numa_nodes

# Wait for MPS to start
sleep 1

# Run the command, binding its memory to the local NUMA node
numactl --membind=$numa_nodes "$@"
result=$?

# Quit MPS control daemon before exiting
if [[ $SLURM_LOCALID -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi
exit $result
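As a quick sanity check, the wrapper can be exercised interactively before running a full application. The snippet below is a minimal sketch, assuming an existing single-node allocation (for example from salloc) and that mps-wrapper.sh is in the current directory; each rank prints the GPU the wrapper selected for it, and ranks bound to different sockets should report different devices.

# Minimal interactive check: each rank reports the GPU chosen by the wrapper
srun -N1 -n4 --cpu-bind=socket ./mps-wrapper.sh \
    bash -c 'echo "rank ${SLURM_PROCID}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'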
Example Usage
An example of how to use the wrapper script above is provided in the following sample Slurm batch script:
Batch submission script
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00          # HH:MM:SS
#SBATCH --nodes=2
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32     # 32 MPI ranks per node
#SBATCH --cpus-per-task=8        # 8 OpenMP threads per rank
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPICH_MALLOC_FALLBACK=1
ulimit -s unlimited

srun --cpu-bind=socket ./mps-wrapper.sh <code> <args>
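With 32 ranks per node sharing the four GPUs of a GH200 node through MPS, --cpu-bind=socket keeps each rank on the CPU socket that is local to the GPU the wrapper assigns to it. If you want Slurm to report the resulting binding in the job output, the launch line can be replaced by the variant below (a sketch only; the verbose flag just adds diagnostic output and does not change the binding).

# Same launch as in the script above, but with Slurm printing each rank's
# CPU binding mask so the socket (and hence GPU) assignment can be verified
srun --cpu-bind=verbose,socket ./mps-wrapper.sh <code> <args>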
Further Documentation
Full documentation on CUDA MPS is available in the NVIDIA Multi-Process Service documentation: https://docs.nvidia.com/deploy/mps/index.html