To use GPU-aware MPI on Alps systems, you must use Cray MPICH, which is available through user environments. Most user environments provided by CSCS for GPU systems link against the GTL libraries required for GPU-aware MPI. Using GPU-aware MPI is recommended for the best communication performance on GPU systems.
To tell Cray MPICH to actually enable GPU-aware MPI support at runtime, the `MPICH_GPU_SUPPORT_ENABLED=1` environment variable must be set, for example by exporting it in a SLURM batch script. On some CSCS systems this is set by default for all users. Linking against the GTL libraries and enabling GPU-aware MPI support has no impact on applications that do not use the feature.
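As a minimal sketch, a SLURM batch script that enables GPU-aware MPI at runtime might look as follows; the job name, resource requests, and executable name (`./my_gpu_app`) are placeholders, not part of the original documentation:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-aware-mpi   # hypothetical job name
#SBATCH --nodes=2                  # hypothetical resource request
#SBATCH --ntasks-per-node=4

# Tell Cray MPICH to enable GPU-aware MPI support at runtime
export MPICH_GPU_SUPPORT_ENABLED=1

# Launch the application; ./my_gpu_app is a placeholder executable
srun ./my_gpu_app
```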
Attempting to communicate GPU buffers through MPI without setting `MPICH_GPU_SUPPORT_ENABLED=1` leads to segmentation faults, usually without any indication that it is the communication that failed.
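If you are unsure whether the variable actually reaches your job, one way to check (a sketch, assuming a standard SLURM setup) is to print the environment seen by a launched rank:

```bash
# Print the MPICH GPU support setting as seen by a single rank;
# an empty result means the variable is not set in the job environment.
srun --ntasks=1 env | grep MPICH_GPU_SUPPORT_ENABLED
```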
If `MPICH_GPU_SUPPORT_ENABLED` is set to `1` and your application does not link against one of the GTL libraries, you will get a clear error during MPI initialization, similar to:
```
MPICH ERROR [Rank 0] [job id 410301.1] [Thu Feb 13 12:42:18 2025] [nid005414] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked (Other MPI error)

aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
```
If your application does not use GPU-aware MPI and you do not wish to link against the GTL library, either unset `MPICH_GPU_SUPPORT_ENABLED` or explicitly set it to `0` in your launch scripts.
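For example, in a launch script for an application that does not use GPU-aware MPI, either of the following lines (a sketch, not taken from the original documentation) disables the feature:

```bash
# Explicitly disable GPU-aware MPI support in Cray MPICH...
export MPICH_GPU_SUPPORT_ENABLED=0
# ...or remove the variable from the environment entirely
unset MPICH_GPU_SUPPORT_ENABLED
```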
If you intended to use GPU-aware MPI, make sure you are using an up-to-date user environment with GPU-aware MPI support. Most user environments with CUDA support automatically link against the GTL library.
To check whether your application is linked against the required GTL library, run `ldd` on your executable; the output should include something similar to:
```
$ ldd myexecutable | grep gtl
        libmpi_gtl_cuda.so => /user-environment/linux-sles15-neoverse_v2/gcc-13.2.0/cray-gtl-8.1.30-fptqzc5u6t4nals5mivl75nws2fb5vcq/lib/libmpi_gtl_cuda.so (0x0000ffff82aa0000)
```
The path may differ, but the `libmpi_gtl_cuda.so` library should be listed when using CUDA. In ROCm environments, the `libmpi_gtl_hsa.so` library should be linked instead.
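The same check applies in a ROCm-based environment; the sketch below assumes an executable named `myexecutable` and shows only the expected library name, since the exact path varies between environments:

```bash
# On ROCm systems, the HSA variant of the GTL library should appear instead,
# e.g. a line of the form: libmpi_gtl_hsa.so => /.../libmpi_gtl_hsa.so (...)
ldd myexecutable | grep gtl
```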