This page documents known MPI issues found on Alps systems, along with workarounds when available.
"cxil_map: write error" when doing inter-node GPU-aware MPI communication
When performing inter-node GPU-aware MPI communication with Cray MPICH after the October 2024 update on Alps, applications fail with:
cxil_map: write error
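For context, GPU-aware MPI means passing device pointers directly to MPI calls. Below is a minimal sketch of the kind of inter-node transfer that triggers the error; it assumes Cray MPICH with GPU support enabled (MPICH_GPU_SUPPORT_ENABLED=1), and the message size and rank placement are illustrative, not taken from this page:

/* Minimal sketch: a CUDA device buffer handed directly to MPI.
 * With the two ranks placed on different nodes, this device-to-device
 * path across nodes is where "cxil_map: write error" appears. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;  /* illustrative message size */
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}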
Workaround
Until the issue is properly resolved on the system, the only workaround is to avoid GPU-aware MPI, for example by staging transfers through host memory as sketched below.
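The sketch below shows one way to stage transfers through pinned host memory so that MPI only ever sees host pointers. The helper names send_via_host and recv_via_host are illustrative, not part of any library:

#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical helpers: copy device data through pinned host memory
 * and let MPI operate on the host buffer only. */
static void send_via_host(const double *d_buf, int n, int dest, MPI_Comm comm) {
    double *h_buf;
    cudaMallocHost((void **)&h_buf, n * sizeof(double));  /* pinned staging buffer */
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, comm);
    cudaFreeHost(h_buf);
}

static void recv_via_host(double *d_buf, int n, int src, MPI_Comm comm) {
    double *h_buf;
    cudaMallocHost((void **)&h_buf, n * sizeof(double));
    MPI_Recv(h_buf, n, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaFreeHost(h_buf);
}

In a real application you would reuse one persistent staging buffer rather than allocating per call, to avoid repeated pinning overhead; the extra host copy is the price of the workaround.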
For users of CP2K encountering this issue, the use of COSMA, which relies on GPU-aware MPI, can be disabled by placing the following in the &GLOBAL section of the input file:
&FM
  TYPE_OF_MATRIX_MULTIPLICATION SCALAPACK
&END FM
Unless you run RPA calculations, this should have a limited impact on performance.
MPI_THREAD_MULTIPLE does not work
The issue has been resolved in Cray MPICH version 8.1.30.
When using MPI_THREAD_MULTIPLE on Grace-Hopper systems, Cray MPICH may fail with an assertion similar to:
Assertion failed [...]: (&MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX)->count == 0
or
Assertion failed [...]: MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX.count == 0
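The failure mode is tied to the threading level requested at initialization. A minimal sketch of a program running in this mode (the check on provided is standard MPI practice, not specific to this issue):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* Request fully multi-threaded MPI; on affected Cray MPICH
     * versions the assertions above can fire once several threads
     * make MPI calls concurrently in this mode. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided\n");

    /* ... concurrent MPI calls from multiple threads ... */

    MPI_Finalize();
    return 0;
}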
Workaround
The issue can be worked around by setting the following environment variable, which makes Cray MPICH fall back to a less optimized implementation of MPI_THREAD_MULTIPLE:
export MPICH_OPT_THREAD_SYNC=0
MPICH hangs
Cray MPICH on Grace-Hopper systems may hang at larger scales.
Workaround
There are many possible reasons why an application might hang, many of them unrelated to MPICH. However, if you are experiencing hangs, the issue may be worked around by setting:
export FI_MR_CACHE_MONITOR=disabled
Performance may be negatively affected by this option.
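If editing job scripts is impractical, the variable can also be set from the application itself before MPI is initialized. A sketch, assuming the setting is only read when the network stack is brought up inside MPI_Init (as is the case for Cray MPICH over libfabric):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Must happen before MPI_Init, which is when libfabric reads
     * FI_MR_CACHE_MONITOR; the final argument overwrites any value
     * inherited from the environment. */
    setenv("FI_MR_CACHE_MONITOR", "disabled", 1);

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}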