This page documents known MPI issues found on Alps systems, along with workarounds when available.
Existing Issues
Cray MPICH hangs
Cray MPICH on Grace-Hopper systems may hang on larger runs.
Workaround
Applications can hang for many reasons, many of them unrelated to MPICH. However, if you are experiencing hangs, the following setting may work around the issue:
export FI_MR_CACHE_MONITOR=disabled
Performance may be negatively affected by this option.
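As a minimal sketch, the workaround can be applied in a job script or interactive shell before the application is launched, and a quick check confirms that child processes inherit it. This assumes the launcher passes the caller's environment on to the application, which is the default behaviour of Slurm's srun; the application name ./my_app is a hypothetical placeholder.

```shell
# Apply the workaround for this shell and every process started from it.
export FI_MR_CACHE_MONITOR=disabled

# Confirm that a child process sees the setting.
sh -c 'echo "FI_MR_CACHE_MONITOR=${FI_MR_CACHE_MONITOR}"'
# prints: FI_MR_CACHE_MONITOR=disabled

# Hypothetical launch line (./my_app is a placeholder, not a real binary):
# srun ./my_app

# Alternative: scope the setting to a single launch instead of the whole shell.
# FI_MR_CACHE_MONITOR=disabled srun ./my_app
```

Scoping the variable to a single launch keeps the possible performance penalty confined to the runs that actually hang.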
Resolved issues
"cxil_map: write error" when doing inter-node GPU-aware MPI communication
Fix
The issue was caused by a system misconfiguration. It has been resolved by a system update, and the workaround is no longer needed.
When doing inter-node GPU-aware communication with Cray MPICH after the October 2024 update on Alps, applications would fail with:
cxil_map: write error
MPI_THREAD_MULTIPLE does not work
Fix
When using MPI_THREAD_MULTIPLE on Grace-Hopper systems, Cray MPICH may fail with an assertion similar to:
Assertion failed [...]: (&MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX)->count == 0
or
Assertion failed [...]: MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX.count == 0