This page documents known MPI issues found on Alps systems, along with workarounds when available.

Existing issues

Cray MPICH hangs

Cray MPICH on Grace-Hopper systems may hang on larger runs.

Workaround

There are many possible reasons why an application might hang, many of them unrelated to MPICH. However, if you are experiencing hangs, the issue may be worked around by setting:

export FI_MR_CACHE_MONITOR=disabled

Performance may be negatively affected by this option.

Resolved issues

"cxil_map: write error" when doing inter-node GPU-aware MPI communication

Fix

The issue was caused by a system misconfiguration and has been resolved by a system update; the workaround below is no longer needed.

When doing inter-node GPU-aware MPI communication with Cray MPICH after the October 2024 update on Alps, applications would fail with:

cxil_map: write error


The only workaround was to avoid inter-node GPU-aware MPI.
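
For context, "GPU-aware MPI" here means passing GPU (device) buffers directly to MPI calls. The following is a minimal sketch of such a communication pattern, not taken from an affected application; it assumes CUDA and an MPI library built with GPU support (on Cray MPICH typically enabled via MPICH_GPU_SUPPORT_ENABLED=1) and two ranks placed on different nodes.

/* Illustrative sketch of inter-node GPU-aware MPI communication. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;                                  /* device memory, not host memory */
    cudaMalloc((void **)&d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    /* Passing the device pointer straight to MPI is what makes this
     * "GPU-aware": the MPI library and the network layer underneath must
     * register and transfer GPU memory directly. */
    if (rank == 0) {
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}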

Users of CP2K encountering this issue can disable the use of COSMA, which uses GPU-aware MPI, by placing the following in the &GLOBAL section of the input file:

&FM
  TYPE_OF_MATRIX_MULTIPLICATION SCALAPACK
&END FM

Unless you run RPA calculations, this should have limited impact on performance.

MPI_THREAD_MULTIPLE does not work

Fix

The issue has been resolved in Cray MPICH version 8.1.30.

When using MPI_THREAD_MULTIPLE on Grace-Hopper systems, Cray MPICH may fail with an assertion similar to:

Assertion failed [...]: (&MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX)->count == 0

or

Assertion failed [...]: MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX.count == 0

The issue can be worked around by falling back to a less optimized implementation of MPI_THREAD_MULTIPLE by setting:

export MPICH_OPT_THREAD_SYNC=0
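
Only applications that request MPI_THREAD_MULTIPLE when initializing MPI are affected. As a point of reference, a minimal sketch of such an initialization (illustrative only, not taken from an affected application):

/* Illustrative sketch: requesting MPI_THREAD_MULTIPLE at initialization. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... MPI calls may now be issued concurrently from multiple threads ... */

    MPI_Finalize();
    return 0;
}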


