Notice the [annotations] section, which disables the Slurm and CXI hooks.

MPICH unable to allocate shared memory

We're aware of an issue that prevents open-source MPICH libraries from using shared memory when launching multiple ranks per node, and we're currently investigating it.
The problem usually manifests with the following error message:

Abort(73000719) on node 5: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306)..........: MPI_Init(argc=0xffffeba3326c, argv=0xffffeba33260) failed
MPII_Init_thread(265).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(800).........: 
MPIR_Comm_commit_internal(585): 
MPID_Comm_commit_pre_hook(151): 
MPIDI_world_pre_init(633).....: 
MPIDU_Init_shm_init(179)......: unable to allocate shared memory

The issue should affect only intra-node communication, not inter-node communication. In other words, running one MPI rank per node should work correctly.
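To check whether a failed job hit this particular problem, the error signature shown above can be searched for in the job output. The following is a minimal sketch in Python, not an official tool: the function name `find_shm_failures` is hypothetical, and it assumes that each failure appears in the log as an `Abort(...) on node N:` line followed later by the `unable to allocate shared memory` line, without interleaving from other processes.

```python
import re

# Matches the first line of the MPICH abort message, e.g.
# "Abort(73000719) on node 5: Fatal error in internal_Init: ..."
ABORT_RE = re.compile(r"^Abort\((\d+)\) on node (\d+):")

# Final line of the error stack that identifies this specific issue.
SHM_MARKER = "unable to allocate shared memory"


def find_shm_failures(log_text):
    """Return the numbers reported after 'on node' for every abort
    whose error stack contains the shared-memory allocation failure.

    Assumption: the lines of one error stack are not interleaved
    with the output of other processes.
    """
    failures = []
    current = None  # number taken from the most recent Abort line
    for line in log_text.splitlines():
        match = ABORT_RE.match(line.strip())
        if match:
            current = int(match.group(2))
        elif SHM_MARKER in line and current is not None:
            failures.append(current)
            current = None
    return failures


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_shm.py job.out
    with open(sys.argv[1]) as handle:
        print(find_shm_failures(handle.read()))
```

If the returned list is non-empty, the job most likely failed with the error described in this section rather than with an unrelated MPI initialization problem.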


Further reading