...

We're aware of an issue that prevents containers with open source MPICH libraries from using shared memory when launching multiple ranks per node. We have identified the cause of the issue and are working to apply a fix.
The problem typically manifests with the following error message:

...

The issue should affect only intra-node communication, not inter-node communication. In other words, running 1 MPI rank per node should work correctly.
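
For illustration, a launch that places a single rank on each node should not be affected. The sketch below assumes a hypothetical MPI binary ./my_mpi_app; the srun options are otherwise the same as for a regular containerized job:

Code Block
srun --nodes=4 --ntasks-per-node=1 --environment=<name or path> ./my_mpi_app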

A temporary workaround is to ensure that the PMI_RANK environment variable inside the container matches the actual MPI rank of the process within the job step.
This can be achieved by wrapping the original container command in a bash command, as shown below:

Code Block
srun <srun options> --environment=<name or path> bash -c 'export PMI_RANK=${SLURM_PROCID}; <regular container command>'
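
As a hypothetical example, with an environment definition named mpich-env and an MPI binary ./my_mpi_app, the wrapped launch could look like the following:

Code Block
srun --nodes=2 --ntasks-per-node=4 --environment=mpich-env bash -c 'export PMI_RANK=${SLURM_PROCID}; ./my_mpi_app'

SLURM_PROCID is set by Slurm to the rank of each task within the job step, so exporting it as PMI_RANK inside the container restores the rank numbering that MPICH expects.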

Further reading