Bugs caused by asynchronous communication

Issue 1

See the issue raised in ticket 7719.

A user was running an in-house CFD code and saw very strange behaviour when spreading a job over 2 nodes while using 4 GCDs on each node. For some combinations of GCDs, NaNs were frequently produced (GCDs 0, 2, 4 and 6 worked correctly, though; these are the ones that are not connected directly to the NICs). No issues were observed in the CPU-only version of the code.

In the end, it turned out to be a code error and not a bug in Cray MPI. A device synchronisation was missing, so the send buffers were sometimes not yet fully updated in GPU memory when MPI performed the send.
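The failure mode can be sketched without any GPU at all. In the minimal Python analogy below (all names are mine, not from the user's code), a worker thread stands in for an asynchronous GPU kernel filling the send buffer, and `join()` stands in for the missing device synchronisation (e.g. `hipDeviceSynchronize`/`hipStreamSynchronize` in HIP): if the "send" reads the buffer before the worker has finished, it transmits stale NaNs, exactly as observed.

```python
import threading
import time

def fill_buffer(buf):
    """Stands in for an asynchronous GPU kernel updating the send buffer."""
    time.sleep(0.05)  # the kernel takes a while to complete
    for i in range(len(buf)):
        buf[i] = float(i)

# Send buffer starts out as stale data (NaNs, as in the bug report).
buf = [float("nan")] * 8

worker = threading.Thread(target=fill_buffer, args=(buf,))
worker.start()   # like launching a kernel on a stream: returns immediately

# Without this join -- the analogue of the missing device synchronisation --
# handing buf to the "send" right now could transmit the stale NaNs.
worker.join()

snapshot = list(buf)  # the point where MPI reads the send buffer
assert not any(x != x for x in snapshot)  # no NaNs: buffer fully updated
```

The fix in the real code is the same shape: synchronise the device (or the relevant stream) after the kernel that fills the send buffer and before posting the MPI send.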

To quote the user, this would explain many of the observations:

  • MPI_Waitall does not raise any error because MPI does actually complete the communications; it is simply that the buffers had not been updated yet.
  • This obviously cannot occur in the CPU-only version.
  • This was a silent bug on their local supercomputer using NVIDIA GPUs. CUDA is reportedly more forgiving about synchronisation than ROCm. Nevertheless, the code is still unsafe and should be fixed on the CUDA side as well.
  • This can apparently also depend on the MPI stack, which is probably why it started to appear after the maintenance on LUMI.
  • The results were dependent on the GCD arrangement, which makes sense considering that the communications can be scheduled differently in that case.