Bugs caused by asynchronous communication¶
Issue 1¶
See the issue raised in ticket 7719.
A user was running an in-house CFD code and got very strange issues when spreading over 2 nodes, but using 4 GCDs on each node. For some combinations of GCDs, NaNs were frequently produced (0, 2, 4, 6 worked correctly though which are the ones that are not connected directly to the NICs). No issues were observed in the cPU-only version of the code.
In the end, it turned out to be a code error and not a bug in Cray MPI. There was a missing device synchronisation and sometimes the send buffers were not completely updated in GPU memory before MPI performed the send.
To quote from the user: This would explain many of the observations:
- MPI_WaitAll does not raise any error because MPI does actually complete the communications; simply the buffers were not updated.
- This obviously cannot occur on CPU.
- This was a silent bug on their local supercomputer using NVIDIA GPUs. I read that CUDA is more robust with regards to synchronization aspects compared to ROCm. Nevertheless, this is still unsafe and we should fix it on CUDA side as well.
- This can apparently also depends on MPI stacks, hence probably the reason why it started to appear after the maintenance on LUMI.
- The results were dependent on the GCDs arrangement, and this would make sense considering that communications can be scheduled differently then.