Hands-on: Converting the PyTorch single GPU AI training job to use all GPUs in a single node via DDP
Exercises on the course GitHub.
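For reference, a minimal sketch of what the single-GPU-to-DDP conversion looks like (illustrative only: the model, data and hyperparameters are placeholders, the real exercise code is in the course repository, and it assumes a launcher such as torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # One process per GPU; the process group uses NCCL (RCCL on LUMI's AMD GPUs).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # bind this process to its own GPU

    # Placeholder model and data, just to show the moving parts.
    model = DDP(torch.nn.Linear(10, 10).cuda(), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 10))
    sampler = DistributedSampler(dataset)  # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

For a quick test inside an allocation, something like `torchrun --standalone --nproc_per_node=<number of GPUs> train_ddp.py` (file name illustrative) launches one process per GPU; the exercises show the LUMI-specific Slurm launch.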
Q&A
-
I'm not sure if this is the right section. I often get the following error when running a job on small-g with 4 out of 8 GPUs. Any idea why? It happens randomly:
`torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error, NCCL version 2.16.5 ncclInternalError: Internal check failed. Last error: Failed to find reverse path from remNode 0/c1000 nlinks 4 to node 0/d9000`
-
How are you setting device visibility? If you are using ROCR_VISIBLE_DEVICES, try using HIP_VISIBLE_DEVICES instead (see the sketch at the end of this Q&A).
-
Yes, we use `export CUDA_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES`.
-
We have seen this error before but are not sure what the reason for it is. It is indeed not predictable, and it only seems to happen on small-g and dev-g when reserving multiple GPUs but not the full node exclusively. We suspect it has to do with assumptions about CPU-GPU binding in PyTorch/NCCL that do not hold when partial nodes are reserved. But I guess the "good" thing about its randomness is that you can re-run your job and then it probably works... -Lukas
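Regarding the device-visibility suggestion above, one possible shape for the workaround (an assumption, not a confirmed fix) is to hand the GPU list that Slurm exports in ROCR_VISIBLE_DEVICES over to HIP_VISIBLE_DEVICES before the HIP runtime is initialized, so that only one mechanism filters the GPUs; the equivalent export/unset could also be done in the batch script before launching Python:

```python
import os

# Assumption: Slurm exposes the allocated GPUs via ROCR_VISIBLE_DEVICES.
# Move that list into HIP_VISIBLE_DEVICES and drop ROCR_VISIBLE_DEVICES so
# the two variables do not filter the devices twice.
rocr_devices = os.environ.pop("ROCR_VISIBLE_DEVICES", None)
if rocr_devices is not None and "HIP_VISIBLE_DEVICES" not in os.environ:
    os.environ["HIP_VISIBLE_DEVICES"] = rocr_devices

# Import torch only after the environment is adjusted, so the HIP runtime
# sees the intended visibility settings when it starts up.
import torch

print("Visible GPUs:", torch.cuda.device_count())
```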