
Hands-on: Converting the PyTorch single GPU AI training job to use all GPUs in a single node via DDP

Exercises on the course GitHub.
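
For orientation, here is a minimal sketch of what the converted single-node script can look like when launched with torchrun (e.g. `torchrun --nproc_per_node=4 train_ddp.py`); the toy model, dataset, and script name are placeholders rather than the actual exercise code:

    # Minimal single-node DDP sketch, launched with e.g.:
    #   torchrun --nproc_per_node=4 train_ddp.py
    # (On ROCm systems such as LUMI, the "nccl" backend name maps to RCCL.)
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler


    def main():
        # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model and data standing in for the exercise's training script.
        model = DDP(torch.nn.Linear(16, 1).to(local_rank), device_ids=[local_rank])
        data = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))

        # DistributedSampler gives each rank a disjoint shard of the dataset.
        sampler = DistributedSampler(data)
        loader = DataLoader(data, batch_size=32, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle the shards between epochs
            for x, y in loader:
                x, y = x.to(local_rank), y.to(local_rank)
                optimizer.zero_grad()
                loss = F.mse_loss(model(x), y)
                loss.backward()  # gradients are all-reduced across ranks here
                optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()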

Q&A

  1. I'm not sure if this is the right section. I'm often getting the following error when running a job on small-g with 4 out of the node's 8 GPUs.

    torch.distributed.DistBackendError: NCCL error in: 
      ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error, 
      NCCL version 2.16.5
    ncclInternalError: Internal check failed.
    Last error:
    Failed to find reverse path from remNode 0/c1000 nlinks 4 to node 0/d9000
    
    Any idea why? It happens randomly. Yes, we use export CUDA_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES

    • How are you setting device visibility? If you are using ROCR_VISIBLE_DEVICES, try using HIP_VISIBLE_DEVICES instead (see the batch-script sketch below this list).

    • We have seen this error before but are not sure what the reason for it is. It is indeed not predictable, and it only seems to happen on small-g and dev-g when reserving multiple GPUs but not the full node exclusively. We suspect it has to do with assumptions about CPU-GPU binding in PyTorch/NCCL that do not hold when partial nodes are reserved. But I guess the "good" thing about its randomness is that you can re-run your job and it will probably work... -Lukas
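
    A rough sketch of the suggested workaround in a Slurm batch script. The partition, resource values, and script name are placeholders, and unsetting ROCR_VISIBLE_DEVICES after copying it is an assumption, not something confirmed above:

        #!/bin/bash
        #SBATCH --partition=small-g
        #SBATCH --nodes=1
        #SBATCH --gpus-per-node=4
        #SBATCH --ntasks-per-node=1
        #SBATCH --time=00:30:00

        # Slurm exposes the allocated GPUs via ROCR_VISIBLE_DEVICES.
        # Copy that list into HIP_VISIBLE_DEVICES and drop the ROCR variable,
        # so device selection happens at the HIP level instead.
        export HIP_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES
        unset ROCR_VISIBLE_DEVICES

        # One launcher task per node; torchrun spawns the 4 worker processes.
        srun torchrun --nproc_per_node=4 train_ddp.py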