Extreme-scale AI

Presenter: Samuel Antão (AMD)

Content:

  • Model parallelism on LUMI via FSDP or DeepSpeed
  • Scaling beyond a single node

Extra materials

Q&A

  1. Do you have experience with setting CPU affinity for PyTorch Lightning? Is it automatically taken care of?

    • I don't have experience with Lightning in particular, but I'm pretty sure it doesn't work automatically, as the specific setup differs from system to system and there's no easy way to detect it automatically. You can at least check the binding each rank actually receives (see the first sketch after this Q&A).
  2. Is it possible to set CPU affinity without using numactl, and directly via Slurm? I had some scripts using datasets and accelerate that were only able to use the physical cores (0 to n/2, where n is the number of threads shown in htop), unless we launched the script via numactl --cpunodebind 0-3 --membind 0-3. The same would also apply on LUMI-C with NUMA nodes 0-7.

    • Slurm has various options to set the affinity per task (see the second sketch after this Q&A). Note that Linux exposes twice as many 'cpus' as there are cores, because each core supports two hardware threads; it is normally not advantageous to use the high-numbered cpus (the second thread of each core). There is a talk on this topic in the 4-day LUMI comprehensive course, but it is mostly aimed at people running MPI and MPI/OpenMP HPC applications.

    • There is also material in this presentation from the last 2-day course, slide 11 and following (though you may need to look at the slides before that as well for context), and in the recording from 22:20 onwards.

    • The second hardware thread of each core on LUMI is by default unavailable to applications started with srun, as the option --hint=nomultithread is set by default. If you want access to it, you need to specify --hint=multithread, which you can already do on an #SBATCH line (see the last sketch after this Q&A). Slurm uses control groups at the job level, i.e., if only part of a node is allocated to a job, this is enforced with a cgroup on each node, and uses CPU affinity at the task level.
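
First, regarding question 1: since Lightning (or plain PyTorch) does not set CPU affinity by itself, a quick sanity check is to print the binding Slurm gives each task, since your training ranks will inherit exactly that. This is a minimal sketch; the partition name is a placeholder, and you should use the same srun options as your actual training job.

```bash
#!/bin/bash
#SBATCH --job-name=check-binding
#SBATCH --partition=standard-g    # placeholder partition, adjust to your project
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --time=00:05:00

# Print, for every task, the logical CPUs it is allowed to run on.
# PyTorch/Lightning ranks started by this srun inherit exactly this binding.
srun bash -c 'echo "rank ${SLURM_PROCID}: $(grep Cpus_allowed_list /proc/self/status)"'
```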
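Second, regarding question 2: the numactl wrapper can usually be replaced by Slurm's own binding options, --cpu-bind for cores and --mem-bind for memory. The sketch below assumes a full LUMI-G node with one task per GCD; the masks are the commonly used mapping that places each task on the seven cores closest to its GCD, but verify them against the node topology before relying on them, and python train.py stands in for your own launch command.

```bash
#!/bin/bash
#SBATCH --partition=standard-g    # placeholder partition, adjust to your project
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8       # one task per GCD
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --time=01:00:00

# One 7-core mask per task, ordered so that task i is bound to the cores
# in the NUMA quadrant closest to GCD i; core 0 of each quadrant is left
# out for the OS. Check these masks for your node type before using them.
CPU_BIND="mask_cpu:fe000000000000,fe00000000000000,fe0000,fe000000,fe,fe00,fe00000000,fe0000000000"

# --mem-bind=local keeps each task's allocations on its own NUMA node,
# which is what the numactl --membind wrapper in the question was doing.
srun --cpu-bind=${CPU_BIND} --mem-bind=local python train.py
```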
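Finally, on the hardware threads: the second thread of each core only becomes visible when you request it, and with --hint=multithread the --cpus-per-task count is in hardware threads rather than cores. A minimal sketch to see the difference (drop the --hint line to fall back to the nomultithread default); the partition name is again a placeholder.

```bash
#!/bin/bash
#SBATCH --partition=small         # placeholder partition, adjust to your project
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14        # 14 hardware threads = 7 physical cores here
#SBATCH --hint=multithread        # override the default --hint=nomultithread
#SBATCH --time=00:05:00

# With multithreading enabled the task should see both the low-numbered and
# high-numbered sibling threads of its cores; nproc counts what is usable.
srun bash -c 'grep Cpus_allowed_list /proc/self/status; nproc'
```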