Hands-on: Run a simple single-GPU PyTorch AI training job
Exercises on the course GitHub.
Q&A
- I get this: `srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.`
    - It is exactly what the message says: you cannot combine different ways of requesting memory in a single job. You have to choose one of `--mem-per-cpu`, `--mem-per-gpu` or `--mem`.
        - I followed the solution exactly, and it uses only `--mem-per-gpu`.
    - Are you launching the job from a login node shell? If you launch it from a compute node shell, that shell might already have some Slurm settings active.
        - I'm running it from a VS Code terminal.
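To check whether the shell you launch from already carries Slurm memory settings, you can list the three variables named in the error message. This is a minimal sketch assuming a bash shell, not part of the course material; the function name `list_slurm_mem_vars` is made up for illustration.

```shell
#!/bin/bash
# Print whichever of the three memory-related Slurm variables are set in
# the current environment. Any variable listed here that does not match
# the memory option you pass yourself is a candidate for the conflict.
list_slurm_mem_vars() {
    local name value
    for name in SLURM_MEM_PER_CPU SLURM_MEM_PER_GPU SLURM_MEM_PER_NODE; do
        # printenv succeeds only if the variable is set in the environment.
        if value=$(printenv "$name"); then
            echo "${name}=${value}"
        fi
    done
}

list_slurm_mem_vars
```

If a variable shows up that you did not request yourself, it was probably inherited from an existing allocation. Starting from a fresh login node shell, or unsetting that variable before calling `srun`, are two things to try.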
- Would `--cpus-per-gpu=7` also work when allocating on LUMI-G, instead of `--cpus-per-task=7 * ngpus`?
    - If you mean as an alternative to `--cpus-per-task`, then probably, although I haven't tested it.
    - The `srun` manual is a bit confusing here. I'd avoid the option, as it implies another option which is only for job steps. Or at least, the first time you use it, check carefully with techniques we will see later today that you don't get more GPUs in your job allocation than you expected. Slurm can sometimes show very unexpected behaviour with some options. They may also conflict with standard options set for some partitions.
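One way to see what a job actually received, whichever option was used to request cores, is to print a few of the environment variables Slurm sets inside the job. This is a minimal sketch assuming bash; `show_allocation` is a hypothetical helper name, and this is not necessarily the technique covered later in the course.

```shell
#!/bin/bash
# Print a few Slurm variables that describe the CPUs and GPUs of the job.
# Some of them are only set when the matching option was given, so an
# unset variable is reported as <unset> rather than treated as an error.
show_allocation() {
    local name
    for name in SLURM_JOB_ID SLURM_CPUS_PER_TASK SLURM_CPUS_PER_GPU \
                SLURM_CPUS_ON_NODE SLURM_GPUS_ON_NODE; do
        echo "${name}=$(printenv "$name" || echo '<unset>')"
    done
}

show_allocation
```

Calling it directly in the batch script shows the batch job step, while running it through `srun` shows what a regular job step gets, so the two can be compared against what you meant to request.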
- If we used `sbatch run.sh`, why do we use `srun` within it? Is there any advantage to it?
    - `sbatch` only creates a job allocation. Work in Slurm is usually done in job steps, and these are created with `srun`. Some options that you specify with `sbatch` only take effect in regular job steps created with `srun`.

      Well, I am oversimplifying, as `sbatch` does create a special job step, called the "batch job step". But that one only has resources on the first node. It basically gets all the resources requested on the first node of the job. Settings like `--hint=nomultithread` (which is actually a default on LUMI) also have no effect in the batch job step.

      In a single-node job `srun` may not seem useful, but it actually is if you start a multi-process job. Moreover, `srun` is the only way to get access to the other nodes of the job apart from the one on which the batch job step runs. You will see examples later in this course.
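The difference between the batch job step and a regular job step can be sketched in a small batch script. The resource values are placeholders rather than the course's settings, `small-g` is assumed as the partition, and `launch` is a hypothetical helper that falls back to running the command directly when the script is tried outside a Slurm allocation.

```shell
#!/bin/bash
# Assumed partition and placeholder resource values; adjust to your project.
# An --account option is usually needed as well and is left out here.
#SBATCH --partition=small-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=7
#SBATCH --mem-per-gpu=60G
#SBATCH --time=00:10:00

# Hypothetical helper: use srun inside a Slurm allocation, otherwise run
# the command directly so the sketch can also be tried without Slurm.
launch() {
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
        srun "$@"
    else
        "$@"
    fi
}

# Commands at this level run in the batch job step, on the first node only.
echo "batch job step on $(uname -n)"

# This command runs as a regular job step created by srun, once per task.
launch bash -c 'echo "job step task ${SLURM_PROCID:-0} on $(uname -n)"'
```

In a real script you would normally call `srun` directly in place of `launch`; with more than one task or node, the second line is then printed once per task, while the first is still printed only once.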