# Advanced Placement
Presenter: Jean Pourroy (HPE)
Archived materials on LUMI:

- Slides: `/appl/local/training/4day-20231003/files/LUMI-4day-20231003-2_01_Advanced_Application_Placement.pdf`
- Recording: `/appl/local/training/4day-20231003/recordings/2_01_Advanced_Application_Placement.mp4`
These materials can only be distributed to actual users of LUMI (active user account).
**Remark**

The `lumi-CPEtools` module (in the LUMI software stacks, see this afternoon) contains alternatives for `xthi` and the `hello_jobstep` tool. This module also contains the `gpu_check` command which was mentioned in the presentation.
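
A possible way to try these tools (the software stack and partition modules named below are assumptions; use the versions available to you):

```bash
# Load a LUMI software stack and the lumi-CPEtools module, then run gpu_check
# in a job step to print the CPU core(s) and GPU(s) each task is bound to.
module load LUMI/23.09 partition/G lumi-CPEtools
srun --ntasks=8 gpu_check
```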

## Q&A
- Can you explain NUMA domain shortly?
    - NUMA stands for Non-Uniform Memory Access. At a high level it means that when you do a memory operation, some memory is "closer" to the processor than other memory.
    - It was covered in the first presentation yesterday, so you can check the recordings once they are available. Basically, some memory is closer to some chiplets of the AMD processor than to others. On any machine, each socket is a NUMA domain, as the communication between two sockets is bandwidth-limited and adds a lot of latency to memory accesses (roughly 3.2x). On AMD processors a single socket can additionally be subdivided into 4 NUMA domains, but there the latency difference is only about 20%. A sketch of how to inspect this from the command line follows.
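
    A minimal sketch of inspecting the NUMA layout, assuming `numactl` is available on the node:

    ```bash
    # Print one "node" per NUMA domain: its CPU cores, its memory size,
    # and a distance matrix with the relative memory-access cost between domains.
    numactl --hardware
    ```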
- How do I see from the command line in which NUMA domain my GPUs are?
    - Future slides will cover this. `lstopo` is a command that can give such information. But it may be easier to work with the tables in the documentation, and once you start running we have a program in the `lumi-CPEtools` module that shows you information about the binding of each MPI task. It is very similar to a program that will be mentioned later in this lecture. You really have to look at the PCIe address, as both the ROCm runtime and so-called Linux control groups can cause renumbering of the GPUs. The `rocm-smi` command also has an option (`--showtopo`) to show which NUMA domain a GPU is attached to (see the sketch below).
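
    A minimal sketch of the commands mentioned above, to be run on a GPU node; exact option and output names depend on the installed ROCm and hwloc versions:

    ```bash
    # Show, per GPU, the NUMA node it is attached to.
    rocm-smi --showtopo

    # Show the PCIe bus ID of each GPU (the stable way to identify a GPU,
    # since the logical GPU numbering can change).
    rocm-smi --showbus

    # Draw the full node topology: NUMA domains, cores, caches and PCIe devices.
    lstopo
    ```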
- Where is the `gpu_check` tool? In perftools?
    - It is in `lumi-CPEtools`.
- So, if we check `SLURM_LOCALID` in our application and assign the GPU based on that (using the given mapping), we should have good binding?
    - Yes. But only if you use exclusive nodes (so nodes in `standard-g`, or full nodes in `small-g` with `--exclusive`), as doing binding requires full control over all resources in the node. So if your application is very sensitive to communication between CPU and GPU, you have to use full nodes, and if your application cannot use all GPUs, try to use `srun` to start multiple copies of your application with different input on the same node. An equivalent way to get the same per-task GPU assignment without changing the application is a small wrapper script; see the sketch below.
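
    A minimal sketch of that wrapper-script approach, assuming a full `standard-g` node with 8 tasks per node (the application name `my_app` is a placeholder, and the simple identity mapping below ignores the CPU-to-GCD mapping from the slides, which you would also want to follow when pinning CPU cores):

    ```bash
    #!/bin/bash
    # select_gpu: per-task wrapper started by srun.
    # SLURM_LOCALID is the node-local task rank (0-7 with 8 tasks per node),
    # so each task sees exactly one GCD.
    export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
    exec "$@"
    ```

    launched from the job script as, e.g.:

    ```bash
    srun --ntasks-per-node=8 ./select_gpu ./my_app
    ```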
- So, just to confirm: was using `--gpu-bind` OK or not?
    - Not OK if you want the tasks that `srun` starts to communicate with each other. It causes GPU-aware MPI to fail, as it can no longer do GPU-memory-to-GPU-memory copies for intra-node communication between tasks.
- Can we take advantage of GPU binding if using one GPU only?
    - Only if you allocate a whole node and then start 8 of those tasks by playing a bit with `srun` and some scripts to start your application (e.g., use one `srun` to start 8 of those tasks via a script that does not only set `ROCR_VISIBLE_DEVICES` but also directs each copy of the application to the right input files; see the sketch below). It is also possible to play with multiple `srun` commands running in the background, but several times in the past we have had problems with that on LUMI. We cannot always guarantee a good mapping on `small-g` between the cores that you get allocated and the GPU that you get allocated. So if you have a lot of CPU-to-GPU communication going on, you may notice large variations in performance between jobs on `small-g`.
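
    A minimal sketch of such a start-up script, assuming input files `input_0/case.in` … `input_7/case.in` and an application `my_app` (all names are placeholders):

    ```bash
    #!/bin/bash
    # run_one.sh: started 8 times on a full node with, e.g.,
    #   srun --ntasks-per-node=8 ./run_one.sh
    # Each copy gets its own GPU and reads its own input file.
    export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
    exec ./my_app "input_${SLURM_LOCALID}/case.in"
    ```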