Exercise session 7¶

See /project/project_465000524/slides/HPE/Exercises.pdf for the exercises.
Files are in /project/project_465000524/exercises/HPE/day3
Permanent archive on LUMI:
- Exercise notes in /appl/local/training/4day-20230530/files/LUMI-4day-20230530-Exercises_HPE.pdf
- Exercises as bizp2-compressed tar file in /appl/local/training/4day-20230530/files/LUMI-4day-20230530-Exercises_HPE.tar.bz2
- Exercises as uncompressed tar file in /appl/local/training/4day-20230530/files/LUMI-4day-20230530-Exercises_HPE.tar

Q&A¶

I tried perfools-lite on another example and got the following message from pat-report:

Observation:  MPI Grid Detection

    There appears to be point-to-point MPI communication in a 4 X 128
    grid pattern. The 24.6% of the total execution time spent in MPI
    functions might be reduced with a rank order that maximizes
    communication between ranks on the same node. The effect of several
    rank orders is estimated below.

    No custom rank order was found that is better than the RoundRobin
    order.

    Rank Order    On-Node    On-Node  MPICH_RANK_REORDER_METHOD
                 Bytes/PE  Bytes/PE%
                            of Total
                            Bytes/PE

    RoundRobin  1.517e+11    100.00%  0
          Fold  1.517e+11    100.00%  2
           SMP  0.000e+00      0.00%  1

Normally for this code, SMP rank ordering should make sure that collective communication is all intra-node and inter-node communication is limited to point-to-point MPI calls. So I don't really get why the recommendation is to switch to RoundRobin (if I understand this remark correctly)? Is this recommendation only based on analysing point-to-point communication?

Answer: Yes, you understood the remark correctly. This warning means that Cray PAT detected a suboptimal communication topology and according to the tool estimate, a round-robin rank ordering should maximize intra-node communications. There is a session about that at the beginning of the afternoon.

Reply: I would be very surprised if round-robin rank ordering would be beneficial in this case. I tried to run a job with it, but this failed with:

srun: error: task 256 launch failed: Error configuring interconnect

and similar lines for each task. The job script looks as follows:

module load LUMI/22.12 partition/C
module load cpeCray/22.12
module load cray-hdf5-parallel/1.12.2.1
module load cray-fftw/3.3.10.3

export MPICH_RANK_REORDER_METHOD=0
srun ${executable}