# Questions session 9 May 2023
## LUMI Hardware
- The "Real LUMI-G node" slide presented the CPU with two threads per core. I assume this is just to show what the CPU allows, but you are not using hyper-threading, right?

  Answer: SMT is activated but not used by default unless you ask for it in your job script with `--hint=multithread`. The same applies to the LUMI-C nodes.
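  For illustration, a minimal batch script that enables SMT could look like the sketch below (the partition, task count and executable name are only example values, not from the session):

  ```
  #!/bin/bash
  #SBATCH --partition=standard
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=256    # one rank per hardware thread (2 per core on a 128-core LUMI-C node)
  #SBATCH --hint=multithread       # allow both hardware threads of each core to be used

  srun ./my_app                    # placeholder for your own executable
  ```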
## Programming Environment & Module System
- Is there a way to reload all sticky modules with one command if you have first unloaded all sticky modules?

  Answer: If you have force-purged all modules, you can get the default environment back with `module restore`, but it will not reload any non-default sticky modules you may have loaded previously.
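  For example, after a forced purge the default environment can be restored as follows (standard Lmod commands, as mentioned in the answer):

  ```
  module --force purge   # also unloads sticky modules
  module restore         # reload the default module collection
  ```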
## Using and Installing Software
- Is "lumi-container-wrapper" related to https://cotainr.readthedocs.io/en/latest/ by DeiC from Denmark?

  Answer:

  - No, it is a different tool. It is the LUMI version of the tykky tool available on the CSC Puhti and Mahti clusters.
  - The cotainr tool from DeiC is also available though (see the LUMI Software Library page), but we are waiting for them to test it on newer versions of the Cray PE.

- What does RPM stand for?

  Answer: RPM stands for Red Hat Package Manager. It is a popular tool for distributing Linux software as binaries for direct installation in the OS image.
- If I installed software and prepared a module file by myself, is there a place where I can contribute my module to the LUMI user community? It may save time when a new LUMI user is struggling to install the exact same software.

  Answer: Yes, of course. We have a GitHub repository for that: https://github.com/Lumi-supercomputer/LUMI-EasyBuild-contrib/. Just create a pull request there and we will have a look and merge it.
- What do the options `--try-toolchain-version=22.08 --robot` help us with?

  Answer: These options should work as in regular EasyBuild (and the `-r` I used is the abbreviation of `--robot`). In fact, because I want a reproducible build environment, most EasyConfigs that we offer also use the buildtools module, but as its version I use the `toolchain_version` variable so that this dependency is also adapted automatically. We bundle several build tools not only to have fewer modules, but also so that fewer dependency versions have to be adapted...
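  As an illustration, rebuilding an existing recipe for another toolchain version could look roughly like this (the LUMI stack version and the EasyConfig name are placeholders, not from the session):

  ```
  module load LUMI/22.08 partition/C EasyBuild-user
  # --try-toolchain-version rebuilds the recipe against the 22.08 toolchain;
  # -r / --robot also resolves and builds missing dependencies
  eb my-software-1.0-cpeGNU-21.12.eb --try-toolchain-version=22.08 -r
  ```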
- Which changes were made in `lumi-container-wrapper` compared to `tykky`?

  Answer: It uses a different base container better suited for LUMI, and for Python it is configured to build upon cray-python.
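  A minimal usage sketch, assuming the tool offers the same tykky-style `conda-containerize` command, with a hypothetical environment file and install path:

  ```
  module load LUMI/22.08 lumi-container-wrapper
  # Build a containerized conda environment in the project directory
  conda-containerize new --prefix /project/project_465000522/my-conda-env my-env.yml
  # Afterwards, put the generated wrapper scripts on the PATH
  export PATH="/project/project_465000522/my-conda-env/bin:$PATH"
  ```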
- For how long will this project (project_465000522) be available?

  Answer: It will stay active for the next two days (it terminates on May 11th).
- I have installed CP2K with EasyBuild, but I am not able to run any programs: runs stop partway through and the error file shows "Segmentation fault - invalid memory reference". I cannot figure out how to solve this. Any help would be appreciated.

  Answer: There are too few details to say a lot. You may have made errors during installation, though most such errors would lead to different failures. You may also have hit a bug in CP2K, or a bug in the programming environment. We have several users who have successfully used our recipes, but they may have been trying different things with CP2K. Our configuration for CP2K (at least the CPU version) is also based on the configuration used on a similar Cray system at CSCS, which is one of the machines used by the CP2K developers...

  This is also the type of problem that is better handled through a ticket, though it is unclear from the description whether we can do a lot about it.
## Exercise session 1
- Lmod shows "LUMI/23.03 partition/D" as the only available prerequisite with LUMI/23.03 for the VNC module. Yet, despite this, I am able to do:

  ```
  module load LUMI/23.03 partition/L
  module load lumi-vnc
  ```

  Does this mean that Lmod is just not able to show all of the possible prerequisite combinations? Or will `lumi-vnc` turn out not to work correctly with `partition/L`?

  Answer: `partition/D` should not even be shown; that is a bug in the software installation that I need to correct. But a nice find. The output is confusing because Lmod gets confused by some hidden software stacks and partitions. It actually also shows a line `init-lumi/0.2`, which is a module that you have loaded automatically at login. This in turn means that having this module is enough to use `lumi-vnc`, i.e., it works everywhere, even without any `LUMI` stack or `CrayEnv`.
- In general, if I set `EBU_USER_PREFIX` to a project directory, do I also need to set another one for scratch?

  Answer: No, it points to the root of the software installation, which is one clear place. The reason to install in `/project` and not in `/scratch` is that you want the software installation to be available for your whole project. If you check the disk use policies in the storage section of the documentation, you will see that data on `/scratch` is not meant to stay there for the whole project but can be erased after 90 days, which is not what you want for your software installation.
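  For example, a typical setting would be (the path is this course's project; the `EasyBuild` subdirectory name is just a common convention):

  ```
  export EBU_USER_PREFIX=/project/project_465000522/EasyBuild
  ```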
## Running jobs
- Is it possible to get an exemption from the maximum 2-day allocation?

  Answer: No, no exceptions are made at all. We don't want nodes to be monopolized by a single user, and long jobs make maintenance more difficult. Moreover, given the intrinsic instability of large clusters, it is essential that jobs use codes that store intermediate states from which they can be restarted.

  Users have been using dependent jobs with success to automatically start the follow-on job after the previous one ends.
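  For example, a follow-on job can be chained with Slurm's `--dependency` option (the script name and job ID below are illustrative):

  ```
  $ sbatch job.sh
  Submitted batch job 123456
  $ sbatch --dependency=afterok:123456 job.sh   # starts only after job 123456 finished successfully
  ```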
- Is it not possible to use the `singularity build` command?

  Answer: Not all options of `singularity build` work. Any build requiring fakeroot will fail, as that is disabled due to security concerns.
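  For instance, builds that only convert or pull an existing image do not need fakeroot and should still work (the image names are just examples):

  ```
  singularity build my-image.sif docker://ubuntu:22.04
  ```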
- Does this (binding tasks to resources) mean that it will be necessary to use custom bindings to align tasks with specific NUMA domains on each CPU? (NUMA domains seem to be a level in between cores and sockets.)

  Answer: If you are using all cores on an exclusive node, the standard ways in which Slurm distributes processes and threads may do just what you want.

  Even if, e.g., it turned out to be better to use only 75% of the cores and you wanted 16 MPI processes with 6 threads per process, a creative solution is to ask Slurm for 8 cores per task and then set `OMP_NUM_THREADS=6` so that only 6 threads are started (see the sketch below). There are often creative solutions like this.

  Some clusters redefine the socket to coincide with the NUMA domain, but it looks like this is not done on LUMI.
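  A minimal sketch of that trick on a 128-core LUMI-C node (the partition and executable name are placeholders):

  ```
  #!/bin/bash
  #SBATCH --partition=standard
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=16
  #SBATCH --cpus-per-task=8    # reserve a full 8-core block per task
  #SBATCH --exclusive

  export OMP_NUM_THREADS=6     # but start only 6 threads per task, i.e., 75% of the cores do work
  srun ./my_hybrid_app         # placeholder for your MPI+OpenMP executable
  ```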
- Where do we look up the specific NUMA domain / GPU correspondence? In the LUMI documentation, or perhaps with a command on LUMI?

  Answer:

  - See the image on the LUMI-G hardware documentation page. It presents the numbering from the point of view of the GPUs and the CPU.
  - See also `rocm-smi --showtoponuma` (on a LUMI-G node).
- If we enable hardware threads for a job/allocation, is `--cpus-per-task` then counted in hardware threads?

  Answer: Yes:

  ```
  # 2 HWT per core, all cores allocated
  $ srun -pstandard --cpus-per-task=256 --hint=multithread --pty bash -c 'taskset -c -p $$'
  pid 161159's current affinity list: 0-255

  # 2 HWT per core but only the first 64 cores allocated
  $ srun -pstandard --cpus-per-task=128 --hint=multithread --pty bash -c 'taskset -c -p $$'
  pid 161411's current affinity list: 0-63,128-191
  ```
- Is it possible to make sure that a job requesting 16 cores is allocated all cores in one NUMA domain?

  Answer:

  - Is your question about sub-node allocations (the small and small-g partitions)?
    - Yes, though the question is not relevant for production runs; it was only out of interest. It is something to be aware of during scaling tests, for example.
    - Our advice to users in our local compute centre who have to do scaling tests for a proposal for time on LUMI (or our Tier-1 systems) is to use exclusive nodes to avoid surprises and reduce randomness. The most objective way to test on 16 cores is probably to run 8 such tests next to one another to fill up the node. There is another issue as well: I haven't tried whether the Slurm options to fix the clock speed work on LUMI, but depending on the other work going on in a socket, the clock speed of the cores may vary.
  - I doubt there is a way to do that for the sub-node allocation partitions; I can't find one that works at the moment. Binding really only works well on job-exclusive nodes. For me this is a shortcoming of Slurm, as it doesn't have enough levels in its hierarchy for modern clusters.
  - For a sub-node allocation, you will get random cores depending on which cores are available:

    ```
    $ srun -psmall --cpus-per-task=16 --pty bash -c 'taskset -c -p $$'
    pid 46818's current affinity list: 44-46,50,52-63

    $ srun -pstandard --cpus-per-task=16 --pty bash -c 'taskset -c -p $$'
    pid 220496's current affinity list: 0-15
    ```
## Storage
- To access /scratch/ from a container, we have to mount it. However, we need the full path and not just the symlink. Where do we find the full path?

  Answer: You can use `file /scratch/project_465000522`, for example. Don't try to mount the whole of scratch; that will not work. The `project_*` subdirectories in `/scratch` are distributed across the 4 file systems of LUMI-P. `ls -l /scratch/project_465000522` will actually show you which file system is serving that project's scratch directory.
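  As an illustration, the symlink can be resolved and the real path bind-mounted roughly like this (the resolved path and the container image are examples only; the actual file system differs per project):

  ```
  $ realpath /scratch/project_465000522
  /pfs/lustrep2/scratch/project_465000522
  $ singularity exec --bind /pfs/lustrep2/scratch/project_465000522 my-image.sif \
      ls /pfs/lustrep2/scratch/project_465000522
  ```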
The documentation page states that "Automatic cleaning of project scratch and fast storage is not active at the moment". Is this still true, and will the users be informed if this changes?
Answer:
-
This is still true today and usually users are informed about changes but with short notice. Quota were also disabled for a while due to problems after an upgrade last July but solving those problems became a higher priority when abuse was noticed, and there was only 14 days notice. So abusing scratch and flash for long-term storage is asking for increasing the priority of that part of the LUMI setup working... I'd say, don't count on it as the message may arrive as well when you are on a holiday.
-
For slightly longer time storage but still limited to the lifetime of your project there is also the object storage. However, at the moment only rather basic tools to use that storage are already available.
-
-
- What is the preferred way of transferring data between LUMI and an external server or another supercomputer, e.g., CSC Mahti?

  Answer:

  - Mahti is so close to LUMI (as far as I know even in the same data centre, but in a different hall) that connection latency should not limit bandwidth, so you can just use sftp.

    I believe the CSC documentation also contains information on how to access the Allas object storage from LUMI. Using Allas as an intermediate system is also an option, as is the LUMI-O storage, but at the moment Allas is both more developed and better documented. (For readers: Allas is a CSC object storage system only available to users who are CSC clients, not to other users of LUMI.)
  - For supercomputers "farther away" from LUMI, where bandwidth is a problem when using sftp, the LUMI-O object storage looks like a solution, as the tools that read from the object storage use so-called "multi-stream transport" and can therefore deal better with high-latency connections. The documentation on how to access the LUMI object storage from elsewhere still needs work, though.
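  For example, a direct transfer from Mahti with sftp could look like this (the user name and file paths are placeholders; lumi.csc.fi is the LUMI login address):

  ```
  # run on Mahti
  sftp yourusername@lumi.csc.fi
  sftp> put results.tar.gz /scratch/project_465000522/
  ```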