Hands-on: Advancing your own project and general Q&A¶
Bring your own AI code that you want to run on LUMI and spend some time applying what you have learned during the workshop, with on-site support from LUST/AMD.
Closing remarks¶
Extra materials¶
General Q&A¶
-
Is there a prettier way to activate conda? Right now I am using `lumi-pytorch-rocm-6.2.3-python-3.12-pytorch-v2.5.1.sif`. I need to activate conda to have access to the PyTorch modules, so the current hack is to put this in my `.sh` file:

```
srun singularity exec $CONTAINER \
    bash -c "\$WITH_CONDA; python3 $PYTHON_FILE --max-samples=256"
```

Is there a better/prettier way to activate conda so I don't have to put all arguments inside the quotes and the bash call?

-
There is a trick and we use it in the EasyBuild modules that we provide for individual containers. The conda activate script just sets a number of environment variables so you can just set those from outside the container. The one tricky one, but one that is not really needed, is to adapt the prompt to show that you're in the virtual environment.
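A minimal job-script sketch of that approach. The conda prefix below is an assumption, not guaranteed for your container; inspect what the activation command actually does first, e.g. with `singularity exec $CONTAINER bash -c 'echo "$WITH_CONDA"'`:

```shell
#!/bin/bash
# Sketch: replicate the conda activation from outside the container by
# setting environment variables before calling singularity.
# ASSUMPTION: the conda environment lives at /opt/miniconda3/envs/pytorch
# inside the container; check yours by echoing $WITH_CONDA as shown above.
CONDA_ENV_PREFIX=/opt/miniconda3/envs/pytorch

# Singularity/Apptainer forwards SINGULARITYENV_* variables into the
# container; PREPEND_PATH is prepended to PATH inside it, so the
# environment's python3 is found without sourcing the activate script.
export SINGULARITYENV_PREPEND_PATH=$CONDA_ENV_PREFIX/bin

# No bash -c wrapper and no $WITH_CONDA needed any more:
srun singularity exec $CONTAINER python3 $PYTHON_FILE --max-samples=256
```

This only covers the `PATH` part of the activation; if your code relies on other variables the activate script sets (such as `CONDA_PREFIX`), export those the same way with a `SINGULARITYENV_` prefix.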
It has been a continuous discussion in LUST whether we should expose the container as much as possible as a container (the `singularity-AI-bindings` module combined with the manual `$WITH_CONDA` is a result of that), or try to hide as much of the complexity as we can, which is what we do with the individual modules for each container. In fact, what `$WITH_CONDA` does is not the same for all containers (if only because the name of the conda environment is not always the same), so that cannot be done in a module as generic as `singularity-AI-bindings`. Those EasyBuild modules are not perfect and not all modules implement all functionality, basically because our time is also finite and we don't want to invest time in things that we don't know will be used. But we're always open to further develop them on request.
With some of the newer PyTorch EasyBuild containers, your command would become
```
srun singularity exec $CONTAINER \
    python3 $PYTHON_FILE --max-samples=256
```
-
-
For my machine learning model, I need to use the mpi4py library. What is the best way to install mpi4py on LUMI? I tried following the instructions from the slides, but it does not seem to be working for me.
-
Installing `mpi4py` requires a C compilation step, which appears to fail because it doesn't find the proper compilers in the containers. There is a base container available that already includes `mpi4py`: `/appl/local/containers/sif-images/lumi-mpi4py-rocm-6.2.0-python-3.12-mpi4py-3.1.6.sif`. Could you try extending that for your needs? If you use `cotainr build` you can replace the `--system=lumi-g` flag with `--base-image=/appl/local/containers/sif-images/lumi-mpi4py-rocm-6.2.0-python-3.12-mpi4py-3.1.6.sif`. - Lukas

-
Nevermind, the mpi4py container doesn't actually work with cotainr :( - Lukas
-
-
Can you provide an example command please to monitor the GPU utilization for multi-node runs?
- You can do the following: check from `squeue --me` which nodes the job is running on (the node ids are listed in the last column). Then you can use the command `srun --overlap --jobid <jobid> -w <nodeid> rocm-smi` to run `rocm-smi` on any of the nodes to check utilisation. Note that this will give you only an instantaneous snapshot of the `rocm-smi` output. You can use `srun --overlap --jobid <jobid> -w <nodeid> --pty bash` to open a terminal on the node and use `watch rocm-smi` to continuously monitor the GPU usage. Unfortunately there is no handy tool to monitor GPU utilization across all nodes at once. See also this page in the LUMI documentation. - Lukas
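Those per-node steps can be combined into a small loop that takes one `rocm-smi` snapshot on every node of a running job. This is a sketch for a Slurm system like LUMI; it assumes `scontrol show hostnames` is available to expand the compressed node list, and `gpu-snapshot.sh` is just a name chosen here:

```shell
#!/bin/bash
# gpu-snapshot.sh (sketch): one rocm-smi snapshot per node of a running job.
# Usage: ./gpu-snapshot.sh <jobid>
jobid=$1

# squeue prints the compressed node list (e.g. nid[001234-001237]);
# scontrol show hostnames expands it to one hostname per line.
nodes=$(squeue -j "$jobid" -h -o %N | xargs scontrol show hostnames)

for node in $nodes; do
    echo "=== $node ==="
    # --overlap lets this step share the nodes with the job's own steps.
    srun --overlap --jobid "$jobid" -w "$node" -N1 -n1 rocm-smi
done
```

For continuous monitoring you would still open a terminal on one node (`--pty bash` plus `watch rocm-smi`, as described above); this loop only gives a point-in-time view across all nodes.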