PyTorch
License information
The PyTorch license can be found in the LICENSE file in the PyTorch GitHub.
Note however that in order to use PyTorch you will also be using several other packages that have different licenses.
User documentation (user installation)
We used to provide an EasyBuild recipe to install PyTorch on top of Cray Python. However, as Python packages tend to put a heavy strain on the file system, installing Python packages in a container is the preferred way. It also takes away the strain of trying to get PyTorch to talk to a proper version of the AWS OFI RCCL plugin, which is needed for proper communication on the Slingshot 11 interconnect of LUMI.
We now provide prebuilt singularity containers with an EasyBuild-generated module around them that eases working with those containers. Their use is documented in the next section, "User documentation (singularity container)", while the user-installable EasyBuild recipes for each container can be found in the "Singularity containers with modules for binding and extras" section.
User documentation (singularity container)
BETA VERSION, problems may occur and may not be solved quickly.
The containers that are provided by the LUMI User Support Team can be used in two possible ways:

- Through the EasyBuild-generated modules and wrapper scripts discussed in the "Module and wrapper scripts" section below.
- Directly, with you taking care of all bindings and all necessary environment variables, as discussed in the "Alternative: Direct access" section.

These instructions will likely also work for the containers built on top of the ROCm containers with cotainr. Containers with PyTorch provided in local software stacks (e.g., the CSC software stack) may be built differently with different wrapper scripts, so the instructions on this page may not apply to those.
Module and wrapper scripts
The PyTorch container is developed by AMD specifically for LUMI and contains the necessary parts to run PyTorch on LUMI, including the plugin needed for RCCL when doing distributed AI, and a suitable version of ROCm for the version of PyTorch. The apex, torchvision, torchdata, torchtext and torchaudio packages are also included.
The EasyBuild installation with the EasyConfigs mentioned below will do three or four things:

- It will copy the container to your own EasyBuild software installation space. We realise containers can be big, but it ensures that you have complete control over when a container is removed.

  We will remove a container from the system when it is not sufficiently functional anymore, but the container may still work for you. E.g., after an upgrade of the network drivers on LUMI, the RCCL plugin for the LUMI Slingshot interconnect may be broken, but if you run on only one node PyTorch may still work for you.

  If you prefer to use the centrally provided container, you can remove your copy after loading of the module with

      rm $SIF

  followed by reloading the module. This is however at your own risk.

- It will create a module file. When loading the module, a number of environment variables will be set to help you use the module and to make it easy to swap the module with a different version in your job scripts.

  - SIF and SIFPYTORCH both contain the name and full path of the singularity container file.
  - SINGULARITY_BINDPATH will mount all necessary directories from the system, including everything that is needed to access the project, scratch and flash file systems.
  - RUNSCRIPTS and RUNSCRIPTSPYTORCH contain the full path of the directory containing some sample run scripts that can be used to run software in the container, or as inspiration for your own variants.

  Container modules installed after March 9, 2024 also define SINGULARITYENV_PREPEND_PATH in a way that ensures that the /runscripts subdirectory in the container will be in the search path in the container.

  The containers with support for a virtual environment (from 20240315 on) define a few other SINGULARITYENV_* environment variables that inject environment variables in the container that are equivalent to those created by the activate scripts for the Conda environment and the Python virtual environment.

- It creates 3 scripts in the $RUNSCRIPTS directory:

  - conda-python-simple: This initialises Python in the container and then calls Python with the arguments of conda-python-simple. It can be used, e.g., to run commands through Python that utilise a single task but all GPUs.
  - conda-python-distributed: Model script that initialises Python in the container and also creates the environment to run a distributed PyTorch session. At the end, it will call Python with the arguments of the conda-python-distributed command.
  - get-master: A helper command for conda-python-distributed.

  These scripts are available in the container in the /runscripts subdirectory but can also be reached with their full path name, and can be inspected outside the container in the $RUNSCRIPTS subdirectory. Those scripts don't cover all use cases for PyTorch on LUMI, but can be used as a source of inspiration for your own scripts.

- For the containers with support for virtual environments (from 20240315 on), it also creates a number of commands intended to be used outside the container:

  - start-shell: To start a bash shell in the container. Arguments can be used to, e.g., tell it to start a command. Without arguments, the conda and Python virtual environments will be initialised, but this is not the case as soon as arguments are used. It takes the command line arguments that bash can also take.
  - make-squashfs: Make the user-software.squashfs file that would then be mounted in the container after reloading the module. This will enhance performance if the extra installation in user-software contains a lot of files.
  - unmake-squashfs: Unpack the user-software.squashfs file into the user-software subdirectory of $CONTAINERROOT to enable installing additional packages.
The container uses a miniconda environment in which Python and its packages are installed.
That environment needs to be activated in the container when running, which can be done
with the command that is available in the container as the environment variable
WITH_CONDA
(which for this container is
source /opt/miniconda3/bin/activate pytorch
).
From the 20240315 version onwards, EasyBuild will already initialise the Python virtual
environment pytorch
. Inside the container, the virtual environment is available in
/user-software/venv
while outside the container the files can be found in
$CONTAINERROOT/user-software/venv
(if this directory has not been removed after creating
a SquashFS file from it for better file system performance). You can also use the
/user-software
subdirectory in the container to install other software through other methods.
Examples with the wrapper scripts
Note: In the examples below you may need to replace the standard-g partition with a different Slurm partition allocatable per node if your user category has no access to standard-g.
List the Python packages in the container
Containers up to and including the 20240209 ones
For the containers up to and including the 20240209 ones, this example also illustrates how the WITH_CONDA environment variable should be used.
The example can be run in an interactive session and works even on the login nodes.
In these containers, the Python packages can be listed using the following steps: First execute, e.g.,
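module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240209
singularity shell $SIF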
which takes you into the container, and then, at the Singularity> prompt, execute:
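$WITH_CONDA
pip list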
The same can be done without opening an interactive session in the container with
module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240209
singularity exec $SIF bash -c '$WITH_CONDA ; pip list'
Notice the use of single quotes as with double quotes $WITH_CONDA
would be expanded
by the shell before executing the singularity command, and at that time WITH_CONDA
is
not yet defined. To use the container it also doesn't matter which version of the
LUMI module is loaded, and in fact, loading CrayEnv would work as well.
Containers from 20240315 on
For the containers from version 20240315 on, the $WITH_CONDA
is no longer needed.
In an interactive session, you still need to load the module and go into the container:
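module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315
singularity shell $SIF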
but once in the container, at the Singularity>
prompt, all that is needed is
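pip list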
Without an interactive session in the container, all that is now needed is
module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315
singularity exec $SIF pip list
as the pip
command is already in the search path.
Executing Python code in the container (single task)
Containers up to and including the 20240209 ones
The wrapper script conda-python-simple, which can be found in the /runscripts directory in the container, takes care of initialising the Conda environment and then passes its arguments to the python command. E.g., the example below will import the torch package in Python and then show the number of GPUs available to it:
salloc -N1 -pstandard-g -t 30:00
module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240209
srun -N1 -n1 --gpus 8 singularity exec $SIF conda-python-simple \
-c 'import torch; print("I have this many devices:", torch.cuda.device_count())'
exit
This command will start Python and run PyTorch on a single CPU core with access to all 8 GPUs.
Container modules installed before March 9, 2024
In these versions of the container module, conda-python-simple
is not yet in
the search path for executables, and you need to modify the job script to use
/runscripts/conda-python-simple
instead.
Containers from 20240315 on
As the Conda environment and Python virtual environment are properly initialised by the
module, the conda-python-simple
script is not even needed anymore (though still provided
for compatibility with job scripts developed before those containers became available).
The following commands now work just as well:
salloc -N1 -pstandard-g -t 30:00
module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315
srun -N1 -n1 --gpus 8 singularity exec $SIF python \
-c 'import torch; print("I have this many devices:", torch.cuda.device_count())'
exit
Distributed learning example
The communication between LUMI's GPUs during training with PyTorch is done via
RCCL, which is a library of collective
communication routines for AMD GPUs. RCCL works out of the box on LUMI, however,
a special plugin is required so it can take advantage of the Slingshot 11 interconnect.
That's the aws-ofi-rccl
plugin,
which is a library that can be used as a back-end for RCCL to interact with the interconnect
via libfabric. The plugin is already built in the containers that we provide here.
A proper distributed learning run does require setting some environment variables.
You can find out more by checking the scripts in $EBROOTPYTORCH/runscripts
(after
installing and loading the module), and in particular the
conda-python-distributed
script and the get-master
script used by the former.
Together these scripts make job scripts a lot easier.
An example job script using the mnist example (itself based on an example by Google) is:
-  The mnist example needs some data files. We can get them in the job script (as we did before) but also simply install them now, avoiding repeated downloads when using the script multiple times (in the example with wrappers it was in the job script to have a one file example). First create a directory for your work on this example and go into that directory. In that directory we'll create a subdirectory mnist with some files. The first run of the jobscript will download even more files. Assuming you are working on the login nodes where the wget program is already available,

       mkdir mnist ; pushd mnist
       wget https://raw.githubusercontent.com/Lumi-supercomputer/lumi-reframe-tests/main/checks/containers/ML_containers/src/pytorch/mnist/mnist_DDP.py
       mkdir -p model ; cd model
       wget https://github.com/Lumi-supercomputer/lumi-reframe-tests/raw/main/checks/containers/ML_containers/src/pytorch/mnist/model/model_gpu.dat
       popd

   will fetch the two files we need to start.
-  We can now create the jobscript mnist.slurm:

       #!/bin/bash -e
       #SBATCH --nodes=4
       #SBATCH --gpus-per-node=8
       #SBATCH --tasks-per-node=8
       #SBATCH --cpus-per-task=7
       #SBATCH --output="output_%x_%j.txt"
       #SBATCH --partition=standard-g
       #SBATCH --mem=480G
       #SBATCH --time=00:10:00
       #SBATCH --account=project_<your_project_id>

       module load LUMI  # Which version doesn't matter, it is only to get the container.
       module load PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315

       # Optional: Inject the environment variables for NCCL debugging into the container.
       # This will produce a lot of debug output!
       export SINGULARITYENV_NCCL_DEBUG=INFO
       export SINGULARITYENV_NCCL_DEBUG_SUBSYS=INIT,COLL

       c=fe
       MYMASKS="0x${c}000000000000,0x${c}00000000000000,0x${c}0000,0x${c}000000,0x${c},0x${c}00,0x${c}00000000,0x${c}0000000000"

       cd mnist

       srun --cpu-bind=mask_cpu:$MYMASKS \
         singularity exec $SIFPYTORCH \
         conda-python-distributed -u mnist_DDP.py --gpu --modelpath model
Container modules installed before March 9, 2024

In these versions of the container module, conda-python-distributed is not yet in the search path for executables, and you need to modify the job script to use /runscripts/conda-python-distributed instead.

We use a CPU mask to ensure a proper mapping of CPU chiplets onto GPU chiplets. The GPUs are used in the regular ordering, so we reorder the CPU cores for each task so that the first task on a node gets the cores closest to GPU 0, etc.

The jobscript also shows how environment variables to enable debugging of the RCCL communication can be set outside the container. Basically, if the name of an environment variable is prepended with SINGULARITYENV_, it will be injected in the container by the singularity command.
Inside the conda-python-distributed
script (if you need to modify things)
#!/bin/bash -e
# Make sure GPUs are up
if [ $SLURM_LOCALID -eq 0 ] ; then
rocm-smi
fi
sleep 2
# MIOPEN needs some initialisation for the cache as the default location
# does not work on LUMI as Lustre does not provide the necessary features.
export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-$SLURM_NODEID"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH
if [ $SLURM_LOCALID -eq 0 ] ; then
rm -rf $MIOPEN_USER_DB_PATH
mkdir -p $MIOPEN_USER_DB_PATH
fi
sleep 2
# Set interfaces to be used by RCCL.
# This is needed as otherwise RCCL tries to use a network interface it has
# no access to on LUMI.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=3
# Set ROCR_VISIBLE_DEVICES so that each task uses the proper GPU
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
# Report affinity to check
echo "Rank $SLURM_PROCID --> $(taskset -p $$); GPU $ROCR_VISIBLE_DEVICES"
# The usual PyTorch initialisations (also needed on NVIDIA)
# Note that since we fix the port ID it is not possible to run, e.g., two
# instances via this script using half a node each.
export MASTER_ADDR=$(/runscripts/get-master "$SLURM_NODELIST")
export MASTER_PORT=29500
export WORLD_SIZE=$SLURM_NPROCS
export RANK=$SLURM_PROCID
# Run application
python "$@"
The script sets a number of environment variables. Some are fairly standard when using PyTorch on an HPC cluster while others are specific for the LUMI interconnect and architecture or the AMD ROCm environment.
The MIOPEN_
environment variables are needed to make
MIOpen create its caches on /tmp
as doing this on Lustre fails because of file locking issues:
export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-$SLURM_NODEID"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH
if [ $SLURM_LOCALID -eq 0 ] ; then
rm -rf $MIOPEN_USER_DB_PATH
mkdir -p $MIOPEN_USER_DB_PATH
fi
It is also essential to tell RCCL, the communication library, which network adapters to use.
These environment variables start with NCCL_
because ROCm tries to keep things as similar as
possible to NCCL in the NVIDIA ecosystem:
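export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=3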
Without this RCCL may try to use a network adapter meant for system management rather than inter-node communications!
We also set ROCR_VISIBLE_DEVICES
to ensure that each task uses the proper GPU.
Furthermore some environment variables are needed by PyTorch itself that are also needed on NVIDIA systems.
PyTorch needs to find the master for communication which is done through
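export MASTER_ADDR=$(/runscripts/get-master "$SLURM_NODELIST")
export MASTER_PORT=29500
export WORLD_SIZE=$SLURM_NPROCS
export RANK=$SLURM_PROCID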
The get-master script that is used here is a Python script to determine the master node for communication; it is already provided in the /runscripts subdirectory in the container (or $RUNSCRIPTS outside the container).

As we fix the port number here, the conda-python-distributed script that we provide has to run on exclusive nodes.
Running, e.g., 2 4-GPU jobs on the same node with this command will not work as there will be
a conflict for the TCP port for communication on the master as MASTER_PORT
is hard-coded in
this version of the script.
Installation with EasyBuild
To install the container with EasyBuild, follow the instructions in the
EasyBuild section of the LUMI documentation, section "Software",
and use the dummy partition container
, e.g.:
module load LUMI partition/container EasyBuild-user
eb PyTorch-2.2.0-rocm-5.6.1-python-3.10-singularity-20240315.eb
To use the container after installation, the EasyBuild-user
module is not needed nor
is the container
partition. The module will be available in all versions of the LUMI stack
and in the CrayEnv
stack
(provided the environment variable EBU_USER_PREFIX
points to the right location).
After loading the module, the docker definition file used when building the container
is available in the $EBROOTPYTORCH/share/docker-defs
subdirectory (but not for all
versions). As it requires some
licensed components from LUMI and some other files that are not included, it currently
cannot be used to reconstruct the container and extend its definition.
Extending the containers with virtual environment support
This text is for containers from 20240315 on. Other containers can be extended with virtual environments also but you'll have to do a lot more work by hand that is now done by the module, or adapt the EasyConfig for those based on what is in the more recent EasyConfigs.
Manual procedure
Let's demonstrate how the environment in the container can be extended by using pip to install additional packages in the virtual environment. We'll use the PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315 module, assuming that you have already installed it:
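module load LUMI PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315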
Let's check a directory outside the container:
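ls -l $CONTAINERROOT/user-software/venv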
which produces something along the lines of
drwxrwsr-x 2 username project_46XYYYYYY 4096 Mar 25 17:15 bin
drwxrwsr-x 2 username project_46XYYYYYY 4096 Mar 25 17:14 include
drwxrwsr-x 3 username project_46XYYYYYY 4096 Mar 25 17:14 lib
lrwxrwxrwx 1 username project_46XYYYYYY 3 Mar 25 17:14 lib64 -> lib
-rw-rw-r-- 1 username project_46XYYYYYY 94 Mar 25 17:15 pyvenv.cfg
The output is typical for a freshly initialised Python virtual environment.
We can now enter the container:
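singularity shell $SIF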
At the singularity prompt, try
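ls -l /user-software/venv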
and notice that we get the same output as with the previous ls command that we executed outside the container. So the $CONTAINERROOT/user-software subdirectory is available in the container as /user-software.
Checking which python executable is found first (e.g., with which python) also shows that the virtual environment is already activated: we get the python wrapper script from the virtual environment and not the system python3 (there is a python3 executable in /usr/bin) or the Conda python in /opt/miniconda3/envs/pytorch/bin.
Let us install the torchmetrics
package using pip
:
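pip install torchmetrics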
To check if the package is present and can be loaded, try
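python -c 'import torchmetrics ; print(torchmetrics.__version__)'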
and notice that it does print the version number of torchmetrics
, so the package was
successfully loaded.
Now execute
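ls /user-software/venv/lib/python3.10/site-packages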
and you'll get output similar to
_distutils_hack pkg_resources
distutils-precedence.pth setuptools
lightning_utilities setuptools-65.5.0.dist-info
lightning_utilities-0.11.1.dist-info torchmetrics
pip torchmetrics-1.3.2.dist-info
pip-23.0.1.dist-info
which confirms that the torchmetrics
package is indeed installed in the virtual environment.
Let's leave the container (by executing the exit
command) and check again what has happened outside
the container:
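ls $CONTAINERROOT/user-software/venv/lib/python3.10/site-packages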
and we get the same output as with the previous ls command, i.e., the installation files of the package are indeed stored outside the container.
Now there is one remaining problem. Try
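lfs find $CONTAINERROOT/user-software | wc -l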
where lfs find
is a version of the find
command with some restrictions, but one that is a lot more
friendly to the Lustre metadata servers. The output suggests that there are over 2300 files and directories
in the user-software
subdirectory. The Lustre filesystem doesn't like working with lots of small files
and Python can sometimes open a lot of those files in a short amount of time.
The module also provides a solution for this: The content of $CONTAINERROOT/user-software
can be packed
in a single SquashFS file $CONTAINERROOT/user-software.squashfs
and after reloading the PyTorch
module that
is being used, that file will be mounted in the container and provide /user-software
. This may improve
performance of Python in the container and is certainly appreciated by your fellow LUMI users.
To this end, the module provides the make-squashfs
script. Try
The second command outputs something along the lines of
bin
easybuild
lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.2.0-dockerhash-7392c9d4dcf7.sif
runscripts
user-software
user-software.squashfs
so we see that there is now indeed a file user-software.squashfs
in that subdirectory.
We do not automatically delete the user-software
subdirectory, but you can delete it safely using
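rm -rf $CONTAINERROOT/user-software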
as it can be reconstructed (except for the file dates) from the SquashFS file using the script
unmake-squashfs
which is also provided by the module.
Reload the module to let the changes take effect and go again in the container:
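module unload PyTorch
module load PyTorch/2.2.0-rocm-5.6.1-python-3.10-singularity-20240315
singularity shell $SIF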
Now try
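touch /user-software/test-file   # an illustrative write attempt; any write into /user-software will now fail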
and notice that we can no longer write in /user-software
.
Installing further packages with pip
would not fail, but they would not be installed where you expect and instead
would be installed in your home directory. The pip
command would warn with
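Defaulting to user installation because normal site-packages is not writeable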
Try, e.g.,
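pip install pytorch-lightning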
and notice that the package (likely) landed in ~/.local/lib/python3.10/site-packages
:
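ls ~/.local/lib/python3.10/site-packages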
will among other subdirectories contain the subdirectory pytorch_lightning
and this is
not entirely what we want.
Yet it is still possible to install additional packages by first unsquashing the user-software.squashfs
file
with
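unmake-squashfs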
(assuming that you had removed the $CONTAINERROOT/user-software
subdirectory before),
then deleting the SquashFS file:
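rm $CONTAINERROOT/user-software.squashfs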
and reload the module. Make sure though that you first remove the packages that were accidentally installed
in ~/.local
.
One big warning is needed here though: If you do a complete re-install of the module with EasyBuild,
everything in the installation directory is erased, including your own installation. So just to make sure,
you may want to keep a copy of the user-software.squashfs
file elsewhere.
Automation of the procedure
Try this procedure preferably from a directory that doesn't contain too many files or subdirectories as that may slow down EasyBuild considerably.
In some cases it is possible to adapt the EasyConfig file to also install the additional Python packages
that are not yet included in the container. This is demonstrated in the
PyTorch-2.2.0-rocm-5.6.1-python-3.10-singularity-exampleVenv-20240315.eb
example EasyConfig file
which is available on LUMI. First load EasyBuild to install containers, e.g.,
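module load LUMI partition/container EasyBuild-user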
and then we can use EasyBuild to copy the recipe to our current directory:
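eb --copy-ec PyTorch-2.2.0-rocm-5.6.1-python-3.10-singularity-exampleVenv-20240315.eb .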
You can now inspect the .eb
file with your favourite editor. This file basically defines a lot
of Python variables that EasyBuild uses, but is also a small program so we can even define and use
extra variables that EasyBuild does not know. The magic happens in two blocks.
First,
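# Sketch of the variable in the example EasyConfig; the actual file on LUMI lists
# the packages it installs (torchmetrics and pytorch-lightning in this example).
local_pip_requirements = """torchmetrics
pytorch-lightning

"""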
(with an empty line at the end) defines the content that we will put in a requirements.txt
file to
tell pip
which packages we want to install.
The second part of the magic happens in some lines in the postinstallcmds
block, a list of commands
that EasyBuild will execute after the default installation procedure (which only copies the container
.sif
file to its location). Four lines in particular perform the magic:
f'cat >%(installdir)s/user-software/venv/requirements.txt <<EOF {local_pip_requirements}EOF',
f'singularity exec --bind {local_singularity_bind} --bind %(installdir)s/user-software:/user-software %(installdir)s/{local_sif} bash -c \'source /runscripts/init-conda-venv ; cd /user-software/venv ; pip install -r requirements.txt\'',
'%(installdir)s/bin/make-squashfs',
'/bin/rm -rf %(installdir)s/user-software',
The first line creates the requirements.txt
file from the local_pip_requirements
variable that we have
created. The way to do this is a bit awkward by creating a shell command from it, but it works in most cases.
The second line then calls pip install
in the singularity container. At this point there is no module yet
so we need to do all bindings by hand and use variables that are known to EasyBuild.
The third line then creates the user-software.squashfs
file and the last line deletes the user-software
subdirectory. These four lines are generic, as the package list is defined via the local_pip_requirements variable.
Alternative: Direct access
Getting the container image
The PyTorch containers are available in the following subdirectories of /appl/local/containers
:
- /appl/local/containers/sif-images: Symbolic link to the latest version of the container with the given mix of components/packages mentioned in the filename. Other packages in the container may vary over time and change without notice.
- /appl/local/containers/tested-containers: Tested containers provided as a Singularity .sif file and a docker-generated tarball. Containers in this directory are removed quickly when a new version becomes available.
- /appl/local/containers/easybuild-sif-images: Singularity .sif images used with the EasyConfigs that we provide. They tend to be available for a longer time than in the other two subdirectories.
If you depend on a particular version of a container, we recommend that you copy the container to
your own file space (e.g., in /project), as there is no guarantee the specific version will remain
available centrally on the system for as long as you want.
When using the containers without the modules, you will have to take care of the bindings as some system files are needed for, e.g., RCCL. The recommended minimal bindings are:
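-B /var/spool/slurmd -B /opt/cray -B /usr/lib64/libcxi.so.1 -B /usr/lib64/libjansson.so.4

(the same bindings as used in the job script example further down)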
and the bindings you need to access the files you want to use from /scratch
, /flash
and/or /project
.
Note that the list of recommended bindings may change after a system update.
Alternatively, you can also build your own container image on top of the ROCm containers that we provide with cotainr.
If you use PyTorch containers from other sources, take into account that
-
They need to explicitly use ROCm-enabled versions of the packages. NVIDIA packages will not work.
-
The RCCL implementation provided in the container will likely not work well with the communication network, and the aws-ofi-rccl plugin will still need to be installed in such a way that the libfabric library on LUMI is used.
-
Similarly the
mpi4py
package (if included) may not be compatible with the interconnect on LUMI, also resulting in poor performance or failure. You may want to make sure that an MPI implementation that is ABI-compatible with Cray MPICH is used so that you can then try to overwrite it with Cray MPICH.
The LUMI User Support Team tries to support the containers that it provides as well as possible, but we are not the PyTorch support team and have limited resources. It is in no way the task of LUST to support any possible container from any possible source. See also the "Software Install Policy" page in the main LUMI documentation.
Example: Distributed learning without the wrappers
For easy comparison, we use the same mnist example already used in the "Distributed learning example" with the wrapper scripts. The text is written in such a way though that it can be read without first reading that section.
-  First one needs to create the script get-master.py that will be used to determine the master node for communication:

       import argparse

       def get_parser():
           parser = argparse.ArgumentParser(description="Extract master node name from Slurm node list",
                                            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
           parser.add_argument("nodelist", help="Slurm nodelist")
           return parser

       if __name__ == '__main__':
           parser = get_parser()
           args = parser.parse_args()

           first_nodelist = args.nodelist.split(',')[0]

           if '[' in first_nodelist:
               a = first_nodelist.split('[')
               first_node = a[0] + a[1].split('-')[0]
           else:
               first_node = first_nodelist

           print(first_node)

-  Next we need another script that will run in the container to set up a number of environment variables that are needed to run PyTorch successfully on LUMI and, at the end, call Python to run our example. Let's store the following script as run-pytorch.sh.

       #!/bin/bash -e

       # Make sure GPUs are up
       if [ $SLURM_LOCALID -eq 0 ] ; then
           rocm-smi
       fi
       sleep 2

       # !Remove this if using an image extended with cotainr or a container from elsewhere.!
       # Start conda environment inside the container
       $WITH_CONDA

       # MIOPEN needs some initialisation for the cache as the default location
       # does not work on LUMI as Lustre does not provide the necessary features.
       export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-$SLURM_NODEID"
       export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH

       if [ $SLURM_LOCALID -eq 0 ] ; then
           rm -rf $MIOPEN_USER_DB_PATH
           mkdir -p $MIOPEN_USER_DB_PATH
       fi
       sleep 2

       # Optional! Set NCCL debug output to check correct use of aws-ofi-rccl (these are very verbose)
       export NCCL_DEBUG=INFO
       export NCCL_DEBUG_SUBSYS=INIT,COLL

       # Set interfaces to be used by RCCL.
       # This is needed as otherwise RCCL tries to use a network interface it has
       # no access to on LUMI.
       export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
       export NCCL_NET_GDR_LEVEL=3

       # Set ROCR_VISIBLE_DEVICES so that each task uses the proper GPU
       export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID

       # Report affinity to check
       echo "Rank $SLURM_PROCID --> $(taskset -p $$); GPU $ROCR_VISIBLE_DEVICES"

       # The usual PyTorch initialisations (also needed on NVIDIA)
       # Note that since we fix the port ID it is not possible to run, e.g., two
       # instances via this script using half a node each.
       export MASTER_ADDR=$(python get-master.py "$SLURM_NODELIST")
       export MASTER_PORT=29500
       export WORLD_SIZE=$SLURM_NPROCS
       export RANK=$SLURM_PROCID
       export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID

       # Run app
       cd /workdir/mnist
       python -u mnist_DDP.py --gpu --modelpath model
What's going on in this script?

The script sets a number of environment variables. Some are fairly standard when using PyTorch on an HPC cluster while others are specific for the LUMI interconnect and architecture or the AMD ROCm environment.

At the start we just print some information about the GPU. We do this only once on each node, which is why we test on $SLURM_LOCALID, a numbering of the tasks in the job that starts from 0 on each node.

The container uses a Conda environment internally. So to make the right version of Python and its packages available, we need to activate the environment. The precise command to activate the environment is stored in $WITH_CONDA and we can just call it by specifying the variable as a bash command.

The MIOPEN_ environment variables are needed to make MIOpen create its caches on /tmp as doing this on Lustre fails because of file locking issues:

export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-$SLURM_NODEID"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH

if [ $SLURM_LOCALID -eq 0 ] ; then
    rm -rf $MIOPEN_USER_DB_PATH
    mkdir -p $MIOPEN_USER_DB_PATH
fi
It is also essential to tell RCCL, the communication library, which network adapters to use. These environment variables start with NCCL_ because ROCm tries to keep things as similar as possible to NCCL in the NVIDIA ecosystem:
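export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=3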
Without this RCCL may try to use a network adapter meant for system management rather than inter-node communications!
We also set ROCR_VISIBLE_DEVICES to ensure that each task uses the proper GPU. This is again based on the local task ID of each Slurm task.

Furthermore some environment variables are needed by PyTorch itself that are also needed on NVIDIA systems.
PyTorch needs to find the master for communication, which is done through the get-master.py script that we created before:
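export MASTER_ADDR=$(python get-master.py "$SLURM_NODELIST")
export MASTER_PORT=29500
export WORLD_SIZE=$SLURM_NPROCS
export RANK=$SLURM_PROCID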
As we fix the port number here, this run-pytorch.sh script has to run on exclusive nodes. Running, e.g., 2 4-GPU jobs on the same node with this command will not work as there will be a conflict for the TCP port for communication on the master as MASTER_PORT is hard-coded in this version of the script.

Make sure the run-pytorch.sh script is executable:
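chmod +x run-pytorch.sh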
-  The mnist example also needs some data files. We can get them in the job script (as we did before) but also simply install them now, avoiding repeated downloads when using the script multiple times (in the example with wrappers it was in the job script to have a one file example). Assuming you do this on the login nodes where the wget program is already available,

       mkdir mnist ; pushd mnist
       wget https://raw.githubusercontent.com/Lumi-supercomputer/lumi-reframe-tests/main/checks/containers/ML_containers/src/pytorch/mnist/mnist_DDP.py
       mkdir -p model ; cd model
       wget https://github.com/Lumi-supercomputer/lumi-reframe-tests/raw/main/checks/containers/ML_containers/src/pytorch/mnist/model/model_gpu.dat
       popd
-  Finally we can create our jobscript, e.g., mnist.slurm, which we will launch from the directory that also contains the mnist subdirectory, the get-master.py and run-pytorch.sh scripts, and the container image.

       #!/bin/bash -e
       #SBATCH --nodes=4
       #SBATCH --gpus-per-node=8
       #SBATCH --tasks-per-node=8
       #SBATCH --cpus-per-task=7
       #SBATCH --output="output_%x_%j.txt"
       #SBATCH --partition=standard-g
       #SBATCH --mem=480G
       #SBATCH --time=00:10:00
       #SBATCH --account=project_<your_project_id>

       CONTAINER=your-container-image.sif

       c=fe
       MYMASKS="0x${c}000000000000,0x${c}00000000000000,0x${c}0000,0x${c}000000,0x${c},0x${c}00,0x${c}00000000,0x${c}0000000000"

       srun --cpu-bind=mask_cpu:$MYMASKS \
         singularity exec \
           -B /var/spool/slurmd \
           -B /opt/cray \
           -B /usr/lib64/libcxi.so.1 \
           -B /usr/lib64/libjansson.so.4 \
           -B $PWD:/workdir \
           $CONTAINER /workdir/run-pytorch.sh
Known restrictions and problems
torchrun cannot be used on LUMI (and many other HPC clusters) as it uses a mechanism to start tasks that does not go through the resource manager of the cluster; if it were enabled, it would allow users to steal resources from other users on shared nodes.
Singularity containers with modules for binding and extras
Install with the EasyBuild-user module in partition/container:
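module load LUMI partition/container EasyBuild-user
eb PyTorch-<version>.eb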
To access module help after installation use module spider PyTorch/<version>
.
EasyConfig:
- EasyConfig PyTorch-2.0.1-rocm-5.5.1-python-3.10-debugsymbols-singularity-20231110.eb, will provide PyTorch/2.0.1-rocm-5.5.1-python-3.10-debugsymbols-singularity-20231110 (with docker definition)

  Contains PyTorch 2.0.1 with torchaudio 2.0.2+31de77d, torchdata 0.6.1+e1feeb2, torchtext 0.15.2a0+4571036 and torchvision 0.15.2a0+fa99a53 GPU version, on Python 3.10 and ROCm 5.5.1.

- EasyConfig PyTorch-2.0.1-rocm-5.5.1-python-3.10-singularity-20231110.eb, will provide PyTorch/2.0.1-rocm-5.5.1-python-3.10-singularity-20231110 (with docker definition)

  Contains PyTorch 2.0.1 with torchaudio 2.0.2+31de77d, torchdata 0.6.1+e1feeb2, torchtext 0.15.2a0+4571036 and torchvision 0.15.2a0+fa99a53 GPU version, on Python 3.10 and ROCm 5.5.1.
- Contains PyTorch 2.0.1 with torchaudio 2.0.2+31de77d, torchdata 0.6.1+e1feeb2, torchtext 0.15.2a0+4571036 and torchvision 0.16.0+a90e584 GPU version, on Python 3.10 and ROCm 5.5.1.

- Contains PyTorch 2.1.0 with torchaudio 2.1.0+420d9ac, torchdata 0.6.1+e1feeb2, torchtext 0.15.2a0+4571036, torchvision 0.16.0+a90e584 GPU version, on Python 3.10 and ROCm 5.6.1.

- Contains PyTorch 2.1.0 with torchaudio 2.1.0+420d9ac, torchdata 0.6.1+e1feeb2, torchtext 0.15.2a0+4571036, torchvision 0.16.0+a90e584 GPU version and DeepSpeed 0.12.3, on Python 3.10 and ROCm 5.6.1.

- Contains PyTorch 2.2.0 with torchaudio 2.2.0, torchdata 0.7.1+cpu, torchtext 0.17.0+cpu, torchvision 0.17.0 GPU version and DeepSpeed 0.12.3, on Python 3.10 and ROCm 5.6.1.

- Contains PyTorch 2.2.0 with torchaudio 2.2.0, torchdata 0.7.1+cpu, torchtext 0.17.0+cpu, torchvision 0.17.0 GPU version, DeepSpeed 0.12.3, flash-attention 2.0.4 and xformers 0.0.25+8dd471d.d20240209, on Python 3.10 and ROCm 5.6.1.
- Contains PyTorch 2.2.0 with torchaudio 2.2.0, torchdata 0.7.1+cpu, torchtext 0.17.0+cpu, torchvision 0.17.0 GPU version, DeepSpeed 0.12.3, flash-attention 2.0.4 and xformers 0.0.25+8dd471d.d20240209, on Python 3.10 and ROCm 5.6.1. The container also fully assists the procedure to add extra packages in a Python virtual environment.

  This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV for initialisation of the conda environment, the Python venv, or both environments respectively.

- Contains PyTorch 2.2.0 with torchaudio 2.2.0, torchdata 0.7.1+cpu, torchtext 0.17.0+cpu, torchvision 0.17.0 GPU version, DeepSpeed 0.12.3, flash-attention 2.0.4 and xformers 0.0.25+8dd471d.d20240209, on Python 3.10 and ROCm 5.6.1. The container also fully assists the procedure to add extra packages in a Python virtual environment.

  As an example of how installation in the virtual environment can be automated through EasyBuild, torchmetrics and pytorch-lightning are installed in the virtual environment.

  This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV for initialisation of the conda environment, the Python venv, or both environments respectively.

  This environment is experimental and only meant as an example of what can be done, but may not be fully functional for everybody. Most users should use the non-exampleVenv versions.
- Contains PyTorch 2.2.2 with torchaudio 2.2.2, torchdata 0.7.1+cpu, torchtext 0.17.2+cpu, torchvision 0.17.2 GPU version, DeepSpeed 0.14.0, flash-attention 2.0.4 and xformers 0.0.26+82368ac.d20240403, on Python 3.10 and ROCm 5.6.1. The container also fully assists the procedure to add extra packages in a Python virtual environment.

  This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV for initialisation of the conda env, Python venv or both environments respectively.

- Contains PyTorch 2.2.2 with torchaudio 2.2.2, torchdata 0.7.1, torchtext 0.17.2+cpu, torchvision 0.17.2 GPU version, DeepSpeed 0.14.0, flash-attention 2.0.4 and xformers 0.0.26+82368ac.d20240403, on Python 3.10 and ROCm 5.6.1. The container also fully assists the procedure to add extra packages in a Python virtual environment.

  This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV for initialisation of the conda env, Python venv or both environments respectively.
- Contains PyTorch 2.2.0 with torchaudio 2.2.0, torchdata 0.7.1+cpu, torchtext 0.17.0+cpu, torchvision 0.17.0 GPU version, torchmetrics 1.3.2, DeepSpeed 0.12.3, flash-attention 2.0.4, xformers 0.0.25+8dd471d.d20240209, and vllm 0.4.0.post1, on Python 3.10 and ROCm 5.6.1. The container also fully assists the procedure to add extra packages in a Python virtual environment.

  This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV for initialisation of the conda env, Python venv or both environments respectively.

  If you experience problems with this container, it may be better to move to a more recent one (see the date encoded as yyyymmdd at the end of the filename).

- Contains PyTorch 2.2.2 with torchaudio 2.2.2, torchdata 0.7.1, torchtext 0.17.2+cpu, torchvision 0.17.2 GPU version, DeepSpeed 0.12.3, flash-attention 2.0.4, xformers 0.0.26+82368ac.d20240403, and vllm 0.4.0.post1, on Python 3.10 and ROCm 5.6.1. The container also fully assists the procedure to add extra packages in a Python virtual environment.

  This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV for initialisation of the conda env, Python venv or both environments respectively.
Technical documentation (user EasyBuild installation)
EasyBuild
Version 1.12.1 (archived)
- The EasyConfig is a LUST development and is based on wheels rather than compiling ourselves, due to the difficulties of compiling PyTorch correctly. We do however use a version of the RCCL library installed through EasyBuild, with the aws-ofi-rccl plugin which is needed to get good performance on LUMI.
- A different version of NumPy was needed than the one in the Cray Python module that is used. It is also installed from a wheel and hence does not use the Cray Scientific Libraries for BLAS support.
Technical documentation (singularity container)
How to check what's in the container?
- The Python, PyTorch and ROCm versions are included in the version of the module.
- To find the version of Python packages, execute
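  singularity exec $SIF bash -c '$WITH_CONDA ; pip list'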
after loading the module. This can even be done on the login nodes. It will return information about all Python packages.
- Deepspeed:
  - Leaves a script 'deepspeed' in /opt/miniconda3/envs/pytorch/bin
  - Leaves packages in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/deepspeed
  - Finding the version:
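      singularity exec $SIF bash -c '$WITH_CONDA ; pip show deepspeed'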
    or the clumsy way without pip:

      singularity exec $SIF bash -c \
        'grep "version=" /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/deepspeed/git_version_info_installed.py'

    (Test can be done after loading the module on a login node.)
- flash-attention and its fork, the ROCm port:
  - Leaves a flash_attn and corresponding flash_attn-<version>.dist-info subdirectory in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages.
  - To find the version:
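      singularity exec $SIF bash -c '$WITH_CONDA ; pip list | grep -i flash'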
    or the clumsy way without pip:

      singularity exec $SIF bash -c \
        'grep "__version__" /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/flash_attn/__init__.py'

    (Test can be done after loading the module on a login node.)

  - To run a benchmark:

- xformers:
  - Leaves an xformers and corresponding xformers-<version>.dist-info subdirectory in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages.
  - To find the version:
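      singularity exec $SIF bash -c '$WITH_CONDA ; pip show xformers'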
    or the clumsy way without pip:

      singularity exec $SIF bash -c \
        'grep "__version__" /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/xformers/version.py'

    (Test can be done after loading the module on a login node.)
  - Checking the features of xformers:
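      singularity exec $SIF bash -c '$WITH_CONDA ; python -m xformers.info'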
Archived EasyConfigs
The EasyConfigs below are additional EasyConfigs that are not directly available on the system for installation. Users are advised to use the newer ones, and these archived ones are unsupported. They are still provided as a source of information should you need it, e.g., to understand the configuration that was used for earlier work on the system.
- Archived EasyConfigs from LUMI-EasyBuild-contrib - previously user-installable software