Exercises 2: Running jobs with Slurm

Exercises on the Slurm allocation modes
- Run a single task on the CPU partition with `srun` using multiple CPU cores. Inspect the default task allocation with the `taskset` command (`taskset -cp $$` will show you the CPU numbers allocated to the current process).

  Click to see the solution.

  ```
  srun --partition=small --nodes=1 --tasks=1 --cpus-per-task=16 --time=5 --account=<project_id> bash -c 'taskset -cp $$'
  ```

  Note that you need to replace `<project_id>` with your actual project account ID, which has the form `project_` followed by a 9-digit number.

  The command runs a single process (a `bash` shell invoking the native Linux `taskset` tool, which shows the process's CPU affinity) on a compute node. You can use the `man taskset` command to see how the tool works.
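  As a variation (a sketch only; adjust the partition, core counts and `<project_id>` to your needs), you can launch several tasks at once and let each one report its own affinity via the `SLURM_PROCID` variable that Slurm sets for every task:

  ```
  # Each of the 4 tasks prints its rank and the CPU cores it is bound to
  srun --partition=small --nodes=1 --tasks=4 --cpus-per-task=4 --time=5 --account=<project_id> \
       bash -c 'echo "task ${SLURM_PROCID}: $(taskset -cp $$)"'
  ```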
- Try Slurm allocations with the `hybrid_check` tool from the LUMI Software Stack. The program is preinstalled on the system.

  Use the simple job script below to run a parallel program with multiple tasks (MPI ranks) and threads (OpenMP). Test the task/thread affinity with an `sbatch` submission on the CPU partition.

  ```
  #!/bin/bash -l
  #SBATCH --partition=small       # Partition (queue) name
  #SBATCH --nodes=1               # Total number of nodes
  #SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
  #SBATCH --cpus-per-task=16      # 16 threads per task
  #SBATCH --time=5                # Run time (minutes)
  #SBATCH --account=<project_id>  # Project for billing

  module load LUMI/22.12
  module load lumi-CPEtools

  srun hybrid_check -n -r
  ```

  Be careful when copying and pasting the script body, as it may break some special characters.

  Click to see the solution.

  Save the script contents into a `job.sh` file (you can use the `nano` console text editor, for instance) and remember to use a valid project account name.

  Submit the job script using the `sbatch` command:

  ```
  sbatch job.sh
  ```

  The job output is saved in the `slurm-<job_id>.out` file. You can view its contents with either the `less` or `more` shell command.

  The actual task/thread affinity may depend on the specific OpenMP runtime, but you should see "block" thread affinity as the default behaviour.
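  A minimal submit-and-inspect sequence might look like the following (a sketch; `<job_id>` stands for the number printed by `sbatch`):

  ```
  sbatch job.sh              # prints: Submitted batch job <job_id>
  squeue -u $USER            # check the job state while it is queued or running
  less slurm-<job_id>.out    # inspect the affinity report once the job has finished
  ```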
- Improve the thread affinity with OpenMP runtime variables. Alter your script and add an MPI runtime variable to see another CPU mask summary.

  Click to see the solution.

  Export the `SRUN_CPUS_PER_TASK` environment variable in your script to follow the convention of recent Slurm versions. Add this line before the `hybrid_check` call:

  ```
  export SRUN_CPUS_PER_TASK=16
  ```

  Add the OpenMP environment variable definitions to your script:

  ```
  export OMP_NUM_THREADS=${SRUN_CPUS_PER_TASK}
  export OMP_PROC_BIND=close
  export OMP_PLACES=cores
  ```

  You can also add an MPI runtime variable to see another CPU mask summary:

  ```
  export MPICH_CPUMASK_DISPLAY=1
  ```

  Note that the `hybrid_check` report and the MPICH CPU masks may not be consistent, which can be confusing. A full modified script is sketched below.
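  For reference, the whole modified job script could look roughly like this (a sketch, assuming the same 8 tasks × 16 cores layout as above):

  ```
  #!/bin/bash -l
  #SBATCH --partition=small       # Partition (queue) name
  #SBATCH --nodes=1               # Total number of nodes
  #SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
  #SBATCH --cpus-per-task=16      # 16 threads per task
  #SBATCH --time=5                # Run time (minutes)
  #SBATCH --account=<project_id>  # Project for billing

  module load LUMI/22.12
  module load lumi-CPEtools

  # Follow the convention of recent Slurm versions
  export SRUN_CPUS_PER_TASK=16

  # OpenMP thread count and affinity
  export OMP_NUM_THREADS=${SRUN_CPUS_PER_TASK}
  export OMP_PROC_BIND=close
  export OMP_PLACES=cores

  # Ask Cray MPICH to print its own view of the CPU masks
  export MPICH_CPUMASK_DISPLAY=1

  srun hybrid_check -n -r
  ```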
- Build the `hello_jobstep` program using an interactive shell on a GPU node. You can pull the source code from the git repository https://code.ornl.gov/olcf/hello_jobstep.git; it uses a `Makefile` for building. Try to run the program interactively.

  Click to see the solution.

  Clone the code using the `git` command:

  ```
  git clone https://code.ornl.gov/olcf/hello_jobstep.git
  ```

  It will create a `hello_jobstep` directory containing the source code and the `Makefile`.

  Allocate resources for a single task with a single GPU using `salloc`:

  ```
  salloc --partition=small-g --nodes=1 --tasks=1 --cpus-per-task=1 --gpus-per-node=1 --time=10 --account=<project_id>
  ```

  Note that, after the allocation has been granted, you receive a new shell, but you are still on the login node. You need to use `srun` to execute commands on the allocated node.

  Start an interactive session on the GPU node:

  ```
  srun --pty bash -i
  ```

  Now you are on the compute node. The `--pty` option for `srun` is required to interact with the remote shell.

  Enter the `hello_jobstep` directory and issue the `make` command. It will fail without additional options and modules:

  ```
  module load rocm
  ```

  Note that the compiler (and the entire programming environment) is the one you have set (or not) in the originating shell on the login node. Nevertheless, the `rocm` module is required to build code for the GPU.

  ```
  make LMOD_SYSTEM_NAME="frontier"
  ```

  You need to add the `LMOD_SYSTEM_NAME="frontier"` variable for `make` because the code originates from the Frontier system. As an extra exercise, you can fix the `Makefile` and enable it for LUMI :)

  Eventually you can just execute the `./hello_jobstep` binary to see how it behaves:

  ```
  ./hello_jobstep
  ```

  Note that executing the program with `srun` inside the `srun` interactive session will result in a hang; you need to add the `--overlap` option to `srun` to mitigate this, as sketched below.
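  A minimal sketch of launching the program as a separate job step from within the interactive session (assuming you are still inside the shell obtained with `srun --pty bash -i`):

  ```
  # --overlap lets this job step share the resources already used by the interactive step
  srun --overlap --ntasks=1 ./hello_jobstep
  ```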
  Still remember to terminate your interactive session with the `exit` command:

  ```
  exit
  ```
Slurm custom binding on GPU nodes
- Allocate one GPU node with one task per GPU and bind the tasks to the CCDs (8-core groups sharing an L3 cache), leaving the first (#0) and last (#7) core of each CCD unused. Run a program with 6 threads per task and inspect the actual task/thread affinity.

  Click to see the solution.

  Begin with the example from the slides, which uses 7 cores per task:

  ```
  #!/bin/bash -l
  #SBATCH --partition=standard-g  # Partition (queue) name
  #SBATCH --nodes=1               # Total number of nodes
  #SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
  #SBATCH --gpus-per-node=8       # Allocate one gpu per MPI rank
  #SBATCH --time=5                # Run time (minutes)
  #SBATCH --account=<project_id>  # Project for billing
  #SBATCH --hint=nomultithread

  cat << EOF > select_gpu
  #!/bin/bash
  export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
  exec \$*
  EOF
  chmod +x ./select_gpu

  CPU_BIND="mask_cpu:0xfe000000000000,0xfe00000000000000,"
  CPU_BIND="${CPU_BIND}0xfe0000,0xfe000000,"
  CPU_BIND="${CPU_BIND}0xfe,0xfe00,"
  CPU_BIND="${CPU_BIND}0xfe00000000,0xfe0000000000"

  export OMP_NUM_THREADS=7
  export OMP_PROC_BIND=close
  export OMP_PLACES=cores
  export MPICH_CPUMASK_DISPLAY=1

  srun --cpu-bind=${CPU_BIND} ./select_gpu ./hello_jobstep/hello_jobstep
  ```

  If you save the script as `job_step.sh`, simply submit it with `sbatch` and inspect the job output.

  Now you need to alter the masks to additionally disable core #7 (the last core) of each group (CCD). The base mask then becomes `01111110` in binary, which is `0x7e` in hexadecimal notation.

  Try to apply the new bitmask, change the corresponding variable to spawn 6 threads per task, and check how the new binding works. A possible set of modified lines is sketched below.