Coupling machine learning with HPC simulation

Presenter: Harvey Richardson (HPE)
Co-author: Alessandro Rigazzi (formerly HPE)

Extra materials

Presentation slides
Recent LUMI User Coffee Break seminar on SmartSim
Recent LUMI SmartSim training
References from the slides:
- Language Interoperability
  - f2py and fmodpy intro
  - forpy
  - pybind11
  - Talk: "Reducing the overhead of coupling machine learning models between Python and Fortran" YouTube video and slides
- Interoperability at framework level:
  - FTorch: Torch inference library for Fortran on GitHub
  - Fortran Keras Bridge: TensorFlow-based but not very active, on GitHub and paper on arXiv
- CrayLabs SmartSim:
- Examples
  - AI and Weather Forecasting at ECMWF
  - Using SmartSim in MOM6 for turbulence modeling: Partee et al. [2022]: Using Machine Learning at scale in numerical simulations with SmartSim: An application to ocean climate modeling
  - Andrew Mole et al. [2025]: Reinforcement Learning Increases Wind Farm Power Production by Enabling Closed-Loop Collaborative Control
  - Co-development of a SmartSim module for OpenFOAM
  - Examples from Density Functional Theory
    - Pure non-local machine-learned density functional theory for electron correlation
    - Large-Scale Materials Modeling at Quantum Accuracy
  - DeepDriveMD example of selecting "promising" protein folding solutions from an ensemble
    - Paper DeepDriveMD: Deep-Learning Driven Adaptive Molecular Simulations for Protein Folding
  - Nobel Prize in Chemistry 2024 for computational protein design and protein structure prediction, using AlphaFold and other tools
    - Nobel press release and additional technical document 1 and additional technical document 2

Q&A

In the workflow shown in the slides, the simulation sends tensors to SmartRedis, the ML model processes them, and the results are sent back to the simulation.

If we scale this to many simulation replicas or large tensor data, what usually becomes the main bottleneck: data transfer through SmartRedis, the in-memory database/orchestrator, the ML model execution, or the placement of tasks in Slurm?
- There are two bottlenecks: the TCP networking, and the fact that the orchestrator, as currently architected, can use only one GPU. To work around the latter, you would have to make the model a client. Both issues will be addressed in a new version that is being designed in collaboration with two US universities; that change is likely to happen in the next couple of months.
More specifically, how should we profile such a coupled simulation–AI workflow to distinguish between communication overhead, database/orchestrator saturation, and GPU model-inference bottlenecks?
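As a first cut at the profiling question above, and only as a minimal sketch rather than a recommendation from the presenters, one could accumulate wall-clock time separately for the send, inference, and receive phases of each coupling step. Everything in the block is an assumption made for illustration: `PhaseTimer`, `InMemoryStore`, `toy_model`, and `run_coupled_steps` are hypothetical names, and `InMemoryStore` is a stand-in for the database, not the SmartRedis API. In a real run, the three `with` blocks would wrap the actual client calls that send tensors, trigger the model, and retrieve results.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time and call counts per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return the fraction of measured time spent in each phase."""
        total = sum(self.totals.values()) or 1.0
        return {name: t / total for name, t in self.totals.items()}


class InMemoryStore:
    """Hypothetical stand-in for the in-memory database (not SmartRedis)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = list(value)  # copy, to mimic a transfer

    def get(self, key):
        return self._data[key]


def toy_model(values):
    """Placeholder for ML inference: doubles every element."""
    return [2.0 * v for v in values]


def run_coupled_steps(n_steps, timer=None):
    """Run n_steps of send -> inference -> receive, timing each phase."""
    if timer is None:
        timer = PhaseTimer()
    store = InMemoryStore()
    field = [float(i) for i in range(1000)]
    for step in range(n_steps):
        with timer.phase("send"):
            store.put(f"in_{step}", field)
        with timer.phase("inference"):
            store.put(f"out_{step}", toy_model(store.get(f"in_{step}")))
        with timer.phase("receive"):
            field = store.get(f"out_{step}")
    return timer, field


if __name__ == "__main__":
    timer, _ = run_coupled_steps(10)
    for name, fraction in timer.report().items():
        print(f"{name}: {fraction:.1%} of measured time")
```

Per-phase timers like this only show where the simulation side waits. Telling network overhead apart from database saturation or GPU-bound inference would still need measurements on the database and GPU side, for example by repeating the run with different tensor sizes and numbers of clients.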

Are there recommended best practices on LUMI for placing the simulation, SmartSim database, and ML model processes to reduce data-movement overhead? For example, should the database/orchestrator be placed close to the simulation tasks, close to the GPU model workers, or distributed in some way when scaling to many ensemble members?
- We organized a SmartSim workshop, and all the slides and lectures are published here: https://lumi-supercomputer.github.io/LUMI-training-materials/smartsim-20260331/
- If you have a scenario where something like SmartSim could be appropriate, please get in touch and we can connect you with the developers, who I'm sure would be happy to discuss it further. You can do this at any time by submitting a LUMI ticket and asking to be connected to the HPE CoE people.