The memory pyramid
We have now seen several types of memory, and they are structured in a hierarchy. The typical hierarchy for a regular CPU is depicted above.
- Each core has registers for integer data, floating point data and/or vectors, and some special registers. The architectural register set is rather small, and since this limits parallel execution inside the processor, modern processors have a lot of hidden register space that is used through an automatic mechanism called "register renaming". E.g., the AMD Rome or Milan processors have less than 1 kB of register space in the instruction set, but the total physical register space (used with register renaming) is over 6 kB.
- Processors typically have separate L1 caches for instructions and for data (the data cache is sometimes bypassed when using vector instructions).
- The L2 cache in processors for scientific computing is typically private to each core.
- The L3 cache is shared by a number of cores, sometimes even by all cores in a socket.
- The next level in the hierarchy is the main RAM memory, which due to the NUMA architecture may appear as multiple levels with different latency and bandwidth (the sketch after this list shows how to inspect the cache and NUMA layout of a node).
- The last level is one that will actually be deactivated on many supercomputers as it is too slow: when RAM memory is exhausted, the OS can swap some of its content to disk storage. But even without swapping enabled, we can still use the disk as additional storage space, though then we need to manage it ourselves in software. Some supercomputers offer several types of disk storage: typically a smaller volume based on SSD technology and a slower but larger volume based on regular hard disks.
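The cache and NUMA levels described in this list can be inspected on any Linux node by reading sysfs. The fragment below is only a minimal sketch using standard Linux sysfs paths (no cluster-specific assumptions); it prints the size and sharing of each cache level seen by core 0 and lists the online NUMA nodes:

```c
/* Print the cache hierarchy of CPU 0 and the NUMA nodes of a Linux node
 * by reading sysfs.  Compile with:  cc -O2 -o topo topo.c              */
#include <stdio.h>
#include <string.h>

/* Read one small sysfs file into buf; returns 0 on success. */
static int read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, (int)len, f)) { fclose(f); return -1; }
    buf[strcspn(buf, "\n")] = '\0';          /* strip trailing newline */
    fclose(f);
    return 0;
}

int main(void)
{
    char path[256], level[32] = "", type[32] = "", size[32] = "", shared[256] = "";

    /* Each indexN directory describes one cache visible to CPU 0. */
    for (int i = 0; i < 16; ++i) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", i);
        if (read_line(path, level, sizeof level) != 0)
            break;                           /* no more cache levels */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/type", i);
        read_line(path, type, sizeof type);
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/size", i);
        read_line(path, size, sizeof size);
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", i);
        read_line(path, shared, sizeof shared);
        printf("L%s %-11s %8s  shared by CPUs %s\n", level, type, size, shared);
    }

    /* The NUMA nodes are the "multiple levels" of RAM mentioned above. */
    char nodes[256];
    if (read_line("/sys/devices/system/node/online", nodes, sizeof nodes) == 0)
        printf("online NUMA nodes: %s\n", nodes);
    return 0;
}
```

The command-line tools `lscpu` and `numactl --hardware` report essentially the same information without writing any code.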
Even this picture is still oversimplified. Processors have other types of small caches, but those are harder to exploit through our programming style, so we do not really take them into account here. RAM memory may also consist of multiple layers instead of the single one depicted in the picture.
- The Intel Xeon Max 9xxx series has 64 GB of high-bandwidth RAM on each socket and then up to 1 TB or more of lower-bandwidth memory connected externally to the socket. This chip is used in a small subsection of the EuroHPC pre-exascale system MareNostrum 5 (and was also planned for the pre-exascale system Leonardo, but it is not clear at the moment whether it will be in the final machine).
- Recently, for large-memory computers and especially for database systems, a new type of RAM package was proposed that is connected to an expansion card on the CXL bus, an evolution of the PCIe bus (and IBM has its own alternative in its CPUs). Obviously this memory sits even further away from the CPU, so we can expect a higher latency and less bandwidth. Given that scientific computing applications are often already limited by memory latency and/or memory bandwidth, this memory type may never become very useful in supercomputers. How a program can target such a memory tier explicitly is sketched after this list.
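Memory tiers such as the high-bandwidth memory of the Xeon Max or CXL-attached memory are typically exposed by Linux as additional (often CPU-less) NUMA nodes. As a sketch of how a program could then place data in one specific tier, the fragment below uses the standard libnuma API (link with `-lnuma`); the node number 1 is just an example here, the actual numbering depends on the machine configuration:

```c
/* Allocate a buffer on an explicitly chosen NUMA node with libnuma.
 * Compile with:  cc -O2 -o place place.c -lnuma                     */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int target_node = 1;             /* example: pick a memory tier by node number */
    if (target_node > numa_max_node()) {
        fprintf(stderr, "node %d does not exist on this machine\n", target_node);
        return 1;
    }

    size_t bytes = 1UL << 28;        /* 256 MiB */
    double *buf = numa_alloc_onnode(bytes, target_node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", target_node);
        return 1;
    }

    /* Touch the pages so they are actually allocated on the chosen node. */
    for (size_t i = 0; i < bytes / sizeof(double); ++i)
        buf[i] = 0.0;

    printf("placed %zu MiB on NUMA node %d\n", bytes >> 20, target_node);
    numa_free(buf, bytes);
    return 0;
}
```

Without touching the code, the same effect can be obtained for a whole process with `numactl --membind=<node> ./program`.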
Let us compare the memory hierarchy of the "Leibniz" cluster and the AMD Rome section of Vaughan, both part of the UAntwerpen infrastructure in 2023, with Hawk, one of the largest CPU-only systems in Europe, installed at HLRS in Germany and also based on the AMD Rome CPU:
Leibniz "Broadwell" | Vaughan "Rome" | Hawk "Rome" | |
---|---|---|---|
architectural registers | <1 kiB/HW thread | <1 kiB/HW thread | <1 kiB/HW thread |
physical registers | ~7 kiB | ~6 kiB | ~6 kiB |
L1 cache | 2 x 32 kiB/core (I+D) | 2 x 32 kiB/core (I+D) | 2 x 32 kiB/core (I+D) |
L2 cache | 256 kiB/core | 512 kiB/core | 512 kiB/core |
L3 cache | 35 MiB/socket | 128 MiB/socket | 256 MiB/socket |
L3 organization | 35 MiB/socket | 16 MiB / 4 cores | 16 MiB / 4 cores |
RAM | 128 or 256 GiB/node | 256 GiB/node | 256 GiB/node |
disk swap | disabled | disabled | disabled |
disk storage | 600 TB | 600 TB | 26 PB |
size cluster | 152 nodes | 152 nodes | 5632 nodes |
So we see that there is in fact very little difference between the memory hierarchy of a university-level supercomputer such as Leibniz or Vaughan and that of a very large national or international supercomputer. However, the total size is very different.
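The cache sizes in the table can also be observed experimentally: repeatedly sum a buffer and grow it until it no longer fits in L1, L2 or L3, and the time per element steps up. The following is only a rough sketch (hardware prefetching smooths the steps and the exact numbers differ per machine), but the transitions roughly match the cache sizes listed above:

```c
/* Time repeated traversal of buffers of increasing size.  When the
 * working set no longer fits in L1, L2 or L3, the time per element rises.
 * Compile with:  cc -O2 -o sweep sweep.c                                  */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    /* Working-set sizes from 16 kiB up to 256 MiB. */
    for (size_t kib = 16; kib <= 256 * 1024; kib *= 2) {
        size_t n = kib * 1024 / sizeof(long);
        long *a = malloc(n * sizeof(long));
        if (!a) return 1;
        for (size_t i = 0; i < n; ++i) a[i] = (long)i;   /* first touch */

        /* Keep the total amount of work roughly constant for every size. */
        size_t reps = (256UL * 1024 * 1024 / sizeof(long)) / n;
        long sum = 0;
        double t0 = now();
        for (size_t r = 0; r < reps; ++r)
            for (size_t i = 0; i < n; ++i)
                sum += a[i];
        double t1 = now();
        volatile long sink = sum; (void)sink;            /* keep the loop alive */

        printf("%8zu kiB : %6.2f ns/element\n",
               kib, (t1 - t0) * 1e9 / ((double)reps * n));
        free(a);
    }
    return 0;
}
```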