ASPECT performance on AMD Rome processors?

I have an opportunity to buy into a cluster with AMD Rome processors (2 GHz). The nodes will have 64 cores per socket, two sockets per node, and 256 GB of memory per node (2 GB per core). Does anyone have experience running ASPECT (or other deal.II codes) on this processor/node configuration? My concern is that 64 cores per socket could be a bottleneck at the socket and/or bus level of the board (or whatever one calls it these days). Looking for advice.
Scott

Hi Scott,

The newest cluster at Davis uses the AMD Epyc 7351 chip and I believe has the same number of cores per socket.

After resolving an issue with the MKL configuration on the AMD chips, we got excellent scaling results on a single node. We also had to upgrade to Open MPI 4.0.1 to get stable runtime behavior across multiple nodes, but things now appear to be running well.

Max Rudolph uses this cluster for very large simulations and he may be able to provide more useful or up-to-date information.

Cheers,
John

@sdk What John says is correct: our new cluster uses AMD Epyc chips. Our sysadmin told me that the next generation of these processors is even faster. The processors we have are 16 cores per socket (32 hardware threads), with 2 sockets per node. With ASPECT, I’ve found that using 32 tasks per node (one per physical core) gives the best performance; 64 tasks per node is slightly slower. If there is a specific test or result that you’d like me to run, I’m happy to do so. I ran Rene’s aspect-performance-statistics on this cluster and on the other machine that I use frequently (a Xeon cluster at Portland State) and could post some comparison plots.
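For anyone who wants to repeat the tasks-per-node comparison on their own machine, here is a minimal sketch of a timing sweep. It assumes an `mpirun` launcher on the path and uses a hypothetical binary path `./aspect` and a hypothetical input file `model.prm`; on a scheduled cluster you would likely swap in your site's launcher and job script instead.

```python
import shlex
import subprocess
import time


def mpirun_command(ranks, binary="./aspect", prm="model.prm"):
    """Build an mpirun command line for the given number of MPI ranks.

    The binary and parameter-file names are placeholders, not real paths.
    """
    return ["mpirun", "-np", str(ranks), binary, prm]


def time_command(cmd, run=subprocess.run):
    """Return wall-clock seconds taken by `cmd`.

    `run` is injectable so the sweep can be dry-run without launching MPI.
    """
    start = time.perf_counter()
    run(cmd, check=True)
    return time.perf_counter() - start


def sweep(rank_counts, run=subprocess.run):
    """Time one run per rank count and return {ranks: seconds}."""
    return {n: time_command(mpirun_command(n), run=run) for n in rank_counts}


if __name__ == "__main__":
    # Dry run: only print the commands that a real sweep would launch.
    for n in (16, 32, 64):
        print(shlex.join(mpirun_command(n)))
```

Calling `sweep([16, 32, 64])` on a single node would then give one wall time per rank count. Total wall time mixes assembly, solver, and I/O, so for the comparison discussed here the Stokes solver time from ASPECT's own timing output is probably the more useful number.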

Hi Scott,
The topic is a bit complicated because of the large number of cores per socket on your system. In general, as John and Max have written, the performance of recent AMD processors is great (Juliane just bought a workstation with Epycs too; no Romes were available at the time we ordered it). But because the ASPECT Stokes solver with the AMG preconditioner is memory-bound, you may indeed observe that running on 32 cores is just as fast as running on 64 cores (as long as all cores are on one socket and you only look at Stokes solver time). That said, the performance per socket (independent of the number of cores) of the AMD chips should always be higher than that of their Intel counterparts, because they use an 8-channel memory interface (instead of 6 for Intel).
If you use the new GMG preconditioner, the story changes but is also complicated. On the one hand, the memory limitation is removed, which benefits the high core count of the AMD chips; on the other hand, the AMD chips do not yet support AVX-512 (which Intel chips do), which would nearly double the computations per cycle.
So in summary, AMD chips work great, but whether they are faster or slower than Intel chips depends on a number of things, such as bandwidth per core. If you plot performance against the number of sockets used, AMD should always be ahead at the moment.
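To make the memory-bound argument concrete, here is a toy model of how a bandwidth-limited solver scales within one socket. It is only a sketch under stated assumptions: the same DDR4 transfer rate (3200 MT/s) on both platforms, 8 bytes per transfer per channel, and a made-up per-core demand of 6.4 GB/s picked purely for illustration. None of these numbers are measurements of ASPECT.

```python
def socket_bandwidth_gbs(channels, mt_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth of one socket in GB/s."""
    return channels * mt_per_s * bytes_per_transfer / 1000.0


def memory_bound_speedup(cores, per_core_demand_gbs, socket_gbs):
    """Speedup over one core when every core wants a fixed bandwidth
    and the socket caps the total that can be delivered."""
    delivered = min(cores * per_core_demand_gbs, socket_gbs)
    return delivered / per_core_demand_gbs


if __name__ == "__main__":
    # Assumption: identical DIMM speed, so only the channel count differs.
    amd = socket_bandwidth_gbs(channels=8, mt_per_s=3200)
    intel = socket_bandwidth_gbs(channels=6, mt_per_s=3200)
    demand = 6.4  # hypothetical GB/s per core, illustration only

    for cores in (16, 32, 64):
        print(cores, memory_bound_speedup(cores, demand, amd))
    print("bandwidth per core at 64 cores (GB/s):", amd / 64)
    print("AMD / Intel peak bandwidth per socket:", amd / intel)
```

In this model the speedup grows linearly until the socket's bandwidth is exhausted and is flat afterwards, which is the pattern described above: adding cores beyond the saturation point does not help a memory-bound solver, while more memory channels raise the ceiling for the whole socket.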

Rene, Max, and John,

Thanks a lot for the info. It helps. The Intel vs. AMD issue is not open for negotiation: this would be buying a “share” of a big cluster with dedicated queue access. My question was really whether the architecture could handle running all the cores flat out, and it seems that, much like on the Intel machines I’ve been using, the answer is no. Still, I think it’s a good deal for me.

Scott