Hello! I recently got into using ASPECT and have been testing different cookbook examples since then. Most of them work just fine with any execution configuration, but at least the cookbook example crustal_model_2D.prm gets stuck at the very beginning when I try to run it with the release build and more than one MPI process (which correspond to physical CPU cores in my case).
By getting stuck I mean it never gets past the initial output reporting the ASPECT version, number of MPI processes, etc. However, the debug build executable works as expected with crustal_model_2D.prm using any number of MPI processes. Also, the release build works with the other examples I have tried, including with more than one MPI process.
I suspect this might be related to the issue mentioned in ASPECT manual section 4.5.3, although the output files are rather small. The only tip I found there was setting Number of grouped files to 1, but that didn't work out either. Our cluster uses an Infiniband interconnect with the MVAPICH2 implementation of MPI, and 16 CPU cores per node, if that matters.
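For reference, this is roughly how I set it; I believe the parameter lives under Postprocess / Visualization, but please correct me if I put it in the wrong section:

subsection Postprocess
  subsection Visualization
    # Group visualization output into a single file
    # (0 would write one file per MPI process)
    set Number of grouped files = 1
  end
end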
I only ran into this issue today, so I'm going to keep investigating, but if anyone has any (known) solutions or tips, I'd appreciate hearing them.
Sure. When it gets stuck, the output is nothing but:
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
-- . version 2.3.0
-- . using deal.II 9.3.3
-- . with 32 bit indices and vectorization level 1 (128 bits)
-- . using Trilinos 12.18.1
-- . using p4est 2.3.2
-- . running in OPTIMIZED mode
-- . running with 2 MPI processes
-----------------------------------------------------------------------------
detailed.log for release build:
###
#
# ASPECT configuration:
# ASPECT_VERSION: 2.3.0
# GIT REVISION: ()
# CMAKE_BUILD_TYPE: Release
#
# DEAL_II_DIR: /data/home/leevi/software/dealii-candi/deal.II-v9.3.3/lib/cmake/deal.II
# DEAL_II VERSION: 9.3.3
# ASPECT_USE_PETSC: OFF
# ASPECT_USE_FP_EXCEPTIONS: ON
# ASPECT_RUN_ALL_TESTS: OFF
# ASPECT_USE_SHARED_LIBS: ON
# ASPECT_HAVE_LINK_H: ON
# ASPECT_WITH_LIBDAP: OFF
# ASPECT_WITH_WORLD_BUILDER: ON /data/home/leevi/aspect-2.3.0/contrib/world_builder
# ASPECT_PRECOMPILE_HEADERS: ON
# ASPECT_UNITY_BUILD: ON
#
# CMAKE_INSTALL_PREFIX: /usr/local
# CMAKE_SOURCE_DIR: /data/home/leevi/aspect-2.3.0
# CMAKE_BINARY_DIR: /data/home/leevi/aspect-2.3.0/build
# CMAKE_CXX_COMPILER: Intel 20.2.5.20211109 on platform Linux x86_64
# /apps/local/mvapich2-2.3.7/bin/mpicxx
# PARAMETER_GUI_EXECUTABLE: PARAMETER_GUI_EXECUTABLE-NOTFOUND
#
# LINKAGE: DYNAMIC
#
# COMPILE_FLAGS:
#
# _WITH_CXX14: ON
# _WITH_CXX17: FALSE
# _MPI_VERSION: 3.1
# _WITH_64BIT_INDICES: OFF
#
###
…and for debug build:
###
#
# ASPECT configuration:
# ASPECT_VERSION: 2.3.0
# GIT REVISION: ()
# CMAKE_BUILD_TYPE: Debug
#
# DEAL_II_DIR: /data/home/leevi/software/dealii-candi/deal.II-v9.3.3/lib/cmake/deal.II
# DEAL_II VERSION: 9.3.3
# ASPECT_USE_PETSC: OFF
# ASPECT_USE_FP_EXCEPTIONS: ON
# ASPECT_RUN_ALL_TESTS: OFF
# ASPECT_USE_SHARED_LIBS: ON
# ASPECT_HAVE_LINK_H: ON
# ASPECT_WITH_LIBDAP: OFF
# ASPECT_WITH_WORLD_BUILDER: ON /data/home/leevi/aspect-2.3.0/contrib/world_builder
# ASPECT_PRECOMPILE_HEADERS: ON
# ASPECT_UNITY_BUILD: ON
#
# CMAKE_INSTALL_PREFIX: /usr/local
# CMAKE_SOURCE_DIR: /data/home/leevi/aspect-2.3.0
# CMAKE_BINARY_DIR: /data/home/leevi/aspect-2.3.0/build
# CMAKE_CXX_COMPILER: Intel 20.2.5.20211109 on platform Linux x86_64
# /apps/local/mvapich2-2.3.7/bin/mpicxx
# PARAMETER_GUI_EXECUTABLE: PARAMETER_GUI_EXECUTABLE-NOTFOUND
#
# LINKAGE: DYNAMIC
#
# COMPILE_FLAGS:
#
# _WITH_CXX14: ON
# _WITH_CXX17: FALSE
# _MPI_VERSION: 3.1
# _WITH_64BIT_INDICES: OFF
#
###
Nothing looks wrong at first glance. Does mpirun -n 2 ./aspect -v exit correctly after printing the header? Are there different filesystems available on your machine that you can try running on? Did you try the "convection-box.prm" cookbook?
Okay, I got it working. The solution was increasing X repetitions and defining Y repetitions. The values of both parameters must also be increased in order to get crustal_model_2D running with a larger number of MPI processes.
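In case it helps someone else, the change was along these lines in the Geometry model section of the .prm file; the extents below are just placeholders, not the cookbook's actual values:

subsection Geometry model
  set Model name = box
  subsection Box
    # Placeholder extents -- keep whatever the cookbook already uses
    set X extent      = 100e3
    set Y extent      = 20e3
    # More repetitions = more coarse cells, so the initial mesh can be
    # partitioned across several MPI processes
    set X repetitions = 10
    set Y repetitions = 2
  end
end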
So I believe this is fundamentally a hardware issue, i.e. some mismatch between the interconnect and the disks, and the grid being too coarse then somehow gets the MPI communication stuck.