ASPECT hangs at random spots when using more than one node

Hello all,

I recently installed ASPECT on a new institutional cluster.

– This is ASPECT, the Advanced Solver for Problems in Earth’s ConvecTion.
– . version 2.5.0-pre (main, 4c20fe0)
– . using deal.II 9.4.2
– . with 32 bit indices and vectorization level 1 (128 bits)
– . using Trilinos 13.2.0
– . using p4est 2.3.2
– . running in OPTIMIZED mode
– . running with 128 MPI processes

The code (including the cookbooks) runs well on a single node with 16 to 56 cores, but as soon as I use more than one node, ASPECT randomly hangs at different time steps, with or without adaptive mesh refinement. Both the OPTIMIZED and DEBUG versions exhibit the same behavior. For example, it hangs at:

Number of mesh deformation degrees of freedom: 451128
*** Timestep 56: t=2.79614e+06 years, dt=46144.5 years
Solving mesh displacement system… 13 iterations.

or at

Number of mesh deformation degrees of freedom: 451128
*** Timestep 13: t=650000 years, dt=50000 years
Solving mesh displacement system… 13 iterations.
Solving temperature system… 15 iterations.
Solving noninitial_plastic_strain system … 20 iterations.
Solving plastic_strain system … 19 iterations.
Solving crust_upper system … 16 iterations.
Solving crust_lower system … 17 iterations.
Solving mantle_lithosphere system … 15 iterations.
Solving asthenosphere system … 15 iterations.

or at

Number of mesh deformation degrees of freedom: 87039
Solving mesh displacement system… 0 iterations.
*** Timestep 0: t=0 years, dt=0 years
Solving mesh displacement system… 0 iterations.
Solving temperature system… 0 iterations.
Skipping noninitial_plastic_strain composition solve because RHS is zero.
Solving plastic_strain system … 0 iterations.
Solving crust_upper system … 0 iterations.
Solving crust_lower system … 0 iterations.
Solving mantle_lithosphere system … 0 iterations.
Solving asthenosphere system … 0 iterations.
Rebuilding Stokes preconditioner…
Solving Stokes system… 57+0 iterations.
Relative nonlinear residual (Stokes system) after nonlinear iteration 1: 1

Rebuilding Stokes preconditioner…
Solving Stokes system… 64+0 iterations.
Relative nonlinear residual (Stokes system) after nonlinear iteration 2: 0.0816698

Rebuilding Stokes preconditioner…

The cluster uses OpenMPI 3.1.4 built with GNU 8.3. I make sure to run on physical cores only and with OMP_NUM_THREADS=1. I also tried exclusive nodes, to no avail. Any ideas?
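For reference, this is roughly how I launch the jobs (the binding flags below are a sketch from memory rather than my literal job script, and model.prm just stands in for my input file):

export OMP_NUM_THREADS=1
mpirun -np 128 --map-by core --bind-to core ./aspect model.prm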

Thanks,
Rob

Hi Robert,

Uh oh :frowning:

When we have seen these types of nasty random errors in the past, it is usually related to some obscure issue with the MPI version, node interconnect hardware, and/or file server.

Do you know if the file server is compatible with MPI-IO (i.e., is it a parallel file server that uses Lustre or BeeGFS)?

My first thought is that newer versions of OpenMPI are available, and in one similar case in the past, upgrading to a newer version resolved the issue. My recollection is that it had to do with OpenMPI version compatibility with the InfiniBand hardware, but it could be a number of different things in this case.

Others will likely chime in as well.

Hopefully there is a quick solution, but this may take a bit of time to pinpoint the specific issue.

Cheers,
John

From the output it looks like there is not one particular place that causes the hang (solver or postprocessing), so this might be difficult to diagnose.

A few things to try:

  • Does it happen if you run a small example in debug mode (a convection box or something)?
  • Does your system have more than one file system? If yes, try to put the output folder on a different filesystem to test
  • If you are running in debug mode you can, at least on most clusters, connect to one of the compute nodes via ssh and generate a call stack (see the sketch below). That would tell us where things are hanging.
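Something along these lines usually works, assuming your cluster allows ssh to the nodes of a running job (node name, binary name, and PID are placeholders):

ssh <compute-node>        # a node your job is running on
pidof aspect              # find the PID of one of the hung ranks
gdb -p <pid>              # attach to that rank
(gdb) where               # print the call stack
(gdb) detach
(gdb) quit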

Hi Timo and John,

  • Does it happen if you run a small example in debug mode (a convection box or something)?

It does not appear to happen with convection-box; I ran it distributed across nodes in several configurations multiple times without issues.

  • Does your system have more than one file system? If yes, try to put the output folder on a different filesystem to test

Our cluster only uses NFS4.1

I’ll try OpenMPI 4 to see if that resolves the issue. Or would you suggest MPICH, since we are using NFS?

Thanks,
Rob

One potential issue might be parallel MPI I/O. You can disable it by setting “Number of grouped files = 0”; see “Postprocess” in the ASPECT 2.5.0-pre manual.
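For reference, a minimal sketch of the relevant prm snippet (this assumes the grouped output comes from the visualization postprocessor):

subsection Postprocess
  subsection Visualization
    set Number of grouped files = 0
  end
end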

I have no preference for OpenMPI vs MPICH. Try different things and see if it helps…

Our cluster only uses NFS4.1

One potential issue might be parallel MPI I/O. You can disable it by setting “Number of grouped files = 0”; see “Postprocess” in the ASPECT 2.5.0-pre manual.

Yep, this is the issue a number of us encountered on a cluster that had NFS and a version of MPI (OpenMPI) that did not include MPI I/O.

The solution that finally resolved the random segmentation faults was using a version of OpenMPI with MPI I/O, but this is not standard procedure (i.e., not recommended) and can lead to other issues.

I agree with Timo that the best solution would be to disable ASPECT’s use of MPI I/O via “Number of grouped files = 0”.

Cheers,
John

Setting Number of grouped files = 0 did not resolve the issue. Here is the call stack for the stuck process (this is outside of my wheelhouse):

(gdb) where
#0 0x00007f5eb8afb3c6 in mca_btl_vader_component_progress () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/openmpi/mca_btl_vader.so
#1 0x00007f5ecc02206c in opal_progress () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libopen-pal.so.40
#2 0x00007f5ecc0287d5 in ompi_sync_wait_mt () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libopen-pal.so.40
#3 0x00007f5ece8975b9 in ompi_request_default_wait () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#4 0x00007f5ece8ec8e3 in ompi_coll_base_sendrecv_actual () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#5 0x00007f5ece8eccbc in ompi_coll_base_allreduce_intra_recursivedoubling () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#6 0x00007f5ece8aded6 in PMPI_Allreduce () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#7 0x00007f5edbce2c19 in Epetra_MpiComm::MaxAll (this=, PartialMaxs=, GlobalMaxs=, Count=)
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_MpiComm.cpp:161
#8 0x00007f5edbc205aa in Epetra_CrsGraph::ComputeIndexState (this=this@entry=0x8f5f2f0)
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_BlockMap.h:770
#9 0x00007f5edbc251dd in Epetra_CrsGraph::MakeIndicesLocal(Epetra_BlockMap const&, Epetra_BlockMap const&) ()
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_CrsGraph.cpp:1774
#10 0x00007f5edbc25a16 in Epetra_CrsGraph::FillComplete(Epetra_BlockMap const&, Epetra_BlockMap const&) ()
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_CrsGraph.cpp:979
#11 0x00007f5efced86ac in dealii::TrilinosWrappers::SparsityPattern::compress (this=0xa6d5cc0) at /home/rmoucha/bin/tmp/unpack/deal.II-v9.4.2/source/lac/trilinos_sparsity_pattern.cc:738
#12 0x00007f5efca458bf in dealii::BlockSparsityPatternBase<dealii::TrilinosWrappers::SparsityPattern>::compress (this=0x7fff4c610b00)
at /home/rmoucha/bin/tmp/unpack/deal.II-v9.4.2/source/lac/block_sparsity_pattern.cc:172
#13 0x00000000043768cf in aspect::Simulator<3>::setup_system_matrix (this=0x7fff4c611700, system_partitioning=…) at /home/rmoucha/opt/aspect/source/simulator/core.cc:1143
#14 0x0000000004373723 in aspect::Simulator<3>::start_timestep (this=0x7fff4c611700) at /home/rmoucha/opt/aspect/source/simulator/core.cc:633
#15 0x0000000004370d18 in aspect::Simulator<3>::run (this=0x7fff4c611700) at /home/rmoucha/opt/aspect/source/simulator/core.cc:2030
#16 0x0000000002b55f23 in run_simulator<3> (raw_input_as_string=…, input_as_string=…, output_xml=false, output_plugin_graph=false, validate_only=false)
at /home/rmoucha/opt/aspect/source/main.cc:592
#17 0x0000000002b2f63f in main (argc=2, argv=0x7fff4c615918) at /home/rmoucha/opt/aspect/source/main.cc:784

Here is a different hang point:

#0 0x00007fb61a19ef0e in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1 0x00007fb604d3529d in ompi_coll_libnbc_progress ()
from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/openmpi/mca_coll_libnbc.so
#2 0x00007fb61956c06c in opal_progress () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libopen-pal.so.40
#3 0x00007fb6195727d5 in ompi_sync_wait_mt () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libopen-pal.so.40
#4 0x00007fb61bde15b9 in ompi_request_default_wait () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#5 0x00007fb61be368e3 in ompi_coll_base_sendrecv_actual () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#6 0x00007fb61be36cbc in ompi_coll_base_allreduce_intra_recursivedoubling ()
from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#7 0x00007fb61bdf7ed6 in PMPI_Allreduce () from /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib/libmpi.so.40
#8 0x00007fb62922cc19 in Epetra_MpiComm::MaxAll (this=, PartialMaxs=,
GlobalMaxs=, Count=)
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_MpiComm.cpp:161
#9 0x00007fb62916a5aa in Epetra_CrsGraph::ComputeIndexState (this=this@entry=0x1a445970)
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_BlockMap.h:770
#10 0x00007fb62916f1dd in Epetra_CrsGraph::MakeIndicesLocal(Epetra_BlockMap const&, Epetra_BlockMap const&) ()
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_CrsGraph.cpp:1774
#11 0x00007fb62916fa16 in Epetra_CrsGraph::FillComplete(Epetra_BlockMap const&, Epetra_BlockMap const&) ()
at /home/rmoucha/bin/tmp/unpack/Trilinos-trilinos-release-13-2-0/packages/epetra/src/Epetra_CrsGraph.cpp:979
#12 0x00007fb64a4226ac in dealii::TrilinosWrappers::SparsityPattern::compress (this=0x10409f60)
at /home/rmoucha/bin/tmp/unpack/deal.II-v9.4.2/source/lac/trilinos_sparsity_pattern.cc:738
#13 0x00007fb649f8f8bf in dealii::BlockSparsityPatternBase<dealii::TrilinosWrappers::SparsityPattern>::compress (
this=0x7ffd2e4471d0) at /home/rmoucha/bin/tmp/unpack/deal.II-v9.4.2/source/lac/block_sparsity_pattern.cc:172
#14 0x00000000043768cf in aspect::Simulator<3>::setup_system_matrix (this=0x7ffd2e447dd0, system_partitioning=…)
at /home/rmoucha/opt/aspect/source/simulator/core.cc:1143
#15 0x0000000004373723 in aspect::Simulator<3>::start_timestep (this=0x7ffd2e447dd0)
at /home/rmoucha/opt/aspect/source/simulator/core.cc:633
#16 0x0000000004370d18 in aspect::Simulator<3>::run (this=0x7ffd2e447dd0)
at /home/rmoucha/opt/aspect/source/simulator/core.cc:2030
#17 0x0000000002b55f23 in run_simulator<3> (raw_input_as_string=…, input_as_string=…, output_xml=false,
output_plugin_graph=false, validate_only=false) at /home/rmoucha/opt/aspect/source/main.cc:592
#18 0x0000000002b2f63f in main (argc=2, argv=0x7ffd2e44bfe8) at /home/rmoucha/opt/aspect/source/main.cc:784

The call stacks you produced are inside Trilinos, but this does not necessarily mean the bug is there: if one process does something different, every other process will hang at the next communication step, which is very likely inside a linear algebra routine, as you see here.
Are you using an unmodified prm file that ships with ASPECT (if so, which one)? What is the minimum number of MPI ranks at which you see the hangs?
It is unlikely that we can reproduce your case, but we can try. Otherwise, you will need to do more experiments:

  1. Try to simplify the case (fewer MPI ranks, smaller prm file)
  2. Try to produce the call stacks of all MPI ranks for a specific run (like you did above); likely all except one of them are inside Trilinos and will look identical. See the sketch below for one way to collect them.
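One way to grab the stacks of all ranks on a node in one go is a small loop like this (a sketch that assumes gdb is installed on the compute nodes and the binary is named aspect):

for pid in $(pgrep aspect); do
  echo "=== PID $pid ==="
  gdb -batch -p "$pid" -ex "bt"
done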