Segmentation fault while rebuilding preconditioner

Hello,
I just updated and recompiled Aspect last week, and I decided to rerun a model that I knew
was working last fall, when I last got to work on Aspect things. Unfortunately, I get a segmentation fault during the rebuilding of the preconditioner at time step 726 (see below). This is not the first time I’ve had a similar segmentation fault… the same thing happened many months ago when I had my own version of the visco-plastic module, so I stopped using that; this model just uses what is in the main branch of Aspect. In that case, the crash happened at different points in the model run and did not always happen (ah yes, this is the best kind of bug!).

Can I get some advice/help to figure out the source of the segmentation fault?
I’ve attached the parameter file and the stdout/stderr file with the segfault error, and below is the snippet from the log file showing where the segmentation fault occurs.

test1.prm (12.0 KB)

test1-604-c12-7.txt (381.1 KB)

*** Timestep 726: t=2.85897e+07 years

Solving temperature system… 8 iterations.

Solving spcrust system … 6 iterations.

Solving spharz system … 6 iterations.

Solving opcrust system … 3 iterations.

Solving opharz system … 4 iterations.

Rebuilding Stokes preconditioner…

[c12-11:30781] *** Process received signal ***

[c12-11:30781] Signal: Segmentation fault (11)

[c12-11:30781] Signal code: (128)

[c12-11:30781] Failing at address: (nil)

[c12-11:30781] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f544d8cb890]

[c12-11:30781] [ 1] /share/apps/openmpi-3.1.0/gcc7/lib/openmpi/mca_btl_vader.so(+0x4780)[0x7f54356da780]

[c12-11:30781] [ 2] /share/apps/openmpi-3.1.0/gcc7/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f54475995bc]

[c12-11:30781] [ 3] /share/apps/openmpi-3.1.0/gcc7/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f54475a0055]

[c12-11:30781] [ 4] /share/apps/openmpi-3.1.0/gcc7/lib/libmpi.so.40(ompi_request_default_wait+0x1e7)[0x7f544db23507]

[c12-11:30781] [ 5] /share/apps/openmpi-3.1.0/gcc7/lib/libmpi.so.40(PMPI_Wait+0x61)[0x7f544db6b231]

[c12-11:30781] [ 6] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(ML_Comm_CheapWait+0x36)[0x7f544a684f46]

[c12-11:30781] [ 7] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(ML_exchange_bdry+0x237)[0x7f544a68a0a7]

[c12-11:30781] [ 8] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(CSR_matvec+0x211)[0x7f544a6dcb41]

[c12-11:30781] [ 9] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(ML_Operator_ApplyAndResetBdryPts+0x31)[0x7f544a6c9cf1]

[c12-11:30781] [10] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(ML_Cycle_MG+0x41f)[0x7f544a67bc8f]

[c12-11:30781] [11] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(ML_Cycle_MG+0x556)[0x7f544a67bdc6]

[c12-11:30781] [12] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(ML_Cycle_MG+0x556)[0x7f544a67bdc6]

[c12-11:30781] [13] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libml.so.12(_ZNK9ML_Epetra24MultiLevelPreconditioner12ApplyInverseERK18Epetra_MultiVectorRS1_+0x822)[0x7f544a740ba2]

[c12-11:30781] [14] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libaztecoo.so.12(Epetra_Aztec_precond+0x208)[0x7f544ecb9c08]

[c12-11:30781] [15] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libaztecoo.so.12(AZ_pcg_f+0x8c4)[0x7f544ece3544]

[c12-11:30781] [16] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libaztecoo.so.12(AZ_oldsolve+0x4de)[0x7f544ed08bde]

[c12-11:30781] [17] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libaztecoo.so.12(AZ_iterate+0x147)[0x7f544ed097f7]

[c12-11:30781] [18] /share/apps/cig/dealii/v9.0.0/install/trilinos-release-12-10-1/lib/libaztecoo.so.12(_ZN7AztecOO7IterateExd+0xdc)[0x7f544ecb43ac]

[c12-11:30781] [19] /share/apps/cig/dealii/v9.0.0/install/deal.II-v9.0.0/lib/libdeal_II.g.so.9.0.0(_ZN6dealii16TrilinosWrappers10SolverBase8do_solveINS0_16PreconditionBaseEEEvRKT_+0xc1)[0x7f545693f6c9]

[c12-11:30781] [20] /share/apps/cig/dealii/v9.0.0/install/deal.II-v9.0.0/lib/libdeal_II.g.so.9.0.0(_ZN6dealii16TrilinosWrappers10SolverBase5solveERKNS0_12SparseMatrixERNS0_3MPI6VectorERKS6_RKNS0_16PreconditionBaseE+0x8d)[0x7f545693d1dd]

[c12-11:30781] [21] /home/billen/AspectProjects/aspect/build/aspect(_ZNK6aspect8internal24BlockSchurPreconditionerIN6dealii16TrilinosWrappers15PreconditionAMGENS3_16PreconditionBaseEE5vmultERNS3_3MPI11BlockVectorERKS8_+0x2b5)[0x55adcfff0711]

[c12-11:30781] [22] /home/billen/AspectProjects/aspect/build/aspect(_ZN6dealii12SolverFGMRESINS_16TrilinosWrappers3MPI11BlockVectorEE5solveIN6aspect8internal11StokesBlockENS7_24BlockSchurPreconditionerINS1_15PreconditionAMGENS1_16PreconditionBaseEEEEEvRKT_RS3_RKS3_RKT0_+0x5a9)[0x55adcfff1851]

[c12-11:30781] [23] /home/billen/AspectProjects/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE12solve_stokesEv+0x1bb6)[0x55adcfff477c]

[c12-11:30781] [24] /home/billen/AspectProjects/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE25assemble_and_solve_stokesEbPd+0x9f)[0x55adcfff8fbb]

[c12-11:30781] [25] /home/billen/AspectProjects/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE36solve_single_advection_single_stokesEv+0x5c)[0x55adcfffa442]

[c12-11:30781] [26] /home/billen/AspectProjects/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE14solve_timestepEv+0x168)[0x55adcfedf082]

[c12-11:30781] [27] /home/billen/AspectProjects/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x1019)[0x55adcff00de9]


Magali,

There is nothing suspicious in the logs that would explain the crash. We can try to run the same problem to see what is going on.

Does this happen with an identical setup at different times? Can you rerun it a couple of times and report when it crashes?

A few comments:

-- . running in DEBUG mode
-- . running with 96 MPI processes
Number of degrees of freedom: 25,428 (4,290+561+2,145+4,608+4,608+4,608+4,608)

  1. You are losing out on a lot of performance by using debug mode. Generally speaking, any parallel run, or anything that runs longer than a few minutes, should be done in release mode. Of course, in this situation (a crash) debug mode is helpful, because now we know that nothing else is going wrong.
  2. For 25,000 DoFs, using 96 MPI tasks is overkill. This would probably run much faster on a handful of tasks. We might also be running into problems because of that.

Oh, I just saw that you end up with 6 million DoF at the end. I take back the “overkill” comment.

Your computation runs for 33+ hours. What is the maximum allowed time on the cluster? What did you use in the job submission as maximum wall time?

Timo,
Yes, I figured out after starting it that it is a very small model (only 250 MB of memory needed), but I had run it that way previously, so I just let it go. This is my really old (10-year-old) cluster.

Yes, I’ll switch to release mode… I just wanted to make sure everything was working… which it isn’t.

I reran the same model on a much newer cluster (3 years old) over the weekend and it also dies with a segmentation fault but at an earlier time-step. This was run on just 16 threads.

I will rerun it a few more times with fewer processors and in release mode to see if any
of that makes a difference and report back.
-Magali

Oops… I just saw your other comments. My old cluster has no time limit, so there was nothing external that shut off the calculation. I also checked to make sure that none of the nodes died - they did not. Also, now we know the same error happens on the newer cluster.

Here’s the output from the newer cluster (for completeness).
-Magali
test1-1486784-c5-70.txt (121.5 KB)

Okay. I am running it on my machine and will let you know if it crashes at some point.

So far it is working for me:

*** Timestep 619:  t=2.28648e+07 years

How’s your test run going? I had one run out of time, a node died on another, and the third one stopped with no information. :frowning: I am re-running now with more processors and a longer wall-clock time (this is on the newer cluster).

It ran to completion without errors.

Okay, so here is something interesting… I’m running two models that are identical, except that one uses Aspect compiled in release mode and the other uses Aspect compiled in debug mode. The one in debug mode just had a segmentation fault (same as before… in the Rebuilding Stokes preconditioner step). The one in release mode is already past the same point and is still running. So, the easy solution is to not run in debug mode unless I’m really debugging. The question is whether whatever “bug” this is ends up being a problem down the line even in release mode. Any idea of what to do next? In a smaller program, this is where I’d put in “printf statements” :slight_smile:

Your descriptions sound a lot like problems with the cluster or the
system software itself, rather than with the software you are running:

  • The point where the software fails is non-reproducible
  • The code that does more checking (debug mode) fails whereas the
    code that does less checking (release mode) does not fail

My suspicion is that always using release mode will only yield different
kinds of problems, but not be a reliable substitute.
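
To illustrate why the two modes can behave so differently, here is a minimal
sketch in plain C++ (nothing ASPECT-specific; the function and variable names
are made up): release builds are typically compiled with NDEBUG defined, which
removes assert() checks entirely, so a bug that is caught immediately in debug
mode can pass silently in release mode.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: an off-by-one index into a field of material data.
double last_value(const std::vector<double> &values)
{
  const auto i = values.size();   // bug: should be values.size() - 1

  // In a debug build this assertion aborts with a clear message.
  // In a release build (compiled with -DNDEBUG) the check is compiled away,
  // and the out-of-range read may return garbage, crash, or appear to work.
  assert(i < values.size());

  return values[i];
}

int main()
{
  const std::vector<double> viscosities = {1e21, 1e22, 1e23};
  return last_value(viscosities) > 0.0 ? 0 : 1;
}
```

The point is not that this particular bug exists in your setup, only that a
run that “works” in release mode does not prove the underlying problem is gone.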

These issues are really awkward. In similar cases in the past, I’ve
tried to run the same simulation (and same code) multiple times to see
whether there is a statistical pattern to what is going wrong. It’s
about the only suggestion I can make for how to attack the issue.

Best
W.

I don’t think this is necessarily cluster-dependent; I’ve now run it on two different clusters. Also, although the exact time step at which it dies is not reproducible (I need to try running it again with the same number of threads to test that), it always dies in exactly the same part of the code… rebuilding the preconditioner.

I am suspicious that there is a bug in the material module that I’m using… visco-plastic. Am I correct in thinking that when rebuilding the preconditioner, it needs the viscosity (the stiffness matrix)? I’m guessing that there are not so many people using this specific module, so there could be some bug that I’m just now finding because I am trying to really use the module with all its bells and whistles (like the composition-dependent rheology).

Also, I was always taught that a segmentation fault basically means that you are trying to grab information from somewhere in memory that you shouldn’t… so somewhere you are counting wrong. Is that still true? Is it a good way to think about it (in terms of trying to track down the problem)?

I am also trying to think of ways to change the problem so that it still runs basically the same but might help us figure out what is triggering the problem; that is more difficult, though.

Fundamentally, a “segmentation fault” means that the program is accessing a
memory location to which it doesn’t have access. There are many reasons why
this can happen:

  • One could try to read memory using a pointer that has not been initialized
    and that consequently points to memory that doesn’t exist, or that is owned by
    another program.

  • One could try to read memory using a pointer that used to point to valid
    memory, but where the pointer was overwritten with invalid values at some
    point during the program’s run. This is often very difficult to track down
    because the problem doesn’t manifest where the bug actually happened: Some
    part of the program may be writing past the end of an array, for example, and
    thereby overwrite the address stored in a pointer located behind the end of
    the array. But you will only notice next time you use this pointer to read
    from somewhere, which may happen many cycles later.

  • One could try to read memory using a pointer that looks like it has been
    properly initialized using, for example, malloc(). But if no memory is
    available, then malloc() simply returns a NULL pointer, and reading from or
    writing to the address then stored in the pointer (NULL) will result in a
    segmentation fault. The reason why no memory is available may or may not
    have anything to do with the currently running program: It may also be due
    to the fact that other large programs are currently running on the same
    machine, or that a bug in a software layer like MPI, the C++ runtime layer,
    Trilinos, p4est, or some other involved part has led to the exhaustion of
    available memory. (The small sketch below illustrates the last two cases.)
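
To make the second and third cases concrete, here is a small, self-contained
sketch in plain C++ (the struct and variable names are invented; this is not
code from ASPECT, deal.II, or Trilinos):

```cpp
#include <cstdio>
#include <cstdlib>

int main()
{
  // Case 2: writing past the end of an array overwrites whatever happens to
  // sit next to it in memory -- here, a pointer stored right after the array.
  struct Data
  {
    int     values[4];
    double *results;   // typically laid out in memory right after 'values'
  };

  Data data;
  data.results = static_cast<double *>(std::malloc(4 * sizeof(double)));

  for (int i = 0; i <= 4; ++i)   // bug: should be 'i < 4'
    data.values[i] = 0;          // i == 4 scribbles over part of 'results'

  // The crash (if any) only shows up later, far away from the actual bug:
  data.results[0] = 1.0;         // may segfault here, many lines later

  // Case 3: malloc() returns NULL when no memory is available; using the
  // result without checking it segfaults at the first access.
  const std::size_t huge = static_cast<std::size_t>(1) << 62;  // absurd request
  double *big = static_cast<double *>(std::malloc(huge));
  if (big == nullptr)
    std::printf("allocation failed -- dereferencing 'big' would segfault\n");

  std::free(big);           // freeing a NULL pointer is allowed and does nothing
  std::free(data.results);
  return 0;
}
```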

None of this is easy to track down if the problem happens deep inside the MPI
implementation (and is called via Trilinos), as in your case, and if one can’t
attach a debugger :frowning:

Best
W.

I doubt that this is the case. I assume it is likely a bug inside Trilinos or something specific to your cluster. Sadly.

Okay. So I re-ran the same model again, and it dies again, but at an earlier time step. This means that I essentially cannot do my research, because I cannot run even the most basic 2-D subduction model on the two clusters that I have available to me.

So, what do I do now? Wolfgang mentioned not being able to attach a debugger… is that the case? Could I in principle attach a debugger? (I’m sorry, I’ve never learned about using debuggers; I never had to before.) If not, what other options do I have? I am very frustrated at having my entire research program grind to a halt.

I am rerunning your problem in debug mode now. What is the smallest number of cores where it fails for you?

I’m not sure about the smallest size, because I ran into wall-clock issues on cluster-2 when I used fewer cores. It is consistently failing on 96 CPUs. AND it fails in both release and debug mode.

I am more than willing to run other tests of any kind, but I don’t have the expertise to know which tests to do. On my older cluster I can run for as long as I want, so I can set up runs on fewer cores; it’ll just take a while longer because it’s an older machine.

Does it make sense to try compiling with different versions of gcc or openmpi?

Magali - Are these models crashing on both the Peloton and ymir clusters at Davis?

Prior to upgrading the OS, all of us (Juliane, Rene, myself) had unexpected segmentation faults at random times. I can’t recall whether it was in the preconditioner, but I was never able to resolve the problem after trying all sorts of different compilers and build options. I never ran into the same issue on ymir.

I think there are multiple options going forward:

  1. Try running the model on the new Peloton AMD nodes.
  2. Try running the model on Stampede2 or Comet.

Let’s talk later this afternoon and get those models running on one or all three of the clusters.