Okay. I am running it on my machine and will let you know if it crashes at some point.
So far it is working for me:
*** Timestep 619: t=2.28648e+07 years
How’s your test run going? I had one run out of time, a node died on another, and the third one stopped with no information. I am re-running now with more processors and a longer wall-clock time (this is on the newer cluster).
It ran to completion without errors.
Okay, so here is something interesting… I’m running two models that are identical except that one is using Aspect compiled in release mode and the other compiled in debug mode. The one in debug mode just had a Segmentation Fault (same as before… in the Rebuilding Preconditioner step). The one in release mode is already past the same point and is still running. So, the easy solution is not to run in debug mode unless I’m really debugging. The question is whether whatever “bug” this is ends up being a problem down the line even in release mode. Any idea of what to do next? In a smaller program, this is where I’d put in “printf statements”
Your descriptions sound a lot like problems with the cluster or the
system software in itself, rather than with the software you are running:
- The point where the software fails is non-reproducible
- The code that does more checking (debug mode) fails whereas the
code that does less checking (release mode) does not fail
My suspicion is that always using release mode will only yield different
kinds of problems, but not be a reliable substitute.
These issues are really awkward. In similar cases in the past, I’ve
tried to run the same simulation (and same code) multiple times to see
whether there is a statistical pattern to what is going wrong. It’s
about the only suggestion I can make for how to attack the issue.
I don’t think this is necessarily cluster dependent; I’ve now run it on two different clusters. Also, although the exact time-step at which it dies is not reproducible (although I need to try running it again with the same number of threads to test that), it always dies in exactly the same part of the code… rebuilding the preconditioner.
I am suspicious that there is a bug in the material module that I’m using… visco plastic. Am I correct in thinking that when rebuilding the preconditioner, it needs the viscosity (the stiffness matrix)? I’m guessing that there are not so many people using this specific module, so there could be some bug that I’m just now finding because I’m trying to really use the module with all its bells and whistles (like the composition-dependent rheology).
Also, I was always taught that a segmentation fault basically means that you are trying to grab information from somewhere in memory that you shouldn’t… so somewhere you are counting wrong. Is that still true? Is it a good way to think about it or not (in terms of trying to track down the problem)?
I am trying to think of ways to change the problem so that it still basically runs the same, but might help us figure out what is triggering the problem. That is more difficult, though.
Fundamentally, a “segmentation fault” means that the program is accessing a
memory location to which it doesn’t have access. There are many reasons why
this can happen:
One could try to read memory using a pointer that has not been initialized
and that consequently points to memory that doesn’t exist, or that is owned by
someone else.
One could try to read memory using a pointer that used to point to valid
memory, but where the pointer was overwritten with invalid values at some
point during the program’s run. This is often very difficult to track down
because the problem doesn’t manifest where the bug actually happened: Some
part of the program may be writing past the end of an array, for example, and
thereby overwrite the address stored in a pointer located behind the end of
the array. But you will only notice next time you use this pointer to read
from somewhere, which may happen many cycles later.
One could try to read memory using a pointer that looks like it has been
properly initialized using, for example, malloc(). But if no memory is
available, then malloc() simply returns a NULL pointer and writing to the
address then stored in the pointer (NULL) will result in a segmentation fault.
The reason why no memory is available may or may not have anything to do with
the currently running program: It may also be due to the fact that other large
programs are currently running on the same machine, or that a bug in a
software layer like MPI, the C++ runtime layer, Trilinos, p4est, or some other
involved part has led to the exhaustion of available memory.
None of this is easy to track down if the problem happens deep inside the MPI
implementation (and called via Trilinos) as in your case, and if one can’t
attach a debugger.
I doubt that this is the case. I assume it is likely a bug inside Trilinos or something specific to your cluster. Sadly.
Okay. So I re-ran the same model again, and it dies again but at an earlier time-step. This means that I essentially cannot do my research, because I cannot run even the most basic 2-D subduction model on the two clusters that I have available to me.
So, what do I do now? Wolfgang mentioned not being able to attach a debugger… is that the case? Could I in principle attach a debugger? (I’m sorry, I’ve never learned about using debuggers; I never had to before.) If not, what other options do I have? I am very frustrated at having my entire research program grind to a halt.
I am rerunning your problem in debug mode now. What is the smallest number of cores where it fails for you?
I’m not sure about the smallest size because I ran into wall-clock issues on cluster-2 when I used fewer cores. It is consistently failing on 96 cpus. AND it fails in both release and debug mode.
I am more than willing to run other tests of any kind, but I don’t have the expertise to know which tests to do. On my older cluster I can run for as long as I want, so I can set up runs on fewer cores; it’ll just take a while longer because it’s an older machine.
Does it make sense to try compiling with different versions of gcc or openmpi?
Magali - Are these models crashing on both the Peloton and ymir clusters at Davis?
Prior to upgrading the OS, all of us (Juliane, Rene, myself) had unexpected segmentation faults at random times. I can’t recall if it was in the pre-conditioner, but I was never able to resolve the problem after trying all sorts of different compilers and build options. I never ran into the same issue on ymir.
I think there are multiple options going forward:
- Try running the model on the new Peloton AMD nodes.
- Try running the model on Stampede2 or Comet.
Let’s talk later this afternoon and get those models running on one or all three of the clusters.
Can you check whether the bug happens with fewer refinements and fewer cores? Whatever you can figure out to help us get to the solution quicker would help immensely.
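(For example, the model could be shrunk via the “Mesh refinement” parameters in the .prm file; the values below are placeholders to be lowered step by step until the crash either disappears or becomes quick to reproduce:)

```
# Hypothetical fragment of the model's parameter file -- the actual values
# in use will differ; lower them incrementally to shrink the problem.
subsection Mesh refinement
  set Initial global refinement   = 3   # e.g. down from a larger value
  set Initial adaptive refinement = 0   # disable AMR while bisecting the bug
end
```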
For the following, let me state a conflict of interest I have: I am one
of the authors of ASPECT, and as a consequence it is not unreasonable to
assume that I will seek fault for cases like yours elsewhere. But (i)
I’m a professional, and (ii) you’re also a friend, so I’m going to try
and explain my experience with these kinds of things, and not my hopes.
I’ve been doing software development on clusters for about 20 years now.
It’s a frustrating business, precisely because debugging is so
complicated and because it’s already bad enough to find bugs in a single
program doing its thing. Random crashes are particularly pernicious.
I have had maybe ten issues like yours over these years. I can remember
one where my computations aborted after several days of running, and
this turned out to be a bear to debug – the reason ultimately was that
the MPI system I was using only allowed me to create 64k MPI
communicators and after that simply aborted. This happened after several
thousand time steps.
There may have been one or two other issues that I could trace back to
an actual bug in code I had written, or that I had control of. But the
majority of these issues came down to either hardware problems, or to
incompatibilities in the software layer – say, the MPI library was
built against one Infiniband driver in the operating system, but that
driver had been updated in some slightly incompatible way during a
recent OS update. In the majority of these cases, I can’t say that I
ever managed to find out definitively what caused the problem. In all of
these cases, I never found a way to change my code to make things work.
There is really very little one can do in these situations. Sometimes it
helps to just recompile every piece of software one builds on from
scratch after a recent OS update. Sometimes, I found that I just can’t
work on a cluster. That’s really frustrating. I believe that most of us
doing work on clusters share these kinds of experiences.
I wished that I could tell you what to do. It is possible that the bug
lies in ASPECT and that someone will eventually find it, though my
experience over the years, looking at the symptoms you describe,
suggests that it is more likely that (i) the root cause is not actually
in ASPECT itself, and (ii) that we will never find out what the cause
actually is. The only way to address this problem in some kind of
systematic way is to make the testcase small enough and to find a way to
make it reproducible, at least on the same machine.
Thanks Wolfgang for explaining more about your experience. I think I ended up frustrated because I really did not understand what you all meant by “it might be the cluster”. This could actually mean a lot of things, it seems (the operating system, how other libraries were compiled, or maybe the hardware itself).
I’ve been running mantle convection models using Citcom for a long time, so my experience with debugging is limited to a code that was much more self-contained. Aspect is very different, and it means that I’m really a novice when it comes to this kind of problem. I understand more now that you all have explained things, and I just talked with John Naliboff as well. I now understand why it’s not likely that this error is in Aspect, but rather in something that it is built on (because the error after the segmentation fault indicates something maybe in Trilinos or that Trilinos is using). John also explained that this is why using a debugger won’t really help (e.g., if the problem is not in Aspect, we won’t really be able to figure out why a particular variable in Trilinos is suddenly NaN…)
Clearly demonstrating that this is an issue with just these two clusters, which are set up in very similar (or exactly the same) ways, is a good first step. John is getting me access to Stampede2 through the CIG account so I can run it there. I am also re-running the model on a smaller number of cpus; in particular, I want to run it on a single node on the two local clusters, as this could give some indication of where to look for problems with the cluster or things compiled on the cluster. Once I do that, I will talk with the sys-admin about potential issues on the cluster.
Since I only have access to these two clusters, is there anything else you’d suggest trying? Should I try compiling with different versions of MPI or gcc, or is that likely not a fruitful path to go down? Any other suggestions that might help to indicate the issue? Any suggestions for what to ask the sys admin? The OS on both of these clusters was recently upgraded… should I ask if all the drivers were recompiled/updated as well? Anything else?
Thanks for your help,
Before you do that, can you just post what you have been using (MPI, gcc, trilinos, deal.II, …)? It also makes sense to make the problem smaller before trying to recompile things.
Can you post the detailed.log from your ASPECT build directories?
Here are the detailed.log files from both clusters (ymir (old) and peloton (newer)).
They are almost identical except for the version of openmpi.
ymir_build_release_detailed.log (1.1 KB)
peloton_build_release_detailed.log (1.12 KB)
thanks, that looks reasonable (and new).
The only thing that is weird is that the COMPILE_FLAGS are empty. Who knows why.
Who installed the MPI library? Are others using the same machine for ASPECT so they might have run into the same problem?