Can you check whether the bug also happens with fewer refinements and fewer cores? Anything you can figure out that gets us to the solution more quickly would help immensely.
For the following, let me state a conflict of interest I have: I am one
of the authors of ASPECT, and as a consequence it is not unreasonable to
assume that I will seek fault for cases like yours elsewhere. But (i)
I’m a professional, and (ii) you’re also a friend, so I’m going to try
and explain my experience with these kinds of things, and not my hopes.
I’ve been doing software development on clusters for about 20 years now.
It’s a frustrating business, precisely because debugging is so
complicated and because it’s already bad enough to find bugs in a single
program doing its thing. Random crashes are particularly pernicious.
I have had maybe ten issues like yours over these years. I can remember
one where my computations aborted after several days of running, and
this turned out to be a bear to debug – the reason ultimately was that
the MPI system I was using only allowed me to create 64k MPI
communicators and after that simply aborted. This happened after several
thousand time steps.
There may have been one or two other issues that I could trace back to
an actual bug in code I had written, or that I had control of. But the
majority of these issues came down to either hardware problems, or to
incompatibilities in the software layer – say, the MPI library was
built against one Infiniband driver in the operating system, but that
driver had been updated in some slightly incompatible way during a
recent OS update. In the majority of these cases, I can’t say that I
ever managed to find out definitively what caused the problem. In none of
these cases did I find a way to change my code to make things work.
There is really very little one can do in these situations. Sometimes it
helps to just recompile every piece of software one builds on from
scratch after a recent OS update. Sometimes, I found that I just can’t
work on a cluster. That’s really frustrating. I believe that most of us
doing work on clusters share these kinds of experiences.
I wish I could tell you what to do. It is possible that the bug
lies in ASPECT and that someone will eventually find it, though my
experience over the years, looking at the symptoms you describe,
suggests that it is more likely that (i) the root cause is not actually
in ASPECT itself, and (ii) that we will never find out what the cause
actually is. The only way to address this problem in a somewhat
systematic way is to make the testcase small enough and to find a way to
make it reproducible, at least on the same machine.
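
As a rough illustration of what "reproducible" could mean in practice, here is a minimal sketch that launches the same command several times and records how often it fails. The function names and the example launch line in the comments are assumptions made for illustration; they are not part of ASPECT or of any cluster's setup, and the real command depends on the local MPI launcher and scheduler.

```python
import subprocess
import sys
from collections import Counter


def run_repeatedly(command, runs):
    """Run `command` (a list of arguments) `runs` times; return the exit codes."""
    exit_codes = []
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        exit_codes.append(result.returncode)
    return exit_codes


def failure_rate(exit_codes):
    """Fraction of runs that ended with a non-zero exit code."""
    if not exit_codes:
        return 0.0
    failures = sum(1 for code in exit_codes if code != 0)
    return failures / len(exit_codes)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage (launcher, core count, and file name are placeholders):
    #   python repeat_run.py mpirun -np 16 ./aspect small_model.prm
    codes = run_repeatedly(sys.argv[1:], runs=5)
    print("exit codes:", dict(Counter(codes)))
    print("failure rate:", failure_rate(codes))
```

A failure rate near 1.0 for a small setup would give a test case worth handing to others; a rate that changes between one node and several nodes would at least narrow down where to look.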
Thanks Wolfgang for explaining more about your experience. I think I ended up frustrated because I really did not understand what you all meant by “it might be the cluster”. It seems this could actually mean a lot of things (the operating system, how other libraries were compiled, or maybe the hardware itself).
I’ve been running mantle convection models using Citcom for a long time, so my experience with debugging is limited to a code that was much more self-contained. ASPECT is very different, and it means that I’m really a novice when it comes to this kind of problem. I understand more now that you all have explained more and I have just talked with John Naliboff as well. I now understand why it’s not likely that this error is in ASPECT, but rather in something that it is built on (because the error after the segmentation fault points to something in Trilinos, or in something that Trilinos is using). John also explained that this is why using a debugger won’t really help (e.g., if the problem is not in ASPECT, we won’t really be able to figure out why a particular variable in Trilinos is suddenly NaN…)
Clearly demonstrating that this is an issue with just these 2 clusters, which are set up in a very similar (or exactly the same) way, is a good first step. John is getting me access to Stampede through the CIG account so I can run it there. I am also re-running the model on a smaller number of CPUs. In particular, I want to run it on a single node on the two local clusters, as this could give some indication of where to look for problems with the cluster or with things compiled on the cluster. Once I do that, I will talk with the sys-admin about potential issues on the cluster.
Since I only have access to these two clusters, is there anything else you’d suggest trying? Should I try compiling with different versions of MPI or gcc, or is that not likely to be a fruitful path to go down? Any other suggestions that might help to indicate the issue? Any suggestions for what to ask the sys admin? The OS on both of these clusters was recently upgraded… should I ask if all the drivers were recompiled/updated as well? Anything else?
Thanks for your help,
Before you do that, can you just post what you have been using (MPI, gcc, Trilinos, deal.II, …)? Also, it makes sense to make the problem smaller before trying to recompile things.
Can you post the detailed.log from your ASPECT build directories?
Here are the detailed.log files from both clusters (ymir (old) and peloton (newer)).
They are almost identical except for the version of openmpi.
ymir_build_release_detailed.log (1.1 KB)
peloton_build_release_detailed.log (1.12 KB)
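
For anyone who wants to check "almost identical" mechanically, the following is a minimal sketch using Python's standard `difflib` module to list only the lines that differ between two build logs. The function name `differing_lines` is an assumption for illustration, and the script makes no assumptions about what the logs contain.

```python
import difflib
import sys


def differing_lines(text_a, text_b):
    """Return only the lines that differ between two texts.

    Lines present only in the first text are prefixed with "- ",
    lines present only in the second text with "+ ".
    """
    diff = difflib.ndiff(text_a.splitlines(), text_b.splitlines())
    # ndiff also emits "  " (unchanged) and "? " (hint) lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage with the two attached files:
    #   python compare_logs.py ymir_build_release_detailed.log \
    #       peloton_build_release_detailed.log
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        for line in differing_lines(first.read(), second.read()):
            print(line)
```

Listing only the differences makes it easier to see at a glance whether anything besides the MPI version changed between the two builds.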
thanks, that looks reasonable (and new).
The only thing that is weird is that the COMPILE_FLAGS are empty. Who knows why.
Who installed the MPI library? Are others using the same machine for ASPECT so they might have run into the same problem?
I’ve used both clusters and may have run into the issue on at least one of them. I’m going to run a few tests over the weekend and see if I can reproduce the error with different models.
The OS on both machines is Ubuntu 18 (LTS, server version), and the MPI library was installed by the sysadmin, whom I talk with on a nearly daily basis. Numerous other groups at the University use the MPI library on one of the clusters, so my guess is that this is not the issue.
deal.II was installed via candi on both clusters and I don’t recall seeing anything weird during the installation process or when building ASPECT.
Any and all of these could be the cause, though no question you could ask is
likely to really shed light on what to do. It is possible that recompiling
with other compilers and/or modules works – the only way to find out is
trying, since we don’t have any other systematic ways of narrowing down the
problem, I’m afraid.
Personally, I would probably recompile things. Everything, down to p4est and
Trilinos, using the most recently installed compiler and MPI module.
The prm file ran to completion in debug and release mode on my machine. I used a smaller number of cores, though.
Thanks Timo. That’s good to know. I also have some information to add. I ran the same test model using only a single node on Peloton and it ran to completion. All the test cases that use multiple nodes end with a segmentation fault. So maybe this indicates something related to communication between nodes?
Also, I made a somewhat smaller version of this model (I cut off the bottom of the mantle): the dynamics of subduction are similar. This model ran to completion using 6 nodes (96 CPUs). So, this is indeed very finicky.
I’m going to start a conversation with our cluster IT person. I also now have
access to the CIG account Stampede2 so I can keep working on the development while we figure out what to do with our local cluster issue.