For the following, let me state a conflict of interest I have: I am one
of the authors of ASPECT, and as a consequence it is not unreasonable to
assume that I will seek fault for cases like yours elsewhere. But (i)
I’m a professional, and (ii) you’re also a friend, so I’m going to try
and explain my experience with these kinds of things, and not my hopes.

I’ve been doing software development on clusters for about 20 years now.
It’s a frustrating business, precisely because debugging is so
complicated and because it’s already bad enough to find bugs in a single
program doing its thing. Random crashes are particularly pernicious.

I have had maybe ten issues like yours over these years. I can remember
one where my computations aborted after several days of running, and
this turned out to be a bear to debug – the reason ultimately was that
the MPI system I was using only allowed me to create 64k MPI
communicators and after that simply aborted. This happened after several
thousand time steps.

There may have been one or two other issues that I could trace back to
an actual bug in code I had written, or that I had control of. But the
majority of these issues came down to either hardware problems, or to
incompatibilities in the software layer – say, the MPI library was
built against one Infiniband driver in the operating system, but that
driver had been updated in some slightly incompatible way during a
recent OS update. In the majority of these cases, I can’t say that I
ever managed to find out definitively what caused the problem. In all of
these cases, I never found a way to change my code to make things work.

There is really very little one can do in these situations. Sometimes it
helps to just recompile every piece of software one builds on from
scratch after a recent OS update. Sometimes, I found that I just can’t
work on a cluster. That’s really frustrating. I believe that most of us
doing work on clusters share these kinds of experiences.

I wish I could tell you what to do. It is possible that the bug
lies in ASPECT and that someone will eventually find it, though my
experience over the years, looking at the symptoms you describe,
suggests that it is more likely that (i) the root cause is not actually
in ASPECT itself, and (ii) that we will never find out what the cause
actually is. The only way to address this problem systematically is to
make the testcase small enough and to find a way to make it
reproducible, at least on the same machine.