Hi,
I’m trying to run ASPECT models on my university’s cluster. After some run time, the models freeze and terminate. I have attached the output files and the job script. Could this be a problem with running on multiple nodes, or with assigning more CPUs than the model needs?
stderror-279914.txt (32.2 KB)
stdout-279914.txt (2.7 MB)
C5_3.txt (436 Bytes)
Best,
Chameera
@chameerasilva Hard to tell – the errors are uninformative. If you run the same input file multiple times, does it always stop in the same place/with the same error?
Best
W.
I will run the same file again and report back. But all the input files I ran terminated at random time steps with the same error.
Best,
Chameera
@chameerasilva If it’s at different times/different locations in the code every time, then it may simply be that you either run out of memory or out of your time allocation. Since you submit through the SLURM sbatch system, there is a time limit for how long your job is allowed to run, at which point it will be killed. In your job script, you have --time=3-00:00:00, which equates to 3 days. I don’t know whether your models actually run this long; even if they don’t, it is possible that your system administrators have set a smaller time limit for jobs, in which case the job will be killed after that amount of time.
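You can check which limit actually applies with the standard SLURM tools, roughly like this (the partition name is a placeholder, and the job ID is only taken from the attached file names):

```bash
# Maximum run time the partition allows (%l is the time limit column):
sinfo -p <partition_name> -o "%P %l"

# Time limit that was applied to this particular job, and how long it
# actually ran (works after the job has finished, provided job
# accounting is enabled on the cluster):
sacct -j 279914 --format=JobID,Timelimit,Elapsed,State,ExitCode
```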
Best
W.
The partition I used has a time limit of 28 days. I will also check the memory allocation.
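To see whether the jobs are hitting the memory limit, I will look at the accounting data for a failed run, something like this (a sketch, assuming SLURM accounting is enabled on the cluster; the job ID is from the attached files):

```bash
# Requested memory vs. peak memory actually used by the job steps.
# A State of OUT_OF_MEMORY, or a MaxRSS close to ReqMem, would point
# to a memory problem rather than a time limit.
sacct -j 279914 --format=JobID,ReqMem,MaxRSS,State,ExitCode
```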
Thank you.
Best,
Chameera
Hi,
I found this issue: Possible MPI Deadlock in GMG preconditioner · Issue #4984 · geodynamics/aspect · GitHub. I ran my models with block AMG instead, and they no longer freeze or terminate. Is there an installation setup that solves the issue for block GMG? The solver setting I switched is shown after the version information below. My current setup is:
– . version 2.6.0-pre (main, 445fc5d4f)
– . using deal.II 9.5.1
– . with 64 bit indices and vectorization level 3 (512 bits)
– . using Trilinos 13.2.0
– . using p4est 2.3.2
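For reference, this is roughly the change I made in the input file (a sketch only; the file name C5_3.prm is a guess, and the exact subsection nesting may differ between ASPECT versions):

```bash
# Show which Stokes solver the parameter file selects; the GMG deadlock
# in the linked issue should only matter when it is set to "block GMG".
# My workaround was switching it to:
#
#   subsection Solver parameters
#     subsection Stokes solver parameters
#       set Stokes solver type = block AMG
#     end
#   end
grep -n -A 4 "Stokes solver parameters" C5_3.prm
```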
Best,
Chameera
@chameerasilva I see no reason to believe that the problem mentioned in the github issue is the one you are seeing. It may be, but it may also not be. I mentioned a number of things above that are worth figuring out first:
- Does the program always stop at the same place?
- Does it actually run for the entire 3 days, or does it stop earlier? (You can look at the time stamp of output files, for example.)
- Does it run on the front end node?
Your goal for now should be to find more information about what is happening and what the systematic factors are that lead to the crash. There are probably 100 deal.II, Trilinos, p4est, and ASPECT github issues that describe deadlocks or other MPI problems, but in all likelihood none of them is your issue. You need more information.
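A few quick checks along these lines (the job ID, output directory, and parameter file name below are only examples; adjust them to your run):

```bash
# (a) Did the job actually run for the full 3 days, or stop earlier?
#     Compare the job's elapsed time with the time stamps of the output files.
sacct -j 279914 --format=JobID,State,Elapsed,End
ls -lt output/ | head

# (b) Where in the time stepping did it stop? Compare several failed runs.
tail -n 40 stdout-279914.txt

# (c) Does a small version of the model run on the front-end node?
#     (your_model.prm is a placeholder for the actual parameter file.)
mpirun -np 4 ./aspect your_model.prm
```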
Best
W.