Failure on quadrants after running for a long time

Hi all,

  I am running ASPECT with this configuration:

-----------------------------------------------------------------------------

-- This is ASPECT, the Advanced Solver for Problems in Earth’s ConvecTion.
--     . version 2.0.1 (aspect-2.0, 2863594)
--     . using deal.II 9.0.0
--     . using Trilinos 12.10.1
--     . using p4est 2.0.0
--     . running in DEBUG mode
--     . running with 1088 MPI processes
-- How to cite ASPECT: https://aspect.geodynamics.org/cite.html

I am simulating a 3D convecting system that was working well in previous ASPECT versions. Now when I run it, it goes well until about 700 steps, depending on the run (I have tried several times). It then suddenly fails on writing the quadrants... of the mesh? with sc_io.c:586 not handling it appropriately?

I would appreciate it if you could advise me on this...
warm regards,
Felipe
*** Timestep 699: t=4.06458e+15 seconds
Solving temperature system… 4 iterations.
Solving C_1 system … 15 iterations.
Rebuilding Stokes preconditioner…
Solving Stokes system… 31+0 iterations.

Postprocessing:
RMS, max velocity: 5.25e-10 m/s, 2.23e-09 m/s
Temperature min/avg/max: 1799 K, 1800 K, 2100 K
Heat fluxes through boundary parts: -593.9 W, -1828 W, -6447 W, -6900 W, -4.032e+09 W, 0.2552 W
Compositions min/max/mass: -0.02513/1.023/5.299e+18
Number of advected particles: 20000

Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: write quadrants
Abort: /work/04020/unfelipe/stampede2/software/candi/install/tmp/unpack/p4est-2.0/sc/src/sc_io.c:586
Abort
Abort: /work/04020/unfelipe/stampede2/software/candi/install/tmp/unpack/p4est-2.0/sc/src/sc_io.c:586
Abort
Abort: /work/04020/unfelipe/stampede2/software/candi/install/tmp/unpack/p4est-2.0/sc/src/sc_io.c:586
Abort

Hi Felipe,

Thanks for posting to the forum! I have seen similar errors in p4est, so hopefully there is a relatively easy fix.

A few questions to start:

  1. What cluster are you running on - Stampede2 or a local cluster?
  2. Do you receive the same error when running in release mode?
  3. Do you by chance know if the error is occurring during writing of a restart file or visualization file?

Thanks,
John

Hola John,

    Thanks for your reply.

    1. I am running on TACC Stampede2.

    2. Release mode? I am not sure what that means.

    3. I recompiled ASPECT in debug mode, expecting to get more info on the error, but the slurm log was very similar to that of the optimized build. This is an ASPECT simulation that fails during the first resumed stage: it runs well from step 0 to step n1; after a cluster timeout I resume it, and it again runs well... until it fails! (before reaching step 2*n1).

   I am considering installing ASPECT 1.5.0, which is the one I used when I started this project, and those simulations were running well... sad :(

   But my guess is that this is a stability error that must have a cure in ASPECT 2.0.1 or later.

cheers,

Felipe

Hi Felipe,

Release mode is synonymous with optimized mode (i.e., not debug mode). Unfortunately, I’m struggling to find information to help debug the problem.
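
In case it helps, switching an existing ASPECT build between the two modes usually looks roughly like the following; the build path is just an example, and the exact targets may vary with your setup:

    cd /path/to/aspect/build            # wherever you configured ASPECT (example path)
    cmake -DCMAKE_BUILD_TYPE=Release .  # or =Debug to switch back
    make -j 8
    # some builds also provide 'make release' / 'make debug' convenience targets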

In the previous time steps, what is reported in the log file after “Number of advected particles:”? At this point, I’m not sure whether p4est is failing while trying to write a specific file type, or whether something else (e.g., adaptive meshing) is the cause.

Also, can you tell me a bit more about your simulation (geometry, boundary conditions, etc.)? Would you also mind posting the log file from the model run?

Cheers,
John

Hi John,

I paste the bottom of my log file at the end, only the last two steps... It says nothing about the error :( So there is not much to learn from it, aside from what the slurm output writes, which can be summarized as: Abort: write quadrants

Abort: /work/04020/unfelipe/stampede2/software/candi/install/tmp/unpack/p4est-2.0/sc/src/sc_io.c:586

I attach both the log file and the slurm output for further investigation.

My system is Cartesian with temperature BCs: Dirichlet top and bottom, Neumann (zero flux) on the sidewalls. The IC is homogeneous, irrelevant.
Velocity BCs: Dirichlet on top, Neumann (tangential) on all the other walls.
Particles (just passive, on a compositional field with a different mass density) in a bottom layer.

I guess the only challenging part for ASPECT may be the particles…

subsection Initial composition model
  set Model name = function

  subsection Function
    set Variable names = x,y,z
    set Function expression = if(z<=0.2e6, 1, 0)
  end
end

subsection Particles
  set Number of particles = 20000

  set Time between data output = 3.1558e15
  set Data output format = vtu
  set Particle generator name = ascii file

  subsection Generator
    subsection Ascii file
      set Data directory = ./
      set Data file name = MyTracers_20k_lower200km.txt
    end
  end

  set List of particle properties = initial composition, initial position, pT path
end

cheers, from Beijing
Felipe

*** Timestep 698: t=4.05893e+15 seconds
Solving temperature system… 4 iterations.
Solving C_1 system … 14 iterations.
Rebuilding Stokes preconditioner…
Solving Stokes system… 31+0 iterations.

Postprocessing:
RMS, max velocity: 5.25e-10 m/s, 2.23e-09 m/s
Temperature min/avg/max: 1799 K, 1800 K, 2100 K
Heat fluxes through boundary parts: -601.1 W, -1830 W, -6444 W, -6898 W, -4.031e+09 W, 0.2205 W
Compositions min/max/mass: -0.02513/1.023/5.299e+18
Number of advected particles: 20000

*** Timestep 699: t=4.06458e+15 seconds
Solving temperature system… 4 iterations.
Solving C_1 system … 15 iterations.
Rebuilding Stokes preconditioner…
Solving Stokes system… 31+0 iterations.

Postprocessing:
RMS, max velocity: 5.25e-10 m/s, 2.23e-09 m/s
Temperature min/avg/max: 1799 K, 1800 K, 2100 K
Heat fluxes through boundary parts: -593.9 W, -1828 W, -6447 W, -6900 W, -4.032e+09 W, 0.2552 W
Compositions min/max/mass: -0.02513/1.023/5.299e+18
Number of advected particles: 20000


log.txt (433 KB)

Sorry, I forgot to attach the other file.

Here it goes!

Felipe

slurm.txt (180 KB)

Hi Felipe,

Thank you for attaching the log file. Based on previous time steps, I think the failure is occurring when a snapshot is being created.

I have not encountered this specific issue before and using a debugger is not really an option here. Does a similar error occur with smaller models (e.g., lower resolution) or is it exclusive to this particular model?

How long has it been since you updated the ASPECT master? There was a recent pull request that re-ordered how restart files are written, but I don’t think that is the issue here.

My suggestion for now is to build a new version of deal.II with candi, which builds the default deal.II version specified in candi (up to 9.1.x I think). Next, rebuild ASPECT after updating to the current master branch. You could also try updating ASPECT first with the version of deal.II already in use.
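
As a rough sketch (directory names and paths below are placeholders, adjust to your Stampede2 setup), updating and rebuilding ASPECT against a given deal.II installation could look like this:

    cd $WORK/aspect                          # your ASPECT source directory (example)
    git pull                                 # update to the current master branch
    cd build
    cmake -DDEAL_II_DIR=/path/to/deal.II ..  # point at the deal.II you want to build against
    make -j 16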

Other suggestions?

Cheers,
John

Hi John,

Thanks for your suggestions.

I wonder if the instructions and dependencies have changed. I updated these things about a year ago.

Install deal.II (v xxx) via candi:

    git clone https://github.com/dealii/candi.git
    cd candi

    % Replace (or change contents) of candi.cfg with candi.cfg provided in this email (?? where to find them)

    export FC=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpif90

    ./candi.sh -j 48 --prefix=/home1/04020/unfelipe/stampede2/software/candi/install/

cheers,

Felipe

Hi Felipe,

A large number of people, myself included, use the ASPECT Stampede2 instructions on a regular basis, so I don’t think outdated compiler links, etc. are the problem.

Before you try recompiling with the newer deal.II and ASPECT versions, here is a suggestion from Rene: Rename the old (previous restart output) restart files (restart.mesh.info.old, restart.resume.z.old, restart.mesh.old) to the most recent restart file names (restart.mesh.info, restart.resume.z, restart.mesh).

Next, restart the computation, which will of course then start from about 50 time steps back.

We are looking to see if the same restart file writing problem occurs again after starting from a few (about 50) time steps back. I suspect it will, but it is a cheap test to check.
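
For reference, the renaming could be done roughly like this from inside the model’s output directory (the directory name is just an example); using cp instead of mv keeps the .old files as a backup:

    cd output                                   # wherever 'Output directory' points
    cp restart.mesh.old      restart.mesh
    cp restart.mesh.info.old restart.mesh.info
    cp restart.resume.z.old  restart.resume.z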

Let us know how it goes.

Cheers,
John

Hi Felipe,
I also checked the line that caused the error inside p4est. It is simply the MPI_Write of some data that fails. Therefore I see three possibilities:

  • You are running out of disk space
  • Something changed in your dependencies and the write fails because of that
  • The data that ASPECT is trying to write is somehow corrupted.

The check for the first possibility is simple: take a look at what Stampede reports as your free storage when you log in.
For the second, consider just recompiling the versions of deal.II and ASPECT that you use at the moment (switching to newer versions like John suggested could help as well, but might also lead to new problems, so maybe stay with your current versions for now).
For the third, maybe try running this model without the particles. I am not sure, but if you have a single cell with a huge number of particles, this can increase the size of the checkpoint files significantly.
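
For the first check, something as simple as the following (standard Linux tools; the output directory name is an example) already gives a rough picture:

    du -sh output/      # size of the model output, including the restart files
    df -h $HOME         # free space reported for the file system holding it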

Best,
Rene

Dear John and Rene,

Thanks a lot for your support.

I am reviewing the account, the files, etc. I think I am running out of space... My Stampede2 account directory has a 10 GB memory limit, but each one of my meshes is likely surpassing 5 GB, so just having the current one and the old one would "spill the water out of the glass".

The thing is that reaching this state took a long time of work, and I cannot just run small simulations... I am in the 'large simulations' part of the project... I will see what the people at TACC say if I request some more memory for the account (just memory, not CPU). I guess 5 GB more would already help me greatly...

Otherwise, the bad solution: is there any way I can suppress ASPECT's periodic writing and keeping of the old mesh and its related restart files?

cheers,

Felipe

Welcome to Stampede2, please read these important system notes:

→ Stampede2 user documentation is available at:
https://portal.tacc.utexas.edu/user-guides/stampede2

--------------------- Project balances for user unfelipe ----------------------

Name            Avail SUs      Expires
TG-EAR160015        20505   2020-06-30
------------------------ Disk quotas for user unfelipe ------------------------

Hi Felipe,

That’s great news it is just a disk quota issue.

Normally it is not difficult to request more disk space. The formal procedure is to submit a supplementary request to the XSEDE proposal.

For now, you could simply not write out restart files. There is a way to suppress keeping the old restart files, but that requires a source code modification. My suggestion is to see if you can get a quick increase in available disk space.

Cheers,
John

Also, the 10 GB limitation only applies to the $HOME directory. Unless you are running a trial account, you should also have access to a $WORK directory with 1 TB of space. If you are running a trial account, then John is right: asking for more space is the best solution.

I think you are talking about the disk space limit, not the memory limit. But in any case, I'd be greatly surprised if they even flinched if you asked for more disk space. 10 GB is quite a small amount.

They may also point you at the user documentation, which suggests running these sorts of things on the scratch file system instead of your home directory. You may want to just move your output directory to the scratch file system, and then update the .prm file appropriately.
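
As a rough sketch (directory names are just examples), that move could look like this:

    # move the existing model output onto the scratch file system
    mkdir -p $SCRATCH/my_model
    mv $HOME/my_model/output $SCRATCH/my_model/

    # then point the parameter file at the new location, e.g.
    #   set Output directory = /scratch/.../my_model/output
    # or simply run the whole model from a directory under $SCRATCH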

Best
W.

Hi all of you,

Thanks for your support.

I am using a Stampede2 research allocation account. One of the TACC admins wrote to me; she said I should simply use the 1 TB $WORK directory.

I knew I had that space, but I was completely unaware that I could use it for these purposes. All good. Things should be ok now.

cheers and thanks to you all,

from Beijing,

Felipe

The guide about filesystems ( https://portal.tacc.utexas.edu/user-guides/stampede2#file-systems-introduction ) is quite specific about the usage of $HOME:
" Not intended for parallel or high-intensity file operations. "

Thanks!

Felipe