Problem getting output in ASPECT

Hello,

I am facing a problem that escapes my understanding of ASPECT postprocessor behavior and capabilities.

 My problem is simple: I cannot obtain good graphical output at my prescribed times.

I have been running a simulation for more than two weeks, all good: producing output, graphical output every now and then according to my settings, etc. Two days ago the simulation time reached the next (latest) output time; I downloaded the data and saw that the .vtu file weighs just a few kilobytes (and ParaView can’t plot it). My simulation has more than 6 million degrees of freedom at this stage, and the previous .vtu files weigh about 20 MB.

So I decided to resume the run and prescribe an earlier output time, to get output quickly. What happened next blows my mind: if the output interval is “small”, the simulation will simply not run.

For example:

set Time between graphical output = 3.15576e14 (which is 10 Ma)

Any value smaller than or equal to 10 Ma makes the simulation stall…

Importantly, at this stage the time steps (given by the solver) are about 3e11 s, so I expect any output time interval larger than that to work properly.
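For reference, here is a quick sketch of the unit arithmetic I am using (my own sanity check; I take 1 year = 3.15576e7 s, i.e. 365.25 days, which matches the prm value above):

```python
# Quick sanity check of the time units involved (my own sketch).
YEAR_IN_SECONDS = 3.15576e7  # 365.25 days, matching the prm convention above

ten_ma = 10e6 * YEAR_IN_SECONDS              # 10 Ma expressed in seconds
print(f"{ten_ma:.5e}")                       # 3.15576e+14, the prm value above

time_step = 3e11                             # current solver time step, in seconds
print(f"{time_step / YEAR_IN_SECONDS:.0f}")  # about 9506 years per step
```

So an output interval of 10 Ma spans roughly a thousand solver time steps; any interval above one step (~10 ka) should in principle trigger output at some step.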

I also tried discarding all the latest mesh files and using the old ones, just renaming them, hoping that the simulation was still healthy at a younger stage (a strategy that has worked on many occasions). The problem was the same: no run if the prescribed output time is too soon…

My problem is that I have limited resources (Stampede2) remaining for my project, so I can’t afford to waste service units, and I can’t afford to simply wait 50 Ma of computed time to get another output. Even worse, I have no assurance that such a future output would be healthy.

What should I do? What should I check for?

I appreciate your help,

Felipe

Can you share some additional information, please? From what you have written it is hard to tell exactly what is happening.

What happens exactly when you say the simulation “remains stalled”? What output format are you using? Grouping of files? What file system is the output written to? How many MPI ranks?

Hi Timo, or to whom it may concern,

Your email prompted me to pay attention to details I had not thought about. Helpful.

When my simulations stall, the runtime log reads:

– This is ASPECT, the Advanced Solver for Problems in Earth’s ConvecTion.
– . version 2.0.1 (aspect-2.0, 2863594)
– . using deal.II 9.0.0
– . using Trilinos 12.10.1
– . using p4est 2.0.0
– . running in DEBUG mode
– . running with 1088 MPI processes
– How to cite ASPECT: https://aspect.geodynamics.org/cite.html

*** Resuming from snapshot!

Number of active cells: 124,153 (on 9 levels)
Number of degrees of freedom: 6,034,421 (3,526,572+156,801+1,175,524+1,175,524)

*** Timestep 4080: t=4.74463e+15 seconds
Solving temperature system… 3 iterations.
Solving C_1 system … 14 iterations.
Rebuilding Stokes preconditioner…
Solving Stokes system… 57+0 iterations.

Postprocessing:

%%%%%%%%%%%%%%%%%%%%%%%%%%
So it stalls at postprocessing.

These are my output settings:

subsection Postprocess
  set List of postprocessors = velocity statistics, temperature statistics, heat flux statistics, visualization, composition statistics, particles

  subsection Visualization
    set List of output variables = density, viscosity # dynamic topography
    set Time between graphical output = 3.15576e13
    set Output format = vtu
    set Number of grouped files = 1
  end
end
I was also using particles, but I have disabled that postprocessor to isolate the issue.

Regarding your questions, I then tried different values for the number of grouped files: 8 and 0 (no grouping). In both cases it failed, but at least it gave some more info:

Number of active cells: 124,153 (on 9 levels)
Number of degrees of freedom: 6,034,421 (3,526,572+156,801+1,175,524+1,175,524)

*** Timestep 4070: t=4.74214e+15 seconds
Solving temperature system… 3 iterations.
Solving C_1 system … 14 iterations.
Rebuilding Stokes preconditioner…
Solving Stokes system… 83+0 iterations.

Postprocessing:


An error occurred in line <5777> of file </work/04020/unfelipe/stampede2/software/candi/install/tmp/unpack/deal.II-v9.0.0/source/base/data_out_base.cc> in function
void dealii::DataOutBase::write_visit_record(std::ostream&, const std::vector<std::pair<double, std::vector<std::__cxx11::basic_string > > >&)
The violated condition was:
domain->second.size() == nblocks
Additional information:
piece_names should be a vector of equal sized vectors.

Stacktrace:

#0 /work/04020/unfelipe/stampede2/software/candi/install/deal.II-v9.0.0/lib/libdeal_II.g.so.9.0.0: dealii::DataOutBase::write_visit_record(std::ostream&, std::vector<std::pair<double, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >, std::allocator<std::pair<double, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > > > const&)
#1 /scratch/04020/unfelipe/software/candi/aspect/aspect: aspect::Postprocess::Visualization<3>::write_master_files(dealii::DataOut<3, dealii::DoFHandler<3, 3> > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&)
#2 /scratch/04020/unfelipe/software/candi/aspect/aspect: aspect::Postprocess::Visualization<3>::executeabi:cxx11
#3 /scratch/04020/unfelipe/software/candi/aspect/aspect: aspect::Postprocess::Manager<3>::executeabi:cxx11
#4 /scratch/04020/unfelipe/software/candi/aspect/aspect: aspect::Simulator<3>::postprocess()
#5 /scratch/04020/unfelipe/software/candi/aspect/aspect: aspect::Simulator<3>::run()
#6 /scratch/04020/unfelipe/software/candi/aspect/aspect: void run_simulator<3>(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, bool, bool)
#7 /scratch/04020/unfelipe/software/candi/aspect/aspect: main

Calling MPI_Abort now.
To break execution in a GDB session, execute ‘break MPI_Abort’ before running. You can also put the following into your ~/.gdbinit:
set breakpoint pending on
break MPI_Abort
set breakpoint pending auto
application called MPI_Abort(MPI_COMM_WORLD, 255) - process 0
TACC: MPI job exited with code: 255
TACC: Shutdown complete. Exiting.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

What should I do?

I think this issue should be solvable: the simulation does run if I set an output time interval larger than about 10 Ma. So it runs, and I would therefore expect it to be able to write graphical output.
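For anyone hitting the same assertion: my reading of the error message (an interpretation, not taken from the actual deal.II sources) is that the .visit master record keeps one list of piece file names per output time, and `write_visit_record` requires all of those lists to have equal length, so changing `Number of grouped files` when resuming would violate that invariant. A toy sketch of the check:

```python
# Toy model (NOT deal.II code) of the consistency check behind the assertion:
# each recorded output time carries a list of piece file names, and every
# list must have the same length as the first one.
def check_visit_record(times_and_pieces):
    nblocks = len(times_and_pieces[0][1])
    for time, pieces in times_and_pieces:
        if len(pieces) != nblocks:
            raise ValueError("piece_names should be a vector of equal sized vectors.")
    return True

# Steps written with one grouped file each: consistent.
ok = [(0.0, ["solution-0000.vtu"]), (1.0, ["solution-0001.vtu"])]
print(check_visit_record(ok))  # True

# Resuming with 8 grouped files appends a longer list: the check fails.
bad = ok + [(2.0, ["solution-0002.%d.vtu" % i for i in range(8)])]
try:
    check_visit_record(bad)
except ValueError as err:
    print(err)
```

The file names here are made up for illustration; the point is only that mixing group counts within one run produces unequal lists.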

cheers,

Felipe

In general, you should run in release mode, not debug mode, for large computations (especially on a thousand cores!). In this case, though, it is helpful to have the debug error messages.

What did you change from the original prm when you resumed the computation? I fear that you ended up in a situation where something is inconsistent. Does the prm crash if you start without resuming from a checkpoint?

What filesystem do you use for your graphical output?

If you think that you found a bug, it would be helpful to have a minimal example that breaks without resuming computations.

You have to figure out whether what you are computing is sensible. I am afraid we cannot help with this directly.

Hi Timo,

Problem solved.

In my TACC Stampede2 account, the (scratch) directory where I was running had been set with a limit on output ‘frequency’. This was something they did long ago to prevent instabilities with my runs.

It took me several tests to realize and recall all of that, and to really learn how their restriction worked (I had never seen it in action before).

Anyway, your diagnosis and attention were really helpful. Thank you very much.

Thanks Aspect team.

Felipe