Hi,
I have been running 3D models without problems, but one model has run into a restart issue. It ran fine for two weeks and reached 289 Myr, but in the latest job the code suddenly could not rename the checkpoint file output_2/restart.mesh.new → output_2/restart.mesh. The error code is -1.
Below are the beginning and the end of the Slurm output file.
-----------------------------------------------------------------------------
-- This is ASPECT --
-- The Advanced Solver for Planetary Evolution, Convection, and Tectonics. --
-----------------------------------------------------------------------------
-- . version 3.1.0-pre (main, 00b250750)
-- . using deal.II 9.5.1
-- . with 32 bit indices
-- . with vectorization level 2 (AVX, 4 doubles, 256 bits)
-- . using Trilinos 13.2.0
-- . using p4est 2.3.2
-- . using Geodynamic World Builder 1.0.0
-- . running in OPTIMIZED mode
-- . running with 256 MPI processes
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=3.1.0-pre&cbfheatflux=1&mf=1&sha=00b250750&src=code
-----------------------------------------------------------------------------
*** Resuming from snapshot!
Number of active cells: 653,961 (on 3 levels)
Number of degrees of freedom: 23,092,118 (16,775,406+724,910+5,591,802)
Number of mesh deformation degrees of freedom: 2,174,730
Solving mesh displacement system... 0 iterations.
*** Timestep 50113: t=2.65929e+08 years, dt=5050.07 years
Solving mesh surface diffusion
Solving mesh displacement system... 7 iterations.
Solving temperature system... 9 iterations.
Solving Stokes system (GMG)... 21+0 iterations.
Relative nonlinear residuals (temperature, Stokes system): 4.02233e-08, 5.3775e-07
Relative nonlinear residual (total system) after nonlinear iteration 1: 5.3775e-07
Postprocessing:
Temperature min/avg/max: 264 K, 1733 K, 2272 K
Topography min/max: -960 m, 316.6 m
*** Timestep 50114: t=2.65934e+08 years, dt=5045.57 years
Solving mesh surface diffusion
Solving mesh displacement system... 7 iterations.
Solving temperature system... 8 iterations.
Solving Stokes system (GMG)... 42+0 iterations.
Relative nonlinear residuals (temperature, Stokes system): 3.52988e-08, 4.35208e-07
Relative nonlinear residual (total system) after nonlinear iteration 1: 4.35208e-07
.
.
.
*** Timestep 54567: t=2.89549e+08 years, dt=5413.57 years
Solving mesh surface diffusion
Solving mesh displacement system... 7 iterations.
Solving temperature system... 9 iterations.
Solving Stokes system (GMG)... 29+0 iterations.
Relative nonlinear residuals (temperature, Stokes system): 7.13189e-08, 2.19867e-06
Relative nonlinear residual (total system) after nonlinear iteration 1: 2.19867e-06
Solving temperature system... 9 iterations.
Solving Stokes system (GMG)... 8+0 iterations.
Relative nonlinear residuals (temperature, Stokes system): 3.20247e-08, 1.99895e-07
Relative nonlinear residual (total system) after nonlinear iteration 2: 1.99895e-07
Postprocessing:
Temperature min/avg/max: 263.9 K, 1733 K, 2272 K
Topography min/max: -954.9 m, 325.2 m
mv: cannot stat 'output_2/restart.mesh.new': No such file or directory
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped
to avoid a possible deadlock.
---------------------------------------------------------
----------------------------------------------------
Exception 'ExcMessage(std::string ("Unable to rename files: ") + old_name + " -> " + new_name + ". The error code is " + Utilities::to_string(error) + ".")' on rank 0 on processing:
--------------------------------------------------------
An error occurred in line <63> of file </work/n03/n03/mazq/modules/aspect/source/simulator/checkpoint_restart.cc> in function
void aspect::{anonymous}::move_file(const string&, const string&)
The violated condition was:
error == 0
Additional information:
Unable to rename files: output_2/restart.mesh.new ->
output_2/restart.mesh. The error code is -1.
Stacktrace:
-----------
#0 /work/n03/n03/mazq/modules/aspect/bin/aspect-release: ) [0x10533b3]
#1 /work/n03/n03/mazq/modules/aspect/bin/aspect-release: aspect::Simulator<3>::create_snapshot()
#2 /work/n03/n03/mazq/modules/aspect/bin/aspect-release: aspect::Simulator<3>::maybe_write_checkpoint(long, bool)
#3 /work/n03/n03/mazq/modules/aspect/bin/aspect-release: aspect::Simulator<3>::run()
#4 /work/n03/n03/mazq/modules/aspect/bin/aspect-release: void run_simulator<3>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, bool, bool)
#5 /work/n03/n03/mazq/modules/aspect/bin/aspect-release: main
--------------------------------------------------------
Aborting!
----------------------------------------------------
MPICH ERROR [Rank 0] [job id 9089936.0] [Thu Mar 20 05:32:43 2025] [nid006845] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: error: nid006845: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=9089936.0
slurmstepd: error: *** STEP 9089936.0 ON nid006845 CANCELLED AT 2025-03-20T05:32:46 ***
srun: error: nid006845: tasks 1-127: Terminated
srun: error: nid006847: tasks 128-255: Terminated
srun: Force Terminated StepId=9089936.0
Could you please help me understand what caused this error and how I can fix it?
I may also have made a mistake by resubmitting the job, because the code then started again from 0 Myr. Does that mean I lost the restart.mesh.new file that held the solution at timestep 54556 (289 Myr), and that it has been overwritten by a restart.mesh.new written at timestep 0? I would greatly appreciate any help with this. Thank you!
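To see what is actually left on disk, a minimal sketch like the following could list the files in output_2 whose names start with "restart", together with their sizes and modification times. A modification time later than the abort in the log above (Thu Mar 20 05:32 2025) would suggest that the resubmitted job rewrote the file. The helper name list_checkpoint_files is hypothetical, and the "restart" prefix is an assumption based only on the two file names in the error message.

```python
import os
import time


def list_checkpoint_files(output_dir, prefix="restart"):
    """Return (name, size_bytes, mtime) tuples for regular files in
    output_dir whose names start with prefix, oldest first."""
    entries = []
    for name in os.listdir(output_dir):
        path = os.path.join(output_dir, name)
        if name.startswith(prefix) and os.path.isfile(path):
            info = os.stat(path)
            entries.append((name, info.st_size, info.st_mtime))
    # Sort by modification time so the most recently written file is last.
    return sorted(entries, key=lambda entry: entry[2])


if __name__ == "__main__":
    output_dir = "output_2"  # output directory named in the error message
    if os.path.isdir(output_dir):
        for name, size, mtime in list_checkpoint_files(output_dir):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
            print(f"{stamp}  {size:>14d} bytes  {name}")
    else:
        print(f"{output_dir} not found")
```

This only reads file metadata and does not modify anything, so it should be safe to run before deciding whether to resubmit again.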
Best,
Ziqi