3D model restart issues

Hi,
I have been running 3D models with no problem, but in one case I ran into restart issues. The model had been running fine for two weeks and had reached 289 Myr, but in the latest job the code suddenly could not rename files: output_2/restart.mesh.new →
output_2/restart.mesh. The error code is -1.
Below are the beginning and the end of the slurm output file.

-----------------------------------------------------------------------------
--                             This is ASPECT                              --
-- The Advanced Solver for Planetary Evolution, Convection, and Tectonics. --
-----------------------------------------------------------------------------
--     . version 3.1.0-pre (main, 00b250750)
--     . using deal.II 9.5.1
--     .       with 32 bit indices
--     .       with vectorization level 2 (AVX, 4 doubles, 256 bits)
--     . using Trilinos 13.2.0
--     . using p4est 2.3.2
--     . using Geodynamic World Builder 1.0.0
--     . running in OPTIMIZED mode
--     . running with 256 MPI processes
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
--   https://aspect.geodynamics.org/citing.html?ver=3.1.0-pre&cbfheatflux=1&mf=1&sha=00b250750&src=code
-----------------------------------------------------------------------------
*** Resuming from snapshot!

Number of active cells: 653,961 (on 3 levels)
Number of degrees of freedom: 23,092,118 (16,775,406+724,910+5,591,802)

Number of mesh deformation degrees of freedom: 2,174,730
   Solving mesh displacement system... 0 iterations.
*** Timestep 50113:  t=2.65929e+08 years, dt=5050.07 years
   Solving mesh surface diffusion
   Solving mesh displacement system... 7 iterations.
   Solving temperature system... 9 iterations.
   Solving Stokes system (GMG)... 21+0 iterations.
      Relative nonlinear residuals (temperature, Stokes system): 4.02233e-08, 5.3775e-07
      Relative nonlinear residual (total system) after nonlinear iteration 1: 5.3775e-07


   Postprocessing:
     Temperature min/avg/max: 264 K, 1733 K, 2272 K
     Topography min/max:      -960 m, 316.6 m

*** Timestep 50114:  t=2.65934e+08 years, dt=5045.57 years
   Solving mesh surface diffusion
   Solving mesh displacement system... 7 iterations.
   Solving temperature system... 8 iterations.
   Solving Stokes system (GMG)... 42+0 iterations.
      Relative nonlinear residuals (temperature, Stokes system): 3.52988e-08, 4.35208e-07
      Relative nonlinear residual (total system) after nonlinear iteration 1: 4.35208e-07
.
.
.
*** Timestep 54567:  t=2.89549e+08 years, dt=5413.57 years
   Solving mesh surface diffusion
   Solving mesh displacement system... 7 iterations.
   Solving temperature system... 9 iterations.
   Solving Stokes system (GMG)... 29+0 iterations.
      Relative nonlinear residuals (temperature, Stokes system): 7.13189e-08, 2.19867e-06
      Relative nonlinear residual (total system) after nonlinear iteration 1: 2.19867e-06

   Solving temperature system... 9 iterations.
   Solving Stokes system (GMG)... 8+0 iterations.
      Relative nonlinear residuals (temperature, Stokes system): 3.20247e-08, 1.99895e-07
      Relative nonlinear residual (total system) after nonlinear iteration 2: 1.99895e-07


   Postprocessing:
     Temperature min/avg/max: 263.9 K, 1733 K, 2272 K
     Topography min/max:      -954.9 m, 325.2 m

mv: cannot stat 'output_2/restart.mesh.new': No such file or directory
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped
to avoid a possible deadlock.
---------------------------------------------------------


----------------------------------------------------
Exception 'ExcMessage(std::string ("Unable to rename files: ") + old_name + " -> " + new_name + ". The error code is " + Utilities::to_string(error) + ".")' on rank 0 on processing:

--------------------------------------------------------
An error occurred in line <63> of file </work/n03/n03/mazq/modules/aspect/source/simulator/checkpoint_restart.cc> in function
    void aspect::{anonymous}::move_file(const string&, const string&)
The violated condition was:
    error == 0
Additional information:
    Unable to rename files: output_2/restart.mesh.new ->
    output_2/restart.mesh. The error code is -1.
Stacktrace:
-----------
#0  /work/n03/n03/mazq/modules/aspect/bin/aspect-release: ) [0x10533b3]
#1  /work/n03/n03/mazq/modules/aspect/bin/aspect-release: aspect::Simulator<3>::create_snapshot()
#2  /work/n03/n03/mazq/modules/aspect/bin/aspect-release: aspect::Simulator<3>::maybe_write_checkpoint(long, bool)
#3  /work/n03/n03/mazq/modules/aspect/bin/aspect-release: aspect::Simulator<3>::run()
#4  /work/n03/n03/mazq/modules/aspect/bin/aspect-release: void run_simulator<3>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, bool, bool)
#5  /work/n03/n03/mazq/modules/aspect/bin/aspect-release: main
--------------------------------------------------------

Aborting!
----------------------------------------------------
MPICH ERROR [Rank 0] [job id 9089936.0] [Thu Mar 20 05:32:43 2025] [nid006845] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: error: nid006845: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=9089936.0
slurmstepd: error: *** STEP 9089936.0 ON nid006845 CANCELLED AT 2025-03-20T05:32:46 ***
srun: error: nid006845: tasks 1-127: Terminated
srun: error: nid006847: tasks 128-255: Terminated
srun: Force Terminated StepId=9089936.0

Could you please help me understand what caused this and how I can fix it?
I may have made a mistake by resubmitting the job, because I found that the code simply started over from 0 Myr. Does that mean I lost the restart.mesh.new file that saved the solution at timestep 54556 (289 Myr) and overwrote it with a new restart.mesh.new from timestep 0? I would greatly appreciate it if you could help me with this. Thank you!

Best,
Ziqi

@maziqi96 It’s possible that you lost the restart file – you’ll have to look into the directory where it is stored to see what files you have left there.

As for the cause, did you run out of disk space allocation?
Best
W.

Hi Wolfgang,
Thanks for your timely response.
Running out of disk space was my first suspicion, but that’s not the case: there was, and still is, plenty of space left on the disk (about 1 TB). What baffles me is that all the other cases are running well, but this case lost its restart files at a random timestep, even though nothing had gone wrong in the previous timesteps and all the restart files were stored in the output directory.
Anyway, I guess there’s no solution other than rerunning the case, because the restart files have already been overwritten by the new restart file from timestep 0. Thank you for your help nonetheless!

Best,
Ziqi

Hi Wolfgang,
I am sorry to restart this conversation. The code crashed again because it couldn’t find the restart.mesh file, or rather ‘unable to rename files: output_1/restart.mesh.new → output_1/restart.mesh. The error code is -1.’, even though the restart files are right there in the output_1 directory:

mazq@ln02:/work/n03/n03/mazq/modules/aspect/3Dtest15_vp_5/output_1> ls
depth_average.gnuplot	     topography.107595	topography.32348  topography.55396  topography.77861
depth_average.txt	     topography.107795	topography.32543  topography.55600  topography.78047
log.txt			     topography.107948	topography.32737  topography.55804  topography.78233
original.prm		     topography.107996	topography.32932  topography.56007  topography.78420
parameters.json		     topography.108196	topography.33127  topography.56211  topography.78606
parameters.prm		     topography.108396	topography.33322  topography.56415  topography.78793
**restart.mesh.info.old**	     topography.108397	topography.33517  topography.56618  topography.78820
**restart.mesh.new**	     topography.10847	topography.33712  topography.56822  topography.78981
**restart.mesh.new.info**	     topography.108596	topography.33907  topography.56842  topography.79168
**restart.mesh.new_fixed.data**  topography.108597	topography.34102  topography.57026  topography.79356
**restart.mesh_fixed.data.old**  topography.108797	topography.34297  topography.57229  topography.79544
**restart.resume.z.new**	     topography.11055	topography.34493  topography.57433  topography.79732
**restart.resume.z.old**	     topography.11263	topography.34688  topography.57637  topography.79920
solution		     topography.11470	topography.34884  topography.57841  topography.80108
solution.pvd		     topography.11678	topography.35080  topography.58045  topography.80297
solution.visit		     topography.11886	topography.35275  topography.58248  topography.80486
statistics		     topography.12093	topography.35471  topography.58452  topography.80675
topography.00000	     topography.12300	topography.35668  topography.58656  topography.80864
topography.00017	     topography.12507	topography.35864  topography.58860  topography.81054
topography.00033

And there is plenty of disk space.

lfs quota -hu mazq /work/n03/n03
Disk quotas for usr mazq (uid 10560):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
  /work/n03/n03  1.366T      0k   2.93T       -  445685       0       0       -

I attached the slurm file below. I didn’t resubmit the job this time because I can’t afford to lose the restart.mesh files and restart the run from the beginning: it took more than 40 days to get this far (277 Myr), and I need results (up to 300 Myr) before EGU.
slurm-9249864.txt (947.2 KB)
I would greatly appreciate it if you could help me figure out why the code crashed, so that it won’t happen again, and how to resume the run from 277 Myr rather than from the beginning. Thank you very much!

Cheers,
Ziqi

Hi Ziqi,

Thanks for sharing the detailed information. I’m also currently using the 3D restart functionality in spherical models, and I haven’t encountered this specific issue so far. However, I do have a few suggestions that might help:

  1. Manual file move after backup
    After the backup is generated (e.g., restart.mesh.new), could you try manually moving it to a different filename (like restart.mesh.test) to check whether it’s a file permission issue? This can help isolate whether the system is blocking the rename operation due to permissions or file locks (a short command sketch follows after this list).

  2. File replication and effective disk quota
    On some HPC systems, file replication is enabled by default (especially in Lustre-based systems). This means the effective usable disk space is actually half of the reported quota. Even if your lfs quota shows enough space, you could still hit the limit due to replication overhead.

  3. Check group quota with mmlsquota
    It might be helpful to run the following command to check your group quota:

    mmlsquota -g <your_group_name> --block-size auto
    

    In some cases, the user quota is fine but the group quota is exceeded, which can silently block further writes. If your HPC system provides a scratch directory, you could test running the job there by copying your previous outputs and checking whether the restart works as expected.
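To make the first check concrete, here is a minimal sketch of what I have in mind (output_2 is only an example; use whichever output directory the failing case writes to):

cd output_2
ls -l restart.mesh.new                  # does the file exist and have a sensible size?
touch write_test && rm write_test       # can we still create and delete files here?
mv restart.mesh.new restart.mesh.test   # does a manual rename go through?
mv restart.mesh.test restart.mesh.new   # move it back so the checkpoint stays usable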

I hope some of the above ideas might help pinpoint the issue. Please let me know if you’d like me to test anything on my side.

Best,
Ninghui

Hi Ninghui,
Thank you very much for your timely reply. I have tried the three things that you suggested:

  1. Manual file move after backup
    After the backup is generated (e.g., restart.mesh.new), could you try manually moving it to a different filename (like restart.mesh.test) to check whether it’s a file permission issue? This can help isolate whether the system is blocking the rename operation due to permissions or file locks.

I can rename the file ‘restart.mesh.new’ to restart.mesh.test, and I don’t think it is a permission issue, because:

mazq@ln02:/work/n03/n03/mazq/modules/aspect/3Dtest17_rheology/output_4> ls -la
total 6685492
drwxr-sr-x 3 mazq n03 12288 Apr 4 14:23 .
drwxr-sr-x 11 mazq n03 4096 Apr 4 14:24 ..
-rw-r--r-- 1 mazq n03 3657213 Apr 4 09:59 depth_average.gnuplot
-rw-r--r-- 1 mazq n03 1164924 Apr 4 09:59 depth_average.txt
-rw-r--r-- 1 mazq n03 11713300 Apr 4 10:19 log.txt
-rw-r--r-- 1 mazq n03 9079 Apr 4 07:37 original.prm
-rw-r--r-- 1 mazq n03 1093069 Apr 4 07:37 parameters.json
-rw-r--r-- 1 mazq n03 795734 Apr 4 07:37 parameters.prm
-rw-r--r-- 1 mazq n03 102 Apr 4 10:19 restart.mesh.info
-rw-r--r-- 1 mazq n03 22091888 Apr 4 10:19 restart.mesh.new
-rw-r--r-- 1 mazq n03 3129189712 Apr 4 10:19 restart.mesh.new_fixed.data
-rw-r--r-- 1 mazq n03 22093680 Apr 4 10:16 restart.mesh.old
-rw-r--r-- 1 mazq n03 3129648912 Apr 4 10:16 restart.mesh_fixed.data.old
-rw-r--r-- 1 mazq n03 1254596 Apr 4 10:19 restart.resume.z.new
-rw-r--r-- 1 mazq n03 1254094 Apr 4 10:16 restart.resume.z.old

  1. File replication and effective disk quota
    On some HPC systems, file replication is enabled by default (especially in Lustre-based systems). This means the effective usable disk space is actually half of the reported quota. Even if your lfs quota shows enough space, you could still hit the limit due to replication overhead.

I don’t think this is the case, because I have run models in the same directory before and usage reached 2.8 TB, so I know I have roughly a 3 TB quota.

  1. Check group quota with mmlsquota
    It might be helpful to run the following command to check your group quota. In some cases, the user quota is fine but the group quota is exceeded, which can silently block further writes.

That might be the reason, but unfortunately I can’t find the command on ARCHER2.

mmlsquota -g /work/n03/n03 --block-size auto
-bash: mmlsquota: command not found
I don’t think there’s anything wrong with my ASPECT installation or parameter file, because other cases work fine; this issue only happened to some of them. Is there any alternative command that I can use?

Cheers,
Ziqi

Hi Ziqi,

Thank you very much for trying out the suggestions — I really appreciate your thorough follow-up and detailed feedback.

Another potential cause to consider: when submitting the job through a scheduler like SLURM or LSF, did the working directory change between the initial run and the restart attempt? If so, using absolute paths instead of relative paths in your parameter file might help prevent file access issues related to the current working directory.
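Just to make that concrete, the idea would be to write something like

set Output directory = /work/n03/n03/mazq/modules/aspect/3Dtest15_vp_5/output_1

in the parameter file instead of a relative path such as output_1 (the path above is only an illustration, taken from your earlier directory listing).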

Since you already have the restart.mesh.new and restart.resume.z.new files present, you could try manually renaming them to the expected filenames before restarting:

cd output_1
mv restart.mesh.new restart.mesh
mv restart.mesh.new.info restart.mesh.info
mv restart.mesh.new_fixed.data restart.mesh_fixed.data
mv restart.resume.z.new restart.resume.z

Before doing so, please make sure to back up your output_1 directory, just in case anything goes wrong. Once the files are renamed, try restarting the model with:

set Resume computation = true

If you do this, could you let me know whether the model resumes normally? That would help narrow down whether this is purely a file system/renaming issue.

Looking forward to hearing how it goes!

Best,
Ninghui

Hi Ninghui,
Thank you very much for your helpful reply!
No, the working directory never changes between the initial run and the restart attempt.
I tried your suggestion of manually renaming the files to the expected filenames before restarting. Out of the 9 cases that crashed because of the failure to rename the restart files, 7 resumed normally, whereas 2 crashed after a few timesteps because of an ‘Input/output error’. I have attached the slurm files for one successfully resumed case and the two unsuccessfully resumed cases, which I hope can be of some help!
The successfully resumed case’s slurm file before manually renaming the restart files: output4_rename_error.txt (929.6 KB)
after manually renaming the restart files:
output4_resume.txt (1.1 MB)

The first failed-to-resume case’s slurm file before manually renaming the restart files: slurm-9249864.txt (947.2 KB)
after manually renaming the restart files:
failed_case1_ioerror.txt (1.6 MB)

The second failed-to-resume case’s slurm file before manually renaming the restart files: failed_case2_rename_error.txt (804.2 KB)
after manually renaming the restart files:
failed_case2_ioerror.txt (206.8 KB)

Cheers,
Ziqi

Hi Ziqi,

Thanks again for the detailed follow-up!

Here are two additional suggestions that might help:

  1. File system metadata delay: On high-load Lustre systems, it’s possible for a file to appear in ls listings but still not be accessible due to delayed metadata synchronization. You could try using stat restart.mesh.new or lfs getstripe restart.mesh.new to verify the file’s actual availability. If you don’t have permission to run these commands, it might be helpful to consult the system administrators — they should be able to provide more insight into why the write operation failed.

  2. Try resuming from a different output directory: You can copy the .new files to a clean directory (e.g., output_resume) and point the parameter file at it with set Output directory = output_resume, as sketched below. This can sometimes help avoid file rename conflicts during checkpoint restarts.
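Here is a rough sketch combining the two checks (the file names follow the rename recipe from my earlier message, and output_1/output_resume are just examples; adjust them to your case):

stat output_1/restart.mesh.new             # is the file really reachable, or only listed?
lfs getstripe output_1/restart.mesh.new    # show the Lustre layout backing the file

mkdir -p output_resume
cp output_1/restart.mesh.new              output_resume/restart.mesh
cp output_1/restart.mesh.new.info         output_resume/restart.mesh.info
cp output_1/restart.mesh.new_fixed.data   output_resume/restart.mesh_fixed.data
cp output_1/restart.resume.z.new          output_resume/restart.resume.z
# then, in the parameter file:
#   set Output directory = output_resume
#   set Resume computation = true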

Let me know if you try these approaches and whether they help!

Best,
Ninghui

@maziqi96 I don’t have a particularly good idea what the issue may be – it’s clearly something local to your system, given that in the many years that this code has been in the code base nobody else has run into it.

I looked at your error messages again, and the failure is in the call to rename(). rename() is documented to return -1 as an error code if something goes wrong, and to record specifics about that error in the errno variable (see the rename(2) Linux manual page). We don’t currently output these specifics, which is of course why you’ve been poking around blindly for what has been causing the issue. But I decided to write a small patch that also outputs these specifics, see here: Also output errno when the rename() call fails. by bangerth · Pull Request #6273 · geodynamics/aspect · GitHub. If you want, you can apply that patch to the version you have and run your models again. This way you should get more information when the problem happens next time. It’s not clear to me how we would address the issue, but knowing more about its causes may still be useful.
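In the meantime, if strace is available on your system, you can already see which errno a manual rename on that file system produces by tracing the move yourself (the paths are placeholders, and the errno of a manual rename need not be the same one ASPECT hits at scale, but it is a useful data point):

# the trace shows each rename syscall with its return value and, on failure,
# the symbolic errno (e.g. EIO, EACCES, ENOENT)
strace -f -e trace=rename,renameat,renameat2 \
    mv output_1/restart.mesh.new output_1/restart.mesh.test
# move the file back so the checkpoint stays usable
strace -f -e trace=rename,renameat,renameat2 \
    mv output_1/restart.mesh.test output_1/restart.mesh.new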
Best
W.