Resume computation fails

Hi all.
I am using ASPECT for my geodynamics coursework.
I tried to install it on an old cluster and ran into some problems, but I found information on the Internet that let me fix them.
After the installation was complete, I ran some test jobs and everything seemed to work well until I hit these problems:

  1. The generated vtu files can be read by ParaView only when I set Number of grouped files = 0. (This problem does not seem to occur when the number of processes is less than 8.)

  2. The job cannot be restarted. When I used qsub to submit the job, I got the following error message:


-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
--     . version 2.2.0-pre
--     . using deal.II 9.0.1
--     . with 32 bit indices and vectorization level 1 (128 bits)
--     . using Trilinos 12.10.1
--     . using p4est 2.0.0
--     . running in OPTIMIZED mode
--     . running with 8 MPI processes

*** Resuming from snapshot!

[node5:13086] *** Process received signal ***
[node5:13086] Signal: Segmentation fault (11)
[node5:13086] Signal code: Address not mapped (1)
[node5:13086] Failing at address: 0x103c681aa
[node5:13086] [ 0] /lib64/libpthread.so.0[0x3dacc0f710]
[node5:13086] [ 1] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libp4est-2.0.so(p4est_inflate+0x319)[0x2ba7e1923d59]
[node5:13086] [ 2] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libp4est-2.0.so(p4est_source_ext+0x700)[0x2ba7e1900bc0]
[node5:13086] [ 3] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libp4est-2.0.so(p4est_load_ext+0x6f)[0x2ba7e1900fcf]
[node5:13086] [ 4] /home/ming/software/buildnew/install/lib/libdeal_II.so.9.0.1(_ZN6dealii8parallel11distributed13TriangulationILi2ELi2EE4loadEPKcb+0x28f)[0x2ba7dff8b7ff]
[node5:13086] [ 5] /public/home/ming/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE20resume_from_snapshotEv+0x114)[0xbb77f4]
[node5:13086] [ 6] /public/home/ming/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x3f3)[0xca72f3]
[node5:13086] [ 7] /public/home/ming/aspect/build/aspect(_Z13run_simulatorILi2EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_bbb+0x1d9)[0x10670d9]
[node5:13086] [ 8] /public/home/ming/aspect/build/aspect(main+0x8a4)[0x1070894]
[node5:13086] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3dac81ed5d]
[node5:13086] [10] /public/home/ming/aspect/build/aspect[0x8633ad]
[node5:13086] *** End of error message ***

Then I ran it directly with the mpirun command and got the following message. (When I run with 1 MPI process, the restart works.)

Abort: invalid forest
Abort: /public/home/ming/deal.ii-candi/tmp/unpack/p4est-2.0/src/p4est.c:3749
Abort

[admin:24622] *** Process received signal ***
[admin:24622] Signal: Aborted (6)
[admin:24622] Signal code: (-6)
SIGABRT received
SIGABRT received
SIGABRT received
SIGABRT received
SIGABRT received
SIGABRT received
SIGABRT received
[admin:24622] [ 0] /lib64/libpthread.so.0[0x3ceea0f7e0]
[admin:24622] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3cee6324f5]
[admin:24622] [ 2] /lib64/libc.so.6(abort+0x175)[0x3cee633cd5]
[admin:24622] [ 3] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libsc-2.0.so(+0xaf31)[0x7f206232af31]
[admin:24622] [ 4] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libsc-2.0.so(sc_abort+0xa)[0x7f206232a0ca]
[admin:24622] [ 5] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libsc-2.0.so(+0xa5a0)[0x7f206232a5a0]
[admin:24622] [ 6] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libp4est-2.0.so(p4est_source_ext+0xa7d)[0x7f2062567f3d]
[admin:24622] [ 7] /public/home/ming/deal.ii-candi/p4est-2.0/FAST/lib/libp4est-2.0.so(p4est_load_ext+0x6f)[0x7f2062567fcf]
[admin:24622] [ 8] /home/ming/software/build/install/lib/libdeal_II.so.9.0.1(_ZN6dealii8parallel11distributed13TriangulationILi2ELi2EE4loadEPKcb+0x28f)[0x7f20699297ff]
[admin:24622] [ 9] ./aspect(_ZN6aspect9SimulatorILi2EE20resume_from_snapshotEv+0x114)[0xbb77f4]
[admin:24622] [10] ./aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x3f3)[0xca72f3]
[admin:24622] [11] ./aspect(_Z13run_simulatorILi2EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_bbb+0x1d9)[0x10670d9]
[admin:24622] [12] ./aspect(main+0x8a4)[0x1070894]
[admin:24622] [13] /lib64/libc.so.6(__libc_start_main+0x100)[0x3cee61ed20]
[admin:24622] [14] ./aspect[0x8633ad]
[admin:24622] *** End of error message ***

I guess there is something wrong with my installation…

I would appreciate it if you could give me some advice.
Warm regards,
Ming.

Hi Ming,

Thank you for posting the question here!

In the first part of the simulation (e.g., prior to attempting a restart), were “restart” files written to the output folder?
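
One quick way to check is to list whatever the first run left in the output directory. Below is a minimal Python sketch, assuming the checkpoint files have names beginning with `restart` and that the output directory is called `output`. The helper name `find_restart_files` and the directory name are placeholders to adapt, not anything ASPECT provides:

```python
from pathlib import Path


def find_restart_files(output_dir):
    """Return (name, size_in_bytes) for every regular file in output_dir
    whose name starts with 'restart', sorted by name.

    The 'restart' prefix is an assumption about how checkpoint files
    are named; adjust it if your files are called something else.
    """
    directory = Path(output_dir)
    if not directory.is_dir():
        return []
    return sorted(
        (path.name, path.stat().st_size)
        for path in directory.glob("restart*")
        if path.is_file()
    )


if __name__ == "__main__":
    # "output" is a placeholder for the run's output directory.
    found = find_restart_files("output")
    if not found:
        print("no restart files found")
    for name, size in found:
        # An empty file suggests the checkpoint was not written completely.
        note = "  <-- empty" if size == 0 else ""
        print(f"{name}: {size} bytes{note}")
```

If nothing is listed, or some files are empty, the problem is more likely in writing the checkpoint than in reading it back.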

The likely reason you need “set Number of grouped files = 0” is that the cluster file system does not support MPI-IO. If this is the case, restart files cannot be written unless one modifies p4est.

A number of us have successfully dealt with this by adding “--disable-mpiio” to lines 76 and 90 in candi/deal.II-toolchain/packages/p4est.package, assuming you built deal.II with candi.
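
The line numbers may differ in your copy of candi, so it can help to look at the configure calls in that file before editing it. Here is a small read-only Python sketch that prints every line mentioning `configure` with its line number and reports whether the flag is already present. The helper name `list_configure_lines` is made up for illustration, and the path is assumed to be relative to the directory that contains your candi checkout:

```python
from pathlib import Path

FLAG = "--disable-mpiio"


def list_configure_lines(package_text):
    """Return (line_number, line) pairs for lines that mention 'configure'.

    Line numbers are 1-based so they can be compared with an editor.
    """
    return [
        (number, line.rstrip())
        for number, line in enumerate(package_text.splitlines(), start=1)
        if "configure" in line
    ]


if __name__ == "__main__":
    # Path as given above; adjust it to where candi lives on your system.
    package = Path("candi/deal.II-toolchain/packages/p4est.package")
    if package.is_file():
        text = package.read_text()
        for number, line in list_configure_lines(text):
            print(f"{number}: {line}")
        print("flag already present:", FLAG in text)
    else:
        print(f"{package} not found; adjust the path")
```

The script does not modify anything; the flag still has to be added by hand, and p4est rebuilt afterwards.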

Cheers,
John

Hi John,

Thanks for your suggestion.
I just rebuilt p4est as you suggested, and now the restart works.

Cheers,
Ming

Hi Ming,
Great that this fixed your problem. Note that running without MPI I/O can degrade performance, especially for large computations, and that changing this setting in p4est is a workaround that might cause other problems (for example, incorrect restart files).
So, in the longer term, it might be worth figuring out with your admins why MPI I/O does not work on your machine.