I have been running computations on our cluster (peloton2) that involve resuming from a checkpoint, and I am seeing bizarre behavior where chunks assigned to different MPI processes appear to be swapped. I’ve attached a visualization file from immediately after the computation resumes. I am not sure yet whether the processor chunks are actually swapped in the solution or just in the visualization output. I write my visualization files in HDF5, which I know is less common among ASPECT users, but I have found these files much easier to manipulate and postprocess using MATLAB and Python. Is this a known issue, and do you have any ideas about how to go about debugging the problem?
I am using ASPECT 2.2.0-pre (8ff16a9f5) and deal.II 9.2.0-pre (78d1660594).
That is a bizarre problem.
So a few questions to narrow things down:
- Is it reproducible?
- Does it also happen with other viz file formats?
I’ll note that it doesn’t look like the three blocks in question are just
translated – they’re also garbled, which is probably because the order
of degrees of freedom in the three blocks does not follow the same
Yes, it is reproducible. It happens consistently on resume from checkpoint. I have not checked whether the problem occurs on other machines or with other MPI stacks on this cluster.
It does look like the solution itself is corrupted based on how the model evolves after resuming from checkpoint. Immediately following the checkpoint, it looks like subdomains are just swapped.
Do you have an idea whether the solution is corrupted or just the graphical output? That would help in diagnosing the error. I know that the HDF5 output has logic to only write solution vectors if the mesh did not change (this could be broken after a checkpoint). If the solution itself is wrong, then we have a bigger issue. Is this example simple enough that you can share it?
I just ran tests, and the artifacts do not appear when using .vtu output. I will try making a quick modification (for testing) to write the mesh with every HDF5 output and see if the problem disappears.
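
For anyone following along, here is a toy sketch of the stale-mesh hypothesis discussed above. It is not ASPECT or deal.II code. It assumes a 1D domain of six points split over three ranks, a made-up field `T(x) = 10 x`, and a writer that simply concatenates each rank's data in rank order; all function and variable names are hypothetical. If the partition changes across a restart but the reader still pairs the new solution data with the mesh written before the restart, whole chunks look swapped even though every value is correct.

```python
# Toy model only: a hypothetical 1D "mesh" of six points over three ranks.
# None of these names come from ASPECT or deal.II.

def field(x):
    """Made-up solution field, so that correctness is easy to check."""
    return 10.0 * x

def write_mesh(partition):
    """Concatenate the coordinates owned by each rank, in rank order."""
    return [x for rank_points in partition for x in rank_points]

def write_solution(partition):
    """Concatenate the solution values owned by each rank, in rank order."""
    return [field(x) for rank_points in partition for x in rank_points]

def mismatches(coords, values):
    """Return coordinates whose paired value disagrees with the field."""
    return [x for x, v in zip(coords, values) if v != field(x)]

# Assumed ownership before the checkpoint and after the restart.
before = [[0, 1], [2, 3], [4, 5]]
after = [[4, 5], [2, 3], [0, 1]]

stale_mesh = write_mesh(before)   # mesh file written before the restart
fresh_mesh = write_mesh(after)    # mesh file rewritten after the restart
solution = write_solution(after)  # solution written after the restart

bad_with_stale_mesh = mismatches(stale_mesh, solution)
bad_with_fresh_mesh = mismatches(fresh_mesh, solution)

print("wrong points with stale mesh:", bad_with_stale_mesh)
print("wrong points with fresh mesh:", bad_with_fresh_mesh)
```

In this toy case the chunks owned by the first and last rank appear swapped with the stale mesh, and the artifact disappears once the mesh is rewritten. That is the outcome the test of writing the mesh with every HDF5 output would show if the hypothesis holds. A real case would also reorder points within each chunk, which would match the garbled (not merely translated) blocks noted earlier.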