Running problem on cluster with PETSc error "Memory access out of range"

stdout.txt (5.1 KB) stderr.txt (256.3 KB)

Hi all,

I have run into the attached error while running a large problem on a cluster. The error persists even when I switch to a smaller problem. Just for testing, I request 10 nodes with 4 processors on each node.
“[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range”

There are 433200 cells in the domain. Is it simply because I don’t have enough memory on each node? I am using a coarse mesh as input and then using the refiner to get a finer mesh; from my understanding, this refinement is done after the mesh is distributed. If I turn off the refiner and run with only the coarse mesh, it works fine.

This is the command that I use to run the problem:
pylith batch.cfg pylithapp.cfg --nodes=40 --scheduler.ppn=4 --job.walltime=4*hour

Here is batch.cfg:

scheduler = pbs

[pylithapp.pbs]
shell = /bin/bash
qsub-options = -V -m bea -M xiaoma5@illinois.edu

[pylithapp.launcher]
command = mpirun -np {nodes} -machinefile {PBS_NODEFILE}

I am also attaching the input file.
pylithapp.cfg (7.1 KB)

One more piece of information: even though these errors pop up, the program does not get killed. It writes the results for the initial time step, but as soon as it advances to the next time step, it hangs.

Best,
Xiao

Hi, that SEGV usually indicates an invalid memory access, so this is possibly a PyLith bug. There are a couple of things you can do to narrow it down. If you can reproduce the error on an interactive machine, it’s easy to get a stack trace. Just run with

--petsc.start_in_debugger

and then type ‘cont’ when it pops up. If the large number of windows is annoying, you might be able to get away with using

--petsc.debugger_nodes=0

to get just one window. This should still work in a batch queue if you can get an X display to attach, which you might have to ask your admin about. If you can run interactively, it might even be possible to run under valgrind, which would be the best option.

Let me know how it goes,

 Matt

If the coarse mesh works using the same number of nodes/processes, then the likely problem is that you are running out of memory on one node/process. The most likely place for this to happen is while the mesh is being distributed. If you have journals turned on so you can see the progress (sent to stdout) and you don’t get past mesh distribution, then running out of memory is very likely the cause. If the error occurs during initialization, then be sure to check earlier in the output for additional information (whenever PyLith throws an exception with an error message, PETSc will report a SEGV error). Finally, if the simulation is going through time steps, then it is likely to be an error or a bug and not a memory issue. You will need to look carefully at the stdout and stderr logs to see what is going on.
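To make the "check earlier in the output" step less tedious on a large stderr file, a small script along these lines might help. This is only a sketch, not part of PyLith or PETSc: the function names `first_petsc_error` and `segv_ranks` are made up for illustration, and it assumes every PETSc error line starts with a rank prefix like the `[0]PETSC ERROR:` line quoted in the original post. It prints the lines just before the first PETSc error (where a PyLith exception message would normally appear) and lists which ranks reported the SEGV.

```python
import re
from pathlib import Path

# Assumption: PETSc error lines look like "[<rank>]PETSC ERROR: ...",
# matching the line quoted in the original post.
PETSC_ERROR = re.compile(r"^\[(\d+)\]PETSC ERROR:")


def first_petsc_error(lines, context=5):
    """Return (index, preceding_lines) for the first PETSc error line.

    index is 0-based; preceding_lines holds up to `context` lines that
    come right before it. Returns None if no PETSc error line is found.
    """
    for i, line in enumerate(lines):
        if PETSC_ERROR.match(line.lstrip()):
            return i, lines[max(0, i - context):i]
    return None


def segv_ranks(lines):
    """Return the sorted list of ranks that reported signal 11 (SEGV)."""
    ranks = set()
    for line in lines:
        match = PETSC_ERROR.match(line.lstrip())
        if match and "Caught signal number 11" in line:
            ranks.add(int(match.group(1)))
    return sorted(ranks)


if __name__ == "__main__":
    log = Path("stderr.txt")  # attachment name from the original post
    if log.exists():
        lines = log.read_text(errors="replace").splitlines()
        hit = first_petsc_error(lines)
        if hit is None:
            print("no PETSc error lines found")
        else:
            index, before = hit
            print(f"first PETSc error at line {index + 1}; preceding lines:")
            for text in before:
                print("   ", text)
            print("ranks reporting SEGV:", segv_ranks(lines))
```

If only one or two ranks report the SEGV, that points toward a problem local to those processes; if the lines before the first PETSc error contain a PyLith exception message, that message is usually the real cause.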