PETSc error running on cluster

Hi all,

I have successfully configured PyLith on the cluster. However, when I try to run a job with a large number of DOFs on 50 processes, it fails with a PETSc error. Here is the command I use to submit the batch job on a PBS system:

pylith pylithapp.cfg ~/.pyre/pylithapp/pylithapp.cfg --job.queue=secondary --job.name=example9 --job.stdout=example9.log --job.stderr=example9.err --job.walltime=60*minute --scheduler.ppn=10 --nodes=50

The MPI module I load is:

module load mvapich2/2.3-intel-18.0

I am attaching the log file and error file; I hope they help identify what I am doing wrong.

example9.log (5.0 KB) example9_out.txt (81.3 KB)

It looks like there are HDF5 errors when writing a dataset:

[28]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[28]PETSC ERROR: Error in external library
[28]PETSC ERROR: Error in HDF5 call H5Dcreate2() Status -1
[28]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[28]PETSC ERROR: Petsc Development GIT revision: v3.7.6-4826-gd686aaf  GIT Date: 2017-08-03 14:01:44 -0500
[28]PETSC ERROR: /home/xiaoma5/project-beckman/pylith_home/pylith/bin/mpinemesis on a arch-pylith named golub035 by xiaoma5 Sat Jun  1 00:02:01 2019
[28]PETSC ERROR: Configure options --prefix=/home/xiaoma5/project-beckman/pylith_home/pylith --with-c2html=0 --with-x=0 --with-clanguage=C --with-mpicompilers=1 --with-shared-libraries=1 --with-64-bit-points=1 --with-large-file-io=1 --download-chaco=1 --download-ml=1 --download-f2cblaslapack=1 --with-hdf5=1 --with-hdf5-dir=/home/xiaoma5/project-beckman/pylith_home/pylith --with-zlib=1 --LIBS=-lz --with-debugging=0 --with-fc=0 CPPFLAGS="-I/home/xiaoma5/project-beckman/pylith_home/pylith/include -I/home/xiaoma5/project-beckman/pylith_home/pylith/include " LDFLAGS="-L/home/xiaoma5/project-beckman/pylith_home/pylith/lib -L/home/xiaoma5/project-beckman/pylith_home/pylith/lib64 -L/home/xiaoma5/project-beckman/pylith_home/pylith/lib -L/home/xiaoma5/project-beckman/pylith_home/pylith/lib64 " CFLAGS="-g -O2" CXXFLAGS="-g -O2 -DMPICH_IGNORE_CXX_SEEK" FCFLAGS= PETSC_DIR=/home/xiaoma5/project-beckman/pylith_home/build/pylith/petsc-pylith PETSC_ARCH=arch-pylith
[28]PETSC ERROR: #1 VecView_MPI_HDF5() line 677 in /projects/beckman/xiaoma5/pylith_home/build/pylith/petsc-pylith/src/vec/vec/impls/mpi/pdvec.c
[28]PETSC ERROR: #2 VecView_MPI() line 817 in /projects/beckman/xiaoma5/pylith_home/build/pylith/petsc-pylith/src/vec/vec/impls/mpi/pdvec.c
[28]PETSC ERROR: #3 VecView() line 583 in /projects/beckman/xiaoma5/pylith_home/build/pylith/petsc-pylith/src/vec/vec/interface/vector.c
[28]PETSC ERROR: #4 virtual void pylith::meshio::DataWriterHDF5::open(const pylith::topology::Mesh&, int, const char*, int)() line 230 in ../../../pylith-2.2.1/libsrc/pylith/meshio/DataWriterHDF5.cc

It looks like you are using DataWriterHDF5 for output. For large runs on a cluster, we strongly recommend using DataWriterHDF5Ext instead: it writes the data using low-level MPI I/O and then wraps it in a small HDF5 file holding the metadata (HDF5 with external datasets). This writer is much more robust. See the snippet below for how to switch writers.
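As a rough sketch, switching writers is a one-line change per output component in your .cfg files. The component paths below (a domain output and a material named elastic) are only examples; yours will depend on how your problem is set up:

[pylithapp.problem.formulation.output.domain]
# Switch from the default DataWriterHDF5 to the external-datasets writer.
writer = pylith.meshio.DataWriterHDF5Ext
writer.filename = output/example9.h5

[pylithapp.timedependent.materials.elastic.output]
writer = pylith.meshio.DataWriterHDF5Ext
writer.filename = output/example9-elastic.h5

With this writer, the .h5 file holds just the metadata and the large datasets go into accompanying raw binary files.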

Also, make sure you are writing to a filesystem that supports parallel output via MPI I/O. If you are unsure, check with your cluster's system administrator.
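If you want a quick check yourself, one way (a sketch assuming a Linux cluster with GNU coreutils; substitute your actual output directory for the hypothetical path) is to print the filesystem type of the output location. Parallel filesystems usually report something like lustre or gpfs, while nfs is often a poor choice for MPI I/O:

# Print the filesystem type of the (hypothetical) output directory.
stat -f -c %T /path/to/output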