Stampede2 crash debugging

I have been trying to run an Aspect problem on Stampede2. However, I am currently running into a crash when building the Stokes preconditioner.

I am running deal.II 9.1.1 (an earlier version of this post had a typo in the version number), installed with candi following the wiki guide for setting up Aspect on Stampede. The Aspect version is the branch in PR 3266.

I am currently running in Release mode, and the only error text produced is (with some personal path segments removed):

ML::FATAL ERROR:: 1, /${INSTALLPATH}/candi-install/tmp/unpack/Trilinos-trilinos-release-12-10-1/packages/ml/src/Utils/ml_MultiLevelPreconditioner_NullSpace.cpp, line 98

Looking at the file in question, this error appears to be raised at a point where a message should also be printed stating that the null space vectors are NULL.

Does anyone have any suggestions about either what the problem might be, or what I could do to isolate the source of the problem?


Configuration:

Module files loaded are:

Currently Loaded Modules:
  1) git/2.9.0   2) autotools/1.1   3) xalt/2.7.9   4) TACC   5) cmake/3.7.1   6) gcc/7.1.0   7) impi/17.0.3   8) mkl/17.0.4   9) python/2.7.13  10) libfabric/1.7.0  11) python2/2.7.14

Further testing notes:

  • Test problem use_full_A_block_preconditioner completed successfully with 1 and 2 processes.
  • Running the problem under consideration with 1 and 2 processes and no mesh refinement did not produce the error. Currently attempting 4, then will try 256. Note that the original error occurred during the preconditioner building step, before any refinement attempts could be made.
  • Problem occurs when running with 256 processes. An unsubstantiated guess is that the cause is related to the number of cells on each processor.
  • Problem occurs with the use_full_A_block_preconditioner problem altered for one additional level of refinement on 256 cores, but does not occur on 128.
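
One way to probe the cells-per-process guess is to estimate how many cells each rank would own for a given refinement level and process count. The following is only a rough sketch: it assumes uniform global refinement (each cell splits into 2^dim children) and a perfectly even split across ranks, and the coarse cell count, dimension, and refinement levels are hypothetical placeholders, not values taken from the actual parameter file. The real distribution is decided by the mesh partitioner and may differ.

```python
from typing import Tuple


def active_cells(coarse_cells: int, refinement_level: int, dim: int) -> int:
    """Cells after uniform global refinement (each cell -> 2**dim children)."""
    return coarse_cells * (2 ** dim) ** refinement_level


def cells_per_process(total_cells: int, n_processes: int) -> Tuple[int, int]:
    """(min, max) cells per rank, assuming an idealized even split."""
    base, remainder = divmod(total_cells, n_processes)
    return base, base + (1 if remainder else 0)


if __name__ == "__main__":
    # Hypothetical inputs: replace with the values from the parameter file.
    for level in (2, 3, 4):
        total = active_cells(coarse_cells=1, refinement_level=level, dim=2)
        for ranks in (128, 256):
            fewest, most = cells_per_process(total, ranks)
            print(f"level={level} ranks={ranks} cells={total} "
                  f"per rank={fewest}..{most}")
```

If the minimum drops to 0 or 1 for the failing configuration but not for the working one, that would at least be consistent with the guess, though it would not prove it.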

Hi Jon,

I have not seen this error before. Can you confirm that you are in fact using deal.II 8.1.1 and not 9.1.1?

John

I think I made a typo, and it should state that I am using 9.1.1.

I am currently looking into rebuilding Aspect in Debug mode, and seeing if that produces a more useful error.


Aspect in Debug mode does not change the error output.

Yes – always test in debug mode!
W.

To clarify, the problem in question was run beforehand on another machine (though not on exactly the same checkout) without producing an error, and another problem ran successfully with the build used here, so I believed I had the correct configuration.

As I noted in my edit, there was no significant change in the error output when Aspect was compiled in debug mode.

I would like to be copied on this discussion. Is there a standard way to do it other than referencing myself as in @egpuckett? (I hope that works … :slight_smile: )

@egpuckett look for the “!” icon to the right of this message. Click on it and select “Watching”. You can do the same at the category level if you wish to follow the entire ASPECT forum.

Yes, I have seen this exact error message on Frontera (very similar TACC system to Stampede2). I haven’t been able to reproduce it, though. Can you post what compiler and MPI library you are using? Is this with DEAL_II_WITH_64BIT_INDICES?

I do not believe I have “DEAL_II_WITH_64BIT_INDICES” set at the moment, though now that I consider it, I will need to set it.
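
For anyone wanting to check this rather than rely on memory, here is a small sketch. It assumes that the installed deal.II `config.h` header (typically under `include/deal.II/base/` in the install tree; the exact path on a candi install is an assumption) contains a `#define DEAL_II_WITH_64BIT_INDICES` line when the option is enabled. The helper names are hypothetical. The second function only compares a DoF count against the largest value an unsigned 32-bit integer can hold, as a rough indication of when 64-bit indices become necessary.

```python
import re

# Largest value representable by an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1

_DEFINE_RE = re.compile(
    r"^\s*#\s*define\s+DEAL_II_WITH_64BIT_INDICES\b", re.MULTILINE
)


def has_64bit_indices(config_h_text: str) -> bool:
    """True if the header text contains an active #define for the option."""
    return bool(_DEFINE_RE.search(config_h_text))


def exceeds_32bit(n_dofs: int) -> bool:
    """True if a global DoF count cannot be indexed with 32 unsigned bits."""
    return n_dofs > UINT32_MAX


if __name__ == "__main__":
    # Inline sample text; in practice, read the real config.h instead.
    sample = "/* #undef DEAL_II_WITH_64BIT_INDICES */\n"
    print("64-bit indices enabled:", has_64bit_indices(sample))
    print("5e9 DoFs need 64-bit:", exceeds_32bit(5_000_000_000))
```

To use it on a real install, read the header with `open(path).read()` and pass the text to `has_64bit_indices`.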

Module files loaded are:

Currently Loaded Modules:
  1) git/2.9.0   2) autotools/1.1   3) xalt/2.7.9   4) TACC   5) cmake/3.7.1   6) gcc/7.1.0   7) impi/17.0.3   8) mkl/17.0.4   9) python/2.7.13  10) libfabric/1.7.0  11) python2/2.7.14

I have added some further testing notes which suggest that we may be hitting an error which is dependent on the number of cells per process.