Stampede2 crash debugging

class4kayaker · October 21, 2019, 8:32pm

I have been trying to run an Aspect problem on Stampede2. However, I am currently running into a crash when building the Stokes preconditioner.

I am running deal.II 9.1.1 (typo earlier), installed with candi following the wiki guide for setting up Aspect on Stampede2; the Aspect version is the branch in PR 3266.

I am currently running in Release mode, and the only error text produced is (with some personal path segments removed):

ML::FATAL ERROR:: 1, /${INSTALLPATH}/candi-install/tmp/unpack/Trilinos-trilinos-release-12-10-1/packages/ml/src/Utils/ml_MultiLevelPreconditioner_NullSpace.cpp, line 98

Checking the file in question, this error appears to be raised at a point where a message should also be printed indicating that the null space vectors are NULL.

Does anyone have any suggestions about either what the problem might be, or what I could do to isolate the source of the problem?

Configuration:

deal.II 9.1.1, candi install using the guide from the wiki
Aspect at the now-merged branch from PR 3266

Module files loaded are:

Currently Loaded Modules:
  1) git/2.9.0   2) autotools/1.1   3) xalt/2.7.9   4) TACC   5) cmake/3.7.1   6) gcc/7.1.0   7) impi/17.0.3   8) mkl/17.0.4   9) python/2.7.13  10) libfabric/1.7.0  11) python2/2.7.14

Further testing notes:

Test problem use_full_A_block_preconditioner completed successfully for 1 and 2 processes.
Running the problem under consideration with 1 and 2 processes, with no mesh refinement specified, did not produce the error. Currently attempting 4, then will try 256. Note that the original error occurred during the preconditioner building step, before any refinement attempts could be made.
Problem occurs when running with 256 processes. An unsubstantiated guess at the cause is that the problem may be due to the number of cells on each processor.
Problem occurs with the use_full_A_block_preconditioner problem altered for one additional level of refinement on 256 cores, but does not occur on 128.
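
The cells-per-process guess can be put into rough numbers. The following is a minimal sketch, assuming a uniformly refined mesh in which every global refinement step multiplies the cell count by 2^dim. The function name `cells_per_process` and the mesh values (one coarse cell, 2D, three global refinements) are hypothetical placeholders, not values taken from the actual parameter file, and the result is only an average that ignores how the partitioner really distributes cells.

```python
def cells_per_process(coarse_cells, dim, global_refinements, n_processes):
    """Average number of active cells per MPI process after uniform refinement.

    Assumption: each global refinement splits every cell into 2**dim children.
    """
    total_cells = coarse_cells * (2 ** dim) ** global_refinements
    return total_cells / n_processes


if __name__ == "__main__":
    # Hypothetical mesh: 1 coarse cell, 2D, 3 global refinements (64 cells).
    for n in (1, 2, 4, 128, 256):
        avg = cells_per_process(1, 2, 3, n)
        note = "  (fewer cells than processes)" if avg < 1 else ""
        print(f"{n:4d} processes: {avg:8.2f} cells/process{note}")
```

Comparing this average across the process counts that fail and those that succeed may help confirm or rule out the guess.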

jbnaliboff · October 21, 2019, 8:51pm

Hi Jon,

I have not seen this error before. Can you confirm that you are in fact using deal 8.1.1 and not 9.1.1?

John

class4kayaker · October 21, 2019, 8:55pm

I think I made a typo, and it should state that I am using 9.1.1.

I am currently looking into rebuilding Aspect in Debug mode, and seeing if that produces a more useful error.

Aspect in Debug mode does not change the error output.

bangerth · October 21, 2019, 10:58pm

Yes – always test in debug mode!
W.

class4kayaker · October 22, 2019, 12:36am

To clarify, the problem in question was run on another machine (though not on the precise same checkout) beforehand without producing an error, and another problem was successfully run with the build used, so I believed I had the correct configuration.

As I noted in my edit, there was no significant change in the error output when Aspect was compiled in debug mode.

egpuckett · October 22, 2019, 12:57am

I would like to be copied on this discussion. Is there a standard way to do it other than referencing myself as in @egpuckett? (I hope that works … )

ljhwang · October 22, 2019, 2:35am

@egpuckett look for the “!” icon to the right of this message. Click on it and select “Watching”. You can do the same at the category level if you wish to follow the entire ASPECT forum.

tjhei · October 22, 2019, 4:11pm

Yes, I have seen this exact error message on Frontera (very similar TACC system to Stampede2). I haven’t been able to reproduce it, though. Can you post what compiler and MPI library you are using? Is this with DEAL_II_WITH_64BIT_INDICES?

class4kayaker · October 22, 2019, 4:27pm

I do not believe I currently have “DEAL_II_WITH_64BIT_INDICES” set, though now that I consider it, I will need to set it.
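
One way to check the setting is to inspect the `config.h` header that deal.II generates at configure time. The sketch below is a minimal illustration, assuming the header has an active `#define DEAL_II_WITH_64BIT_INDICES` line when the option is on and a commented-out `#undef` when it is off. The helper name `has_64bit_indices` and the sample header text are hypothetical.

```python
import re

# Matches an active "#define DEAL_II_WITH_64BIT_INDICES" at the start of a line.
# A commented-out "/* #undef ... */" line does not match.
_DEFINE = re.compile(
    r"^\s*#\s*define\s+DEAL_II_WITH_64BIT_INDICES\b", re.MULTILINE
)


def has_64bit_indices(config_text):
    """Return True if the given config.h text enables 64-bit indices."""
    return bool(_DEFINE.search(config_text))


if __name__ == "__main__":
    # Hypothetical excerpts standing in for the real generated header.
    enabled = "#define DEAL_II_WITH_MPI\n#define DEAL_II_WITH_64BIT_INDICES\n"
    disabled = "#define DEAL_II_WITH_MPI\n/* #undef DEAL_II_WITH_64BIT_INDICES */\n"
    print(has_64bit_indices(enabled))
    print(has_64bit_indices(disabled))
```

To use it on a real install, read the contents of the installed `config.h` (its location depends on the candi install path) and pass the text to `has_64bit_indices`.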

Module files loaded are:

Currently Loaded Modules:
  1) git/2.9.0   2) autotools/1.1   3) xalt/2.7.9   4) TACC   5) cmake/3.7.1   6) gcc/7.1.0   7) impi/17.0.3   8) mkl/17.0.4   9) python/2.7.13  10) libfabric/1.7.0  11) python2/2.7.14

class4kayaker · October 25, 2019, 6:01pm

I have added some further testing notes which suggest that we may be hitting an error which is dependent on the number of cells per process.
