Floating point exception in Trilinos ML after modifying Trilinos build flags to resolve BLAS float issue on cluster COSMA

Dear developers,

I am facing a runtime issue with ASPECT on a new HPC cluster (COSMA) and would appreciate any insight. Below I describe the installation history and the problem in detail.

1. Initial problem: Trilinos build failure in Candi

I am installing deal.II and ASPECT using Candi. During the Trilinos configuration step, I encountered the following error:

“Tpetra: Tpetra_INST_FLOAT is ON, but HAVE_TEUCHOS_BLASFLOAT is OFF.
This means that you are linking with a BLAS library that lacks float (S) support.
Tpetra needs a BLAS implementation that supports float.

– Configuring incomplete, errors occurred!
Failure with exit status: 1”

So Trilinos could not be configured because the BLAS library available on the cluster does not provide float precision support.
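One way to confirm that diagnosis is to look at the symbols the BLAS library exports. The following is only a minimal sketch: the helper name `has_float_blas` and the `BLAS_LIB` variable are hypothetical, and the three routine names checked (`sdot`, `sgemm`, `saxpy`) are just a sample of the single-precision (S) routines, not the full set Tpetra may need.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a symbol listing (as printed by
# `nm -D --defined-only <library>`) contains single-precision BLAS routines.
# The symbol name is taken from the last column; a trailing underscore
# (Fortran name mangling) is accepted but not required.
has_float_blas() {
    awk '$NF ~ /^(sdot|sgemm|saxpy)_?$/ { found = 1 } END { exit !found }'
}

# BLAS_LIB is an assumption: set it to the library Trilinos is linked against.
if [ -n "${BLAS_LIB:-}" ] && [ -r "$BLAS_LIB" ]; then
    if nm -D --defined-only "$BLAS_LIB" | has_float_blas; then
        echo "single-precision BLAS symbols found in $BLAS_LIB"
    else
        echo "no single-precision BLAS symbols found in $BLAS_LIB"
    fi
fi
```

If the listing shows only `d*` (double) routines, that would be consistent with the Tpetra message above.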

2. Workaround suggested by cluster support

The cluster support team suggested modifying the Candi Trilinos configuration by disabling float instantiations:

“-D Trilinos_ENABLE_FLOAT:BOOL=OFF” instead of “-D Trilinos_ENABLE_FLOAT:BOOL=ON”

They also suggested some additional changes in PETSc related to 64-bit indices and MPI compiler settings.
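For the float switch specifically, the change can be previewed before touching the file. This is a sketch only: the helper name `disable_trilinos_float` is hypothetical, and the package file path is an assumption about where a candi checkout keeps its Trilinos options, so adjust it to your clone.

```shell
#!/bin/sh
# Hypothetical helper: print the file given as $1 with the Trilinos float
# switch flipped from ON to OFF. Nothing is overwritten; output goes to stdout.
disable_trilinos_float() {
    sed 's/Trilinos_ENABLE_FLOAT:BOOL=ON/Trilinos_ENABLE_FLOAT:BOOL=OFF/g' "$1"
}

# Assumed location inside a candi checkout; override PKG_FILE if yours differs.
PKG_FILE="${PKG_FILE:-deal.II-toolchain/packages/trilinos.package}"
if [ -r "$PKG_FILE" ]; then
    # Show only the affected lines so the change can be checked by eye.
    disable_trilinos_float "$PKG_FILE" | grep -n 'Trilinos_ENABLE_FLOAT'
fi
```

Printing the result rather than editing in place keeps the original configuration available for comparison if the rebuilt stack misbehaves.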

After applying these changes and rebuilding from a clean clone, I was able to successfully install:

Trilinos, deal.II, ASPECT

3. New problem: ASPECT runtime crash (Floating Point Exception)

Although compilation succeeded, ASPECT crashes at runtime with a floating point exception.

The simulation starts normally, creates the output directory, and prints mesh and DoF information:

“– This is ASPECT –
– The Advanced Solver for Planetary Evolution, Convection, and Tectonics. –

– . version 3.1.0-pre (main, 0b101bb19)

– . using deal.II 9.7.0
– . with 32 bit indices
– . with vectorization level 3 (AVX512, 8 doubles, 512 bits)
– . using Trilinos 16.1.0
– . using p4est 2.3.6
– . using Geodynamic World Builder 1.0.0
– . running in DEBUG mode
– . running with 1 MPI process


The output directory <output-continental_extension/> provided in the input file appears not to exist.

ASPECT will create it for you.


– For information on how to cite ASPECT, see:

The ASPECT mantle convection code: How to cite?

Number of active cells: 3,200 (on 4 levels)
Number of degrees of freedom: 107,649 (26,082+3,321+13,041+13,041+13,041+13,041+13,041+13,041)

Number of mesh deformation degrees of freedom: 6,642”

The job then crashes with:

“[m5001:2709721:0:2709721] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:2709721) ====
0 0x000000000003ebf0 _GI___sigaction() :0
1 0x0000000000004e85 ddot() ???:0
2 0x0000000000194b03 ML_gdot() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Utils/ml_utils.c:1600
3 0x000000000013eb13 ML_CG_ComputeEigenvalues() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Krylov/ml_cg.c:316
4 0x0000000000143de0 ML_Krylov_Solve() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Krylov/ml_krylov.c:358
5 0x00000000000a5fee ML_AGG_Gen_Prolongator() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Coarsen/ml_agg_genP.c:511
6 0x00000000000a98ef ML_MultiLevel_Gen_Prolongator() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Coarsen/ml_agg_genP.c:3594
7 0x00000000000a109c ML_Gen_MultiLevelHierarchy() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Coarsen/ml_agg_genP.c:3153
8 0x00000000000a3f24 ML_Gen_MultiLevelHierarchy_UsingAggregation() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Coarsen/ml_agg_genP.c:2994
9 0x00000000001e3c2c ML_Epetra::MultiLevelPreconditioner::ComputePreconditioner() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Utils/ml_MultiLevelPreconditioner.cpp:2413
10 0x00000000001e71aa ML_Epetra::MultiLevelPreconditioner::MultiLevelPreconditioner() /cosma/apps/durham/dc-roy3/softwares/candi_9.5.1-r1/candi/build/tmp/unpack/Trilinos-trilinos-release-16-1-0/packages/ml/src/Utils/ml_MultiLevelPreconditioner.cpp:356
11 0x000000001b685c62 dealii::TrilinosWrappers::PreconditionAMG::initialize() ???:0
12 0x000000001b685a5d dealii::TrilinosWrappers::PreconditionAMG::initialize() ???:0
13 0x000000001b6859db dealii::TrilinosWrappers::PreconditionAMG::initialize() ???:0
14 0x00000000047059f9 aspect::MeshDeformation::MeshDeformationHandler<2>::compute_mesh_displacements() ???:0
15 0x0000000004700e5f aspect::MeshDeformation::MeshDeformationHandler<2>::setup_dofs() ???:0
16 0x0000000003eed63d aspect::Simulator<2>::setup_dofs() ???:0
17 0x0000000003ef0549 aspect::Simulator<2>::run() ???:0”

I would like to ask:

  1. Is disabling Trilinos_ENABLE_FLOAT compatible with deal.II and ASPECT?

  2. Could the floating point exception be caused by this modification in Trilinos?

  3. Is there any way to install deal.II and ASPECT?

Any leads would be really appreciated!

Best regards,

Poulami Roy

Hi Poulami,

I ran into a similar issue recently (during the Trilinos build), because I was initially linking against the MPI version of BLAS. Switching to the sequential BLAS implementation was enough to sort things out. So far I haven’t seen any issues while running ASPECT, and the performance looks good as well; afaik deal.II doesn’t even use MPI parallelization in the BLAS library itself.

Alternatively, you can have candi build OpenBLAS by adding once:openblas to the package list in your local.cfg or candi.cfg file.
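A package list with OpenBLAS enabled might look roughly like the sketch below. Only the `once:openblas` entry comes from the suggestion above; the other entries and their order are assumptions for illustration, so keep whatever your own candi.cfg or local.cfg already lists.

```shell
# Sketch of a candi package list with OpenBLAS added.
# The surrounding entries are illustrative assumptions, not a required set.
PACKAGES="load:dealii-prepare"
# Build OpenBLAS before the packages that link against BLAS.
PACKAGES="${PACKAGES} once:openblas"
PACKAGES="${PACKAGES} once:p4est"
PACKAGES="${PACKAGES} once:trilinos"
PACKAGES="${PACKAGES} dealii"
```

Placing OpenBLAS ahead of Trilinos is meant to ensure the library exists by the time Trilinos is configured.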

best,

Leevi

Hi @Poulami ,

I encountered a similar issue when using Intel MKL (see “Fix MKL-related floating-point configuration conflicts in Trilinos build”), where the solution was to disable the single-precision options in Trilinos, that is, to turn off Trilinos_ENABLE_FLOAT and the related Tpetra float instantiations. For example:

if [ "${MKL}" = "ON" ]; then
    CONFOPTS="${CONFOPTS} \
      ...
      -D Tpetra_INST_FLOAT:BOOL=OFF \
      -D Trilinos_ENABLE_FLOAT:BOOL=OFF \
      -D Tpetra_INST_COMPLEX_FLOAT:BOOL=OFF"
      ...
else
    ...
    CONFOPTS="${CONFOPTS} -D Trilinos_ENABLE_FLOAT:BOOL=ON"
    ...
fi

I understand you might not be using MKL specifically, but you could try a similar approach of disabling single-precision (float) support in Trilinos. This often helps avoid build or runtime errors caused by incomplete float support in BLAS.

Is there any way to install deal.II and ASPECT?

You may try the latest candi installer for installing deal.II. If you are uncertain whether your cluster’s BLAS library is compatible, candi also offers the option to automatically install OpenBLAS. This can help avoid build or runtime issues caused by system BLAS lacking float support.

If this does not work, could you please share the list of loaded modules (module list) or details about the version and path of the BLAS library you are using?
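The requested information could be collected along these lines. This is a sketch under assumptions: `ASPECT_BIN` and the helper `filter_blas_libs` are hypothetical names, and the pattern only matches a few common BLAS/LAPACK library names.

```shell
#!/bin/sh
# Hypothetical helper: keep only BLAS/LAPACK-like entries from `ldd` output.
# `|| true` keeps the exit status clean when nothing matches.
filter_blas_libs() {
    grep -i -E 'blas|lapack|mkl|blis' || true
}

# ASPECT_BIN is an assumption; point it at your ASPECT executable.
ASPECT_BIN="${ASPECT_BIN:-./aspect}"

# `module` is typically a shell function that prints to stderr, hence 2>&1.
if command -v module >/dev/null 2>&1; then
    module list 2>&1
fi

# Show which BLAS/LAPACK libraries the executable resolves to at run time.
if [ -x "$ASPECT_BIN" ]; then
    ldd "$ASPECT_BIN" | filter_blas_libs
fi
```

The `ldd` output shows the libraries actually resolved at run time, which may differ from the ones selected during configuration.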

I hope this helps!

Best,
Ninghui