I am trying to run global convection models with highly resolved upper-mantle structure. The initial temperature model I have created has ~76M points on a structured mesh in spherical coordinates, which I have exported to an ASCII file that is ~2.1 GB. However, I have not yet been able to use it successfully as the initial temperature model in ASPECT. I’ve used the same export script on lower-resolution models that have run without problems, so I don’t think it’s a problem with the formatting etc. of the input file. I’m running the jobs on the STAMPEDE2 cluster on 8 skx nodes.
I’ve gotten two different errors. The first was with an earlier version of the input file that was 4.1 GB:
Loading Ascii data initial file /work2/08512/tg878111/stampede2/software/aspect_work/inputs/tomography/ascii_input.txt.
Fatal error in PMPI_Bcast: Invalid count, error stack:
PMPI_Bcast(2654): MPI_Bcast(buf=0x2ba14fdc1010, count=-214976017, MPI_CHAR, root=0, comm=0x84000004) failed
PMPI_Bcast(2605): Negative count, value is -214976017
TACC: MPI job exited with code: 2
TACC: Shutdown complete. Exiting.
I then changed my temperature model to use adaptive resolution (still on a structured mesh), which got the file size down to 2.1 GB. Now the problem I get is that I run out of time on my cluster job during the file read, even when I allocate 10 hours. Each skx node has 192 GB of RAM, so it shouldn’t be running out of memory.
Does anyone else have experience working with large input files, or have advice on how to use them gracefully and effectively?
Interesting. The error you see is indeed a bug. I’ve got a patch on the way that I will post later. I guess nobody has ever tried to broadcast such large files before.
As for the timeout, it would be useful to know where that happens. Presumably, something hangs somewhere. Could you put a line of debug output at the beginning of the read_and_distribute_file() function in source/utilities.cc to see whether the code actually ever makes it past the reading of the file?
The issue with the large file size is tracked here: https://github.com/geodynamics/aspect/pull/4484
Each skx node has 192 GB of RAM, so it shouldn’t be running out of memory.
I wouldn’t be so sure. Stampede2 has 48 cores per node, right? This means you have 4 GB per core. You have to store the text data (2 GB) plus the parsed data structure, which is probably also around 2 GB in size. You can ssh into the nodes and check the memory consumption while the computation hangs.
Unless you are using Wolfgang’s new MPI shared allocation for the tables. Wolfgang, is that working and enabled in ASPECT and if yes, since when?
I was double-checking that just now as well. It isn’t actually in ASPECT yet – the patch is here, and I’ve just rebased it onto the current sources.
You need the current deal.II dev branch for it, however.
Sam: The issue you are, or will be, running into is that storing the information for this one file on each process running on the same machine is of course quite wasteful. A typical simulation will use in the range of 1–2 GB of memory on each MPI process, but now you want to add several more GB to that. It seems reasonable that you are simply running out of memory. That’s what the patch (#4086) tries to address: it keeps this kind of read-only data only once per machine, rather than once for each of the possibly many MPI processes running on the machine. But, to use the patch, you would have to build a reasonably recent version of deal.II and rebuild ASPECT on top of it.
Thank you Wolfgang and Timo for your help! I will try with the patch, and see if it fixes my problem.
I just wanted to say: Thank you Sam for letting us know, and thank you Wolfgang and Timo for working on the fix! I just read a paper last week where @cedrict actually ran into the same problem:
The paper mentions a 2 GB limit for input data files in ASPECT that I wasn’t aware of. So this has been a problem for other applications too, and it’s great to see it being fixed!
I was looking through the patch, and I wanted to clarify - when you say “reasonably recent version of deal.II,” does this mean version 10.0?
10.0 isn’t out yet (and won’t be till around May). Any deal.II dev version from the last few months will do, though. (I hope so at least. I’m looking at the pull request again and there is a test that is failing that I will have to track down.)
I’ve now tried it with both of the new versions (#4086, #4484). My GitHub knowledge is rather limited and I wasn’t sure how to combine the fixes, so I tried each of the two separately. I’m still having the same problem as before, where it times out on the input file read (after 6 hours). Should I wait until all of the changes are merged into the main branch and go from there?
#4086 only does something if you have a sufficiently recent deal.II version. So chances are that it makes no difference to what you are doing already.
What you describe suggests that your program hangs somewhere. Do you know how to debug these problems? This is easiest if you can run your model on a single machine, preferably your local workstation, and if you can reproduce the lock-up with just two processes. I think I talk about the process of finding where a program hangs here: Wolfgang Bangerth's video lectures
I’m new to parallel computing, but thank you for the video! It was very helpful, and I will try to follow your strategy to find out exactly where it’s getting stuck.
Looking at the ASPECT log/command line output, it gives this statement (Loading Ascii data initial file [...]) twice, once for each of the two input text files (one for the initial thermal structure, 2 GB; one for the compositional field, 390 MB). The thermal structure, which is the larger of the two files, comes first, so it’s getting through the file read at least. Looking at past successful runs, the next thing that should come is
-- For information on how to cite ASPECT, see:
Number of active cells: 49,152 (on 4 levels)
Number of degrees of freedom: 3,001,642 (1,216,710+52,258+405,570+1,327,104)
So it’s getting stuck somewhere between reading in the files and starting the actual computations. I haven’t changed the resolution of the model mesh, so the only thing that’s bigger/more expensive is the input files. To me, this suggests that it’s getting stuck somewhere in interpolating the inputs onto the model mesh. Once I figure out the debugging workflow, hopefully I can figure out more specifically where the hang-up is.
Sam, the pull request “support reading larger than 2GB data files” by tjhei (geodynamics/aspect#4497) should fix the issue of data files bigger than 2 GB. It would be great if you could try it out.
Thanks for trying this out. There are almost certainly 10,000 lines of code or more between reading in these files and getting the simulation started.
Where exactly the hang happens is something that, once you understand the workflow, is not all that difficult to find: you just start the program in the debugger, wait for a couple of hours, stop the program, and get a backtrace from all MPI jobs. These backtraces will be the key to telling you where the program is hanging.
Like Timo says, it would be nice to try this with his patch applied.
I tried Timo’s new patch (#4497) and it successfully gets past the first error I was getting (Fatal error in PMPI_Bcast), even with the 4 GB file, but it is still getting hung up after reading in the file. I’m working on installing ASPECT on my local machine to debug and find out where.
I forgot to remove the Assert() Wolfgang added, my bad. Can you try again with the updated pull request? It is working for me at least.
I’ve set up ASPECT on my local desktop. I haven’t been able to reproduce the hang-up that I was getting on STAMPEDE2, but I did get this error after reading in the two files:
An error occurred in line <452> of file </home/sgoldberg/aspect/aspect-bigbcast/source/simulator/core.cc> in function
aspect::Simulator<dim>::Simulator(MPI_Comm, dealii::ParameterHandler&) [with int dim = 3; MPI_Comm = ompi_communicator_t*]
The violated condition was:
geometry_model->natural_coordinate_system() == Utilities::Coordinates::CoordinateSystem::cartesian
The limiter for the discontinuous temperature and composition
solutions has not been tested in non-Cartesian geometries and
currently requires the use of a Cartesian geometry model.
When I remove the line
set Use discontinuous composition discretization = true from my parameter file, it runs just fine and starts the computation as it should. This is all with the new patch (#4497).
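For anyone hitting the same assertion: the conflict is between a non-Cartesian geometry and the discontinuous (DG) composition discretization. A hypothetical parameter-file excerpt showing the two settings that cannot currently be combined (the spherical shell geometry is assumed here as an example of a non-Cartesian model):

```
# Assumed non-Cartesian geometry (any non-Cartesian model triggers this)
subsection Geometry model
  set Model name = spherical shell
end

subsection Discretization
  # This line triggers the assertion in core.cc; removing it (or setting
  # it to false) lets the run proceed.
  set Use discontinuous composition discretization = true
end
```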
Unfortunately STAMPEDE2 is offline until later this week so I can’t try it on there just yet, but hopefully this was the problem and I can get it to work.
STAMPEDE2 is back online now. I tried running the same set-up that I got to work on my local machine (same parameter file, input text files, and source code (#4497)), and it’s still getting hung up during or immediately after the second file read (for the compositional field). I can’t seem to reproduce this problem on my local machine. Any suggestions on how to debug on the cluster?
Are you sure you got the updated version from the PR (otherwise the code will hang) and compiled?
I would suggest you first try it with a small input file and see if that works. If not, you could add some print statements around the file loading to help identify how far you got.
I’m pretty sure, but I’m redownloading it and compiling it to check again. The job to run the new version of ASPECT and see if it works is in the queue, so hopefully I’ll find out soon.
To make sure I’m doing it right: I’m going to this page: https://github.com/tjhei/aspect/tree/bigbcast and downloading the .zip file of the source code. I’m compiling it following the instructions here (https://github.com/geodynamics/aspect/wiki/Compiling-and-Running-ASPECT-on-TACC-Stampede2) using the commands near the end:
cmake -DCMAKE_BUILD_TYPE=Release -DDEAL_II_DIR=$WORK2/software/candi/install/ ../
I’m using the same copy of deal.II (v9.3.0) as I was using before this issue arose, which I downloaded on November 18.