I am trying to run global convection models with highly resolved upper-mantle structure. The initial temperature model I have created has ~76M points on a structured mesh in spherical coordinates, which I have exported to an ASCII file that is ~2.1 GB. However, I have not yet been able to use it successfully as the initial temperature model input in ASPECT. I’ve used the same export script on lower-resolution models that have run without problems, so I don’t think the formatting of the input file is the problem. I’m running the jobs on the STAMPEDE2 cluster on 8 SKX nodes.
I’ve gotten two different errors. The first was with an earlier version of the input file that was 4.1 GB:
Loading Ascii data initial file /work2/08512/tg878111/stampede2/software/aspect_work/inputs/tomography/ascii_input.txt.
Fatal error in PMPI_Bcast: Invalid count, error stack:
PMPI_Bcast(2654): MPI_Bcast(buf=0x2ba14fdc1010, count=-214976017, MPI_CHAR, root=0, comm=0x84000004) failed
PMPI_Bcast(2605): Negative count, value is -214976017
TACC: MPI job exited with code: 2
TACC: Shutdown complete. Exiting.
I then changed my temperature model to use adaptive resolution (still on a structured mesh), which got the file size down to 2.1 GB. Now the problem is that my cluster job runs out of time during the file read, even when I allocate 10 hours. Each SKX node has 192 GB of RAM, so it shouldn’t be running out of memory.
Does anyone else have experience working with large input files, or have advice on how to use them gracefully and effectively?
Sam:
Interesting. The error you see is indeed a bug; I’ve got a patch on the way that I will post later. I guess nobody has ever tried to broadcast such large files before.
As for the timeout, it would be useful to know where that happens. Presumably, something hangs somewhere. Could you put a line of debug output at the beginning of the read_and_distribute_file() function in source/utilities.cc to see whether the code actually ever makes it past the reading of the file?
Each SKX node has 192 GB of RAM, so it shouldn’t be running out of memory.
I wouldn’t be so sure. Stampede2’s SKX nodes have 48 cores each, right? This means you have 4 GB per core. You have to store the text data (2 GB) plus the parsed data structure, which is probably also around 2 GB in size. You can ssh into the nodes and check the memory consumption while the computation hangs.
Unless you are using Wolfgang’s new MPI shared allocation for the tables. Wolfgang, is that working and enabled in ASPECT, and if yes, since when?
You need the current deal.II dev branch for it, however.
Sam: The issue you are, or will be running into is that storing the information for this one file on each process running on the same machine is of course quite wasteful. A typical simulation will use in the range of 1-2 GB of memory on each MPI process, but now you want to add several more GB to that. It seems reasonable that you are simply running out of memory. That’s what the patch (#4086) tries to address: It keeps this kind of read-only data only once per machine, rather than once for each of the possibly many MPI processes running on the machine. But, to use the patch, you would have to build a reasonably recent version of deal.II and rebuild ASPECT on top of it.
I just wanted to say: thank you, Sam, for letting us know, and thank you, Wolfgang and Timo, for working on the fix! I just read a paper last week where @cedrict actually ran into the same problem:
The paper mentions a 2 GB limit for input data files in ASPECT that I wasn’t aware of. So this has been a problem for other applications too, and it’s great to see it being fixed!
10.0 isn’t out yet (and won’t be till around May). Any deal.II dev version from the last few months will do, though. (I hope so at least. I’m looking at the pull request again and there is a test that is failing that I will have to track down.)
Best
W.
I’ve now tried it with both of the new versions (#4086, #4484). My GitHub knowledge is rather limited and I wasn’t sure how to combine the fixes, so I tried each of the two separately. I’m still having the same problem as before, where it times out on the input file read (after 6 hours). Should I wait until all of the changes are merged into the main branch and go from there?
#4086 only does something if you have a sufficiently recent deal.II version. So chances are that it makes no difference to what you are doing already.
What you describe suggests that your program hangs somewhere. Do you know how to debug these problems? This is easiest if you can run your model on a single machine, preferably your local workstation, and if you can reproduce the lock-up with just two processes. I think I talk about the process of finding where a program hangs here: Wolfgang Bangerth's video lectures
I’m new to parallel computing, but thank you for the video! It was very helpful, and I will try to follow your strategy to find out exactly where it’s getting stuck.
Looking at the ASPECT log/command-line output, it gives this statement (Loading Ascii data initial file [...]) twice, once for each of the two input text files (one for the initial thermal structure (2 GB), one for the compositional field (390 MB)). The thermal structure, the larger of the two files, comes first, so it’s getting through the first file read at least. Looking at past successful runs, the next thing that should come is
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=2.3.0&dg=1&geoid=1&sha=42505d0e3&src=code
-----------------------------------------------------------------------------
Number of active cells: 49,152 (on 4 levels)
Number of degrees of freedom: 3,001,642 (1,216,710+52,258+405,570+1,327,104)
So it’s getting stuck somewhere between reading in the files and starting the actual computations. I haven’t changed the resolution of the model mesh, so the only things that are bigger/more expensive are the input files. To me, this suggests that it’s getting stuck somewhere while interpolating the inputs onto the model mesh. Once I figure out the debugging workflow, hopefully I can pin down more specifically where the hang-up is.
Thanks for trying this out. There are almost certainly 10,000 lines of code or more between reading in these files and getting the simulation started.
Where exactly the hang happens is something that, once you understand the workflow, is not all that difficult to find: you just start the program in the debugger, wait for a couple of hours, stop the program, and get a backtrace from all MPI jobs. These backtraces will be the key to telling where the problem is.
Like Timo says, it would be nice to try this with his patch applied.
I tried Timo’s new patch (#4497) and it successfully gets past the first error I was getting (Fatal error in PMPI_Bcast), even with the 4 GB file, but it still hangs after reading in the file. I’m working on installing ASPECT on my local machine to debug and find out where.
I’ve set up ASPECT on my local desktop. I haven’t been able to reproduce the hang-up that I was getting on STAMPEDE2, but I did get this error after reading in the two files:
An error occurred in line <452> of file </home/sgoldberg/aspect/aspect-bigbcast/source/simulator/core.cc> in function
aspect::Simulator<dim>::Simulator(MPI_Comm, dealii::ParameterHandler&) [with int dim = 3; MPI_Comm = ompi_communicator_t*]
The violated condition was:
geometry_model->natural_coordinate_system() == Utilities::Coordinates::CoordinateSystem::cartesian
Additional information:
The limiter for the discontinuous temperature and composition
solutions has not been tested in non-Cartesian geometries and
currently requires the use of a Cartesian geometry model.
When I remove the line set Use discontinuous composition discretization = true from my parameter file, it runs just fine and starts the computation as it should. This is all with the new patch (#4497).
Unfortunately STAMPEDE2 is offline until later this week so I can’t try it on there just yet, but hopefully this was the problem and I can get it to work.
STAMPEDE2 is back online now. I tried running the same set-up that I got to work on my local machine (same parameter file, input text files, and source code (#4497)), and it’s still getting hung up during or immediately after the second file read (for the compositional field). I can’t seem to reproduce this problem on my local machine. Any suggestions on how to debug on the cluster?
Are you sure you got the updated version from the PR (otherwise the code will hang) and recompiled?
I would suggest you first try it with a small input file and see if that works. If not, you could add some print statements around the file loading to help identify how far you got.
I’m pretty sure, but I’m redownloading and recompiling it to check again. The job running the new version of ASPECT is in the queue, so hopefully I’ll find out soon.