It works for me using an interactive development session on a single node on STAMPEDE2, but I’m still getting the hang-up every time I submit it as a batch job. I’ve tried 1, 2, 4, and 8 nodes for the batch jobs to see whether the node count makes a difference. Am I doing something wrong in the batch script? I’ve pasted it below. The executable that I copy over is the same one that worked for me in the interactive development session.
It seems that the problem arises when I try to run it in parallel. It works when I run it as a single process (./aspect parameters.prm), but not in parallel (either ibrun aspect parameters.prm or mpirun -n 48 aspect parameters.prm). There is a parallel debugger on the TACC system (DDT Parallel Debugger - TACC User Portal), so I will try to figure out how to use that.
I’m not sure if this is the issue, but when I run models on Stampede2 I do not copy the executable to the $SCRATCH directory. That may explain why it works when you run an interactive session, assuming you are running the executable from its original directory.
Here is my workflow for running models on Stampede2:
1. Compile deal.II/ASPECT in the work directory.
2. Create a folder in work containing the model I want to run, any required files (initial composition ASCII file, etc.), and the sbatch submission script.
3. Copy that directory to $SCRATCH.
4. cd to the folder in $SCRATCH and submit the job.
Can you give this workflow a shot? An example sbatch script is also now located at the bottom of the ASPECT Stampede2 instructions page; the sketch below gives the general shape.
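For reference, something along these lines should work (the job name, queue, node/task counts, wall time, allocation name, and executable path are placeholders to adapt; the script on the instructions page is the authoritative version):

```bash
#!/bin/bash
#SBATCH -J aspect_model              # job name (placeholder)
#SBATCH -o aspect_model.o%j          # stdout file; %j expands to the job id
#SBATCH -e aspect_model.e%j          # stderr file
#SBATCH -p skx-normal                # Skylake production queue
#SBATCH -N 2                         # number of nodes
#SBATCH -n 96                        # total number of MPI tasks (48 per SKX node)
#SBATCH -t 04:00:00                  # wall time, hh:mm:ss
#SBATCH -A your_allocation           # placeholder: your TACC allocation/project

# Run from the model folder that was copied to $SCRATCH; the executable
# itself stays in the work directory (the path below is a placeholder).
# ibrun is TACC's MPI launcher and starts the -n tasks across the nodes.
ibrun $WORK/aspect/build/aspect parameters.prm
```

Submitting is then just a matter of cd-ing into the copied folder in $SCRATCH and running `sbatch <scriptname>`.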
Sam:
Yes, that’s the right approach. One way or the other, you’ll have to figure out where it hangs. A debugger is a good way to do that; other ways would be to insert debug print statements that tell you that execution (on a given process) has passed a certain point. If you have enough of those throughout your code, you can narrow down where it hangs. But the debugger is definitely the more comfortable way, because you don’t have to know the code and because you don’t have to modify it either.
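For example, a little helper of the following kind, called before and after the steps you suspect, will tell you how far each process gets (this is only a sketch; the function and marker names are made up):

```cpp
// Sketch of a checkpoint print for locating a hang; the function and
// marker names are made up. Sprinkle calls to it around the suspect code.
#include <mpi.h>
#include <iostream>
#include <string>

void report_checkpoint(const MPI_Comm comm, const std::string &marker)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  std::cerr << "[rank " << rank << "] reached: " << marker << std::endl;
  // std::endl flushes the stream, so the message shows up even if the
  // program hangs right afterwards.
}
```

The hang is then bracketed by the last marker that all processes print and the first one that some of them never reach.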
Best
W.
When I ran it in an interactive session, I was running the executable and parameter file from the $SCRATCH directory that had been copied over by my batch script, to reproduce the batch setup as closely as possible. When doing that, it worked as a single process but not in parallel (all in an interactive session).
I’ve made some progress, although I still don’t know where the hang-up is occurring. I’ve figured out that it runs fine, with no hang-up, if I use fewer cores per node. I haven’t found exactly where the magic number is, but it works fine with 32 cores/node and hangs with 48 cores/node. This is true for both an interactive session and a normal batch job.
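In case it helps anyone else, the only change needed in the sbatch script is the ratio of total tasks to nodes; the relevant lines look roughly like this (a sketch, with the rest of the script unchanged):

```bash
#SBATCH -N 2        # number of nodes
#SBATCH -n 64       # 64 total MPI tasks on 2 nodes = 32 tasks per 48-core SKX node
                    # (asking for all 48 cores per node, i.e. -n 96 here, is what hangs)
```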
I’ve also tried out the DDT debugger on STAMPEDE2. It works fine for setups that don’t hang (i.e. <=32 cores/node), but when I try 48 cores/node, I get the following error:
A debugger disconnected prematurely:
remote host closed the connection
One or more processes may be unusable.
This can occur normally at the end of a job on some systems, or it may occur if your batch system has set resource limits (e.g. maximum time or memory) and one or more of those limits was exceeded.
My guess is that it’s some sort of memory overload, since using fewer cores per node gives more memory per core. For now, I’m going to keep going with fewer cores per node, since that seems to work just fine. But if it would be useful to the ASPECT community to know where the hang-up was happening, I’m happy to keep digging further.
Sam:
Yes, this sounds like one of the processes running out of memory. These errors are quite difficult to catch, because the operating system generally grants a program more memory than is actually available, in the hope that the program will not actually use everything it has requested. If the program then does write into the memory it obtained this way, but no physical memory is available, the OS simply terminates the process without notifying the program. One doesn’t even get the chance to print an error message, nor to send a good-bye to the other processes in the MPI universe; they will just keep listening for replies to their requests until they finally time out.
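To illustrate the effect, here is a small stand-alone program (not ASPECT code; the 64 GiB figure is simply meant to exceed what a node physically has):

```cpp
// Stand-alone illustration of memory overcommit. malloc() succeeds even for
// far more memory than the machine has; the process is only killed when the
// pages are actually written to.
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
  const std::size_t n = std::size_t(64) * 1024 * 1024 * 1024;  // 64 GiB
  char *p = static_cast<char *>(std::malloc(n));
  if (p == nullptr)            // with overcommit enabled, this rarely triggers
    {
      std::fprintf(stderr, "malloc failed\n");
      return 1;
    }
  std::printf("malloc succeeded; now touching the memory...\n");
  std::memset(p, 1, n);        // here the OOM killer terminates the process,
                               // without giving it a chance to print anything
  std::printf("this line is never reached\n");
  std::free(p);
  return 0;
}
```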
I don’t know what we could do differently, but I’m glad you found the reason and the source of the problem!
You can check if you are low on memory by getting the running job information and then logging into one of the compute nodes using ssh. There you can check memory usage with “top”.
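For example (the node name is just a placeholder; take it from the NODELIST column of squeue):

```bash
squeue -u $USER        # lists your running jobs and the compute nodes they occupy
ssh c123-456           # log into one of the listed nodes (placeholder name)
top -u $USER           # watch the RES and %MEM columns of the aspect processes
```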
After Enable data sharing across processes by bangerth · Pull Request #4086 · geodynamics/aspect · GitHub is merged, loading a large ASCII file will no longer require 2 x (number of MPI ranks per node) x file size of memory, but only about 1-2 x file size. That will help you. But I would suggest that you try out one thing at a time…