Filesystem issues with statistics file

We are having some performance issues on our cluster, and the sysadmin has provided the following input about the way in which ASPECT updates the ‘statistics’ file. I was wondering if the developers could weigh in on whether this is indeed the case and what the best course of action would be to resolve the problem?

I looked at the logs for nas-5-0, where your homedir is.

Between 10am and 11am your code appended 281 bytes to a file 2000 times.

Normally that wouldn’t be a big deal, but your code has an impressively
inefficient way to append to files.

Normally a code would fopen the file (if not open), seek to the end, and append.

But in your case a new file would be created, the entire contents of the file
would be written, then the file would be moved in place.

The file in question is:
~/blah/statistics

Over that hour you used 90% of the bandwidth; user (redacted) was using 75% of the
IOPs, which is what sensitized the system to your workload. I’ll be sending
email to them next.

Yes, we currently rewrite the statistics file from scratch every timestep (instead of appending).

Does your cluster have a file system that is usable for (parallel) output? I always put the output/ folder on the high-performance file system (often called /scratch/, which is not your home directory). The statistics file is inside that folder, and writing a couple of kilobytes per timestep should not be expensive.
Alternatively, we could figure out a way to put the statistics file on a local filesystem of processor 0.

It’s something that it never occurred to me would be a problem. So in
that time, we wrote an average of about 250kB once every other second.
Parallel file systems have write performance in the range of 100s of
MB/s today, so what we write sounds like a rather benign fraction.

So I’m having a cognitive disconnect between what I think is true and
what appears to be a real problem – which usually indicates a learning
opportunity for myself. Can you (or your sysadmin) clarify how that
accounted for 90% of the bandwidth?

Or was it that we write those bytes in many small write requests instead
of one big data blob? That should be relatively easy to fix by writing
into an ostringstream first, and then outputting everything as one long
string.
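
Something along these lines is what I have in mind; the function name and arguments below are just made up for illustration, not what ASPECT actually does:

    #include <fstream>
    #include <sstream>
    #include <string>

    // Sketch: assemble the whole statistics row in memory first, then issue
    // a single write instead of many small ones.
    void write_statistics_row(std::ofstream &statistics_file,
                              const unsigned int timestep,
                              const double time)
    {
      std::ostringstream buffer;
      buffer << timestep << ' ' << time << '\n';   // ...plus all other columns

      const std::string row = buffer.str();
      statistics_file.write(row.c_str(), row.size());  // one big blob
      statistics_file.flush();
    }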

Best
W.

Thanks Timo. Maybe others have not run into this problem because they’re not doing long time integrations? I’m trying to run a model to statistically steady state, and after 115,000 timesteps this file is about 32MB, and it seems to be increasingly problematic to keep re-writing it over and over again. We do not have a parallel filesystem - our output directory is on an NFS share. I think that this is likely the case on a lot of mid-scale clusters with a few thousand cores. We do have a dedicated local /scratch drive on each machine. One solution might be to add an option to only re-write the statistics file every X timesteps? I would set this to the checkpoint spacing by default.
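
Something like the following sketch is what I have in mind; the parameter and function names are just illustrative, not actual ASPECT code:

    #include <deal.II/base/table_handler.h>
    #include <fstream>

    // Hypothetical throttle: rewrite the statistics file only every
    // 'statistics_output_interval' timesteps (e.g. the checkpoint spacing)
    // and at the end of the run, instead of after every single timestep.
    void maybe_write_statistics(const dealii::TableHandler &statistics,
                                const unsigned int timestep_number,
                                const bool simulation_finished,
                                const unsigned int statistics_output_interval)
    {
      if (timestep_number % statistics_output_interval == 0 || simulation_finished)
        {
          std::ofstream stat_file("output/statistics");
          statistics.write_text(stat_file);   // still a full rewrite, but much rarer
        }
    }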

Wolfgang,
I followed up with our sysadmin and will post his reply (or we can discuss by email). However, I don’t follow your logic that aspect wrote 250 kB every other second. The statistics file is 32 MB and the entire thing gets written every timestep. It got written 2000 times in a one-hour period, so this consumes 17 MB/s. I agree that in principle this shouldn’t be a huge problem, but there are other jobs running simultaneously, and this is not a reasonable I/O pattern on a shared resource.

Max

I followed up with our sysadmin and will post his reply (or we can
discuss by email). However, I don’t follow your logic that aspect wrote
250 kB every other second. The statistics file is 32 MB and the /entire
thing/ gets written every timestep. It got written 2000 times in a one
hour period, so this consumes 17 MB/s.

I went by the statement you had originally posted:
“Between 10am and 11am your code appended 281 bytes to a file 2000
times.”
That added up to 560 kB (2000 times 281 bytes) at the end of a
one-hour period. The average file size would then have been 280 kB over
the course of that hour, written once every other second.

Your 32 MB doesn’t mesh with the 281 x 2000, but no matter: I see that the
number you have is indeed ~100 times larger.

I agree that in principle this
shouldn’t be a huge problem, but there are other jobs running
simultaneously and this is not a reasonable I/O pattern on a shared
resource.

I totally agree. It’s something that I don’t think any of us ever
thought about, but that doesn’t make it any less of a valid point.

I think that the right way to deal with it is to only append the data,
not write the whole file.
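
Roughly along these lines (the names below are just made up for illustration, not the actual ASPECT code):

    #include <fstream>
    #include <string>

    // Sketch of the append-only pattern: open the file in append mode and
    // write just the newly completed row, rather than rewriting the whole
    // file every timestep.
    void append_statistics_row(const std::string &filename,
                               const std::string &row)
    {
      std::ofstream statistics_file(filename, std::ios::app);
      statistics_file << row << '\n';
      // The destructor flushes and closes; only ~row.size() bytes hit the disk.
    }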

Best
W.

I see that the statistics file gets written using TableHandler::write_text() from deal.II. Is the right thing to do to make the change in deal.II? TableHandler::write_text() could take additional arguments to determine whether to append and, if so, starting from which row the appended output needs to be written.
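
Purely as an illustration of what I mean (the class and method below do not exist; deal.II’s actual TableHandler::write_text() takes only a stream and a text format):

    #include <deal.II/base/table_handler.h>
    #include <ostream>

    // Hypothetical extension: emit only the rows starting at 'first_row',
    // so a caller can open the statistics file with std::ios::app and
    // append just the rows produced since the last write.
    class AppendableTableHandler : public dealii::TableHandler
    {
    public:
      void write_text_from_row(std::ostream      &out,
                               const unsigned int first_row) const;
    };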

This is the follow-up from our sysadmin. The bottom line is that we should just append to the statistics file.

root@nas-5-0:~/log# cat nas-5-0-11.log | grep rudolph | grep /export/1/rudolph/ […] /statistics | grep 1048576 | wc -l
94996

So 95,000 1 MB writes; granted, that’s only 26 MB/sec, but it’s a very unfriendly load. And that’s only
for that single hour: 95 GB written to keep the most recent 50 MB.

From my report:
base dir             Operations   IOPs     % of IOPs   Bandwidth    % of bandwidth
/export/6/redacted   608943       174.98   72.84%      2952249      2.76%
/export/1/rudolph    150316       43.19    17.98%      97018428     90.76%

So you had 17% of the IOPs and 90.76% of the bandwidth. We are of course talking to [redacted] as
well; we have a tweak for them that will reduce their IOP load by a factor of 500.

Something simple like a seek would reduce your workload by a factor of over 2000, and that’s just for a single hour.

But rewriting the same file 2000 times per hour and wasting 95 GB of I/O, while invalidating caching
by deleting, opening, and closing the file, is a substantial load to achieve basically nothing.

The problem is that you are keeping numerous NFSd busy with this continuous but useless workload
(throwing away 99.95% of the writes), and then you get starved because all the NFSd are busy: when
you type ls on the head node there’s no NFSd to talk to.

It’s something that it never occurred to me would be a problem. So in
that time, we wrote an average of about 250kB once every other second.

We are talking about 26 MB per second, some 210 times more than your 250 kB every other second. Yes,
250 kB/sec is trivial… if sequential. Not so much if it’s 512-byte writes; that’s 488 random IOs/sec,
which will completely saturate 5 disks or so. Your homedir is on 4 disks.

Parallel file systems have write performance in the range of 100s of
MB/s today, so what we write sounds like a rather benign fraction.

I’m all for a parallel file system; so far nobody has wanted to pay for one. Our filesystem can do
100s of MB/sec today, but not with horribly inefficient patterns. For instance:

root@nas-5-0:~/log# cat nas-5-0-11.log | grep rudolph | grep 1048576 | grep 1582570302 | wc -l
48

So during a single random second I picked your code did 48 1MB writes to a single file, while of
course handling all the rest of the I/O of the cluster. But that 48MB/sec was just so that your
code would append 281 bytes to the end of a file.

So I’m having a cognitive disconnect between what I think is true and
what appears to be a real problem – which usually indicates a learning
opportunity for myself. Can you (or your sysadmin) clarify how that
accounted for 90% of the bandwidth?

Well, this is only a problem because peloton is a big system and you are one of many users, one of
whom is generating quite a few IOPs doing similarly, incredibly inefficient I/O. Their workload is
writing 100 bytes to a file, but to 100s of files per second. My suggested tweak is just to
add a python | buffer -s 512k, which will reduce their I/O load by a factor of 500. As bad as
that is, it’s a factor of 10 more efficient than what you are doing.

We seem to have a factor of 200 or so between your mentioned trivial load and what we are sustaining
just to keep one small file up to date, while keeping only the last of 5,500 rewrites of the same file.

So 250 kB every other second is great; 200x that for hours, with nothing to show for it, not so much.

Or was it that we write those bytes in many small write requests instead
of one big data blob? That should be relatively easy to fix by writing
into an ostringstream first, and the outputting everything as one long
string.

To be several thousand times more efficient, instead of rewriting a 50 MB file to write 281
bytes, I’d recommend:

  1. Buffer 64 kB and append it to the file (a single 64 kB write instead of 50 1 MB writes, 233 times over).
    That would be 177,000 times more efficient!

  2. Buffer 1 MB to a file, so after a run there would be 50 1 MB files, and write a script
    that concatenates all files into a single file when done. That way, even if the newest file gets
    corrupted, you still have the previous 49 files intact. Something like:
    #!/bin/bash
    cat statistics.* > bigstatistics

So any of seek, append, or new files would reduce the I/O by 1000s of times.
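
In code, option 1 could look roughly like this sketch (the class and names are made up for illustration, not something that exists in ASPECT or deal.II):

    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>

    // Collect statistics rows in memory and append them to the file in one
    // larger write once ~64 kB have accumulated (call flush() at the end of
    // the run to write out whatever is left).
    class BufferedStatisticsWriter
    {
    public:
      explicit BufferedStatisticsWriter(const std::string &filename)
        : filename(filename)
      {}

      void add_row(const std::string &row)
      {
        buffer << row << '\n';
        if (static_cast<std::size_t>(buffer.tellp()) >= 64 * 1024)
          flush();
      }

      void flush()
      {
        std::ofstream out(filename, std::ios::app);
        out << buffer.str();
        buffer.str("");           // reset the in-memory buffer
      }

    private:
      const std::string  filename;
      std::ostringstream buffer;
    };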

Yes, I think I get that by now – please say thanks to your sysadmin for
figuring out all of these numbers: that’s precisely the thing that’s
needed to pinpoint a problem nobody knew we had!

I’ll have to unravel a trip later this week to Italy that isn’t going to
happen because of the coronavirus, but I think I should be able to get
to that issue in the next couple of days. It’s been a couple of months
since I contributed anything noteworthy to ASPECT, and this might make for
a nice project :slight_smile:

Cheers
W.

Thank you Wolfgang. I would also be happy to take a crack at this if you’d rather direct your efforts elsewhere. I already did a ‘quick fix’ of adding a parameter to control the frequency with which this file gets updated.

I’ve often found that for simulations where a time step takes minutes,
you really want the output after every time step because you want to see how things progress.
So I’m going to add the ‘append’ option to ASPECT.

Best
W.

I would suggest writing the whole file after a resume, and after that only appending the last line. The question is whether it is okay to repeatedly open and close the file as well.
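
Roughly what I have in mind (the function and flag names are made up for illustration): write the complete table once after a fresh start or a resume, and afterwards reopen in append mode and write only the newest row.

    #include <deal.II/base/table_handler.h>
    #include <fstream>
    #include <string>

    void update_statistics_file(const dealii::TableHandler &statistics,
                                const std::string          &filename,
                                const std::string          &latest_row,
                                bool                       &full_rewrite_needed)
    {
      if (full_rewrite_needed)
        {
          std::ofstream out(filename);      // truncate and write the whole table once
          statistics.write_text(out);
          full_rewrite_needed = false;
        }
      else
        {
          std::ofstream out(filename, std::ios::app);
          out << latest_row << '\n';        // append just the last line
        }
    }

Reopening the file in append mode every timestep should still be far cheaper than rewriting the whole file, though one could presumably also keep the stream open on process 0 to avoid even that.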