This is the follow-up from our sysadmin. The bottom line is that we should just append to the statistics file.
root@nas-5-0:~/log# cat nas-5-0-11.log | grep rudolph | grep /export/1/rudolph/ […] /statistics | grep 1048576 | wc -l
94996
So 95,000 1MB writes. Granted, that’s only 26MB/sec, but it’s a very unfriendly load, and that’s only
for that single hour. So 95GB written to keep the most recent 50MB.
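For anyone who wants to check the arithmetic, here is a rough sketch in Python. The write count comes from the wc -l output above; the one-hour window and treating 1GB as 1000MB are simplifying assumptions, not something taken from the log.

```python
# Figures from the log excerpt above. The one-hour window and 1 GB = 1000 MB
# are simplifying assumptions for this estimate.
writes_per_hour = 94996   # number of 1 MB writes counted by wc -l
write_size_mb = 1         # each logged write is 1048576 bytes
file_size_mb = 50         # size of the statistics file that is actually kept
seconds = 3600            # ASSUMPTION: the log covers exactly one hour

total_mb = writes_per_hour * write_size_mb
rate_mb_s = total_mb / seconds            # average write bandwidth
rewrites = total_mb / file_size_mb        # how many times the file was rewritten
useful_fraction = file_size_mb / total_mb # share of written data that survives

print(f"total written: {total_mb / 1000:.0f} GB")
print(f"average rate:  {rate_mb_s:.1f} MB/s")
print(f"full rewrites: {rewrites:.0f} per hour")
print(f"data kept:     {useful_fraction:.2%}")
```

The last figure is where the "throwing away 99.95% of the writes" number comes from: only the final 50MB of the roughly 95GB written is still on disk at the end of the hour.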
From my report:
base dir             Operations    IOPs  % of IOPs  Bandwidth  % of bandwidth
/export/6/redacted       608943  174.98     72.84%    2952249           2.76%
/export/1/rudolph        150316   43.19     17.98%   97018428          90.76%
So you had about 18% of the IOPs and 90.76% of the bandwidth. We are of course talking to [redacted] as
well; we have a tweak for them that will reduce their IOP load by a factor of 500.
Something simple like a seek would reduce your workload by a factor of over 2000, and that’s for a single hour.
But rewriting the same file 2000 times per hour and wasting 95GB of I/O, while invalidating caching
by deleting, opening, and closing the file, is a substantial load to achieve basically nothing.
The problem is you are keeping numerous NFSd threads busy with this continuous but useless workload
(throwing away 99.95% of the writes). Then you get starved because all the NFSd threads are busy, and when
you type ls on the head node there’s no NFSd thread to talk to.
It’s something that has never occurred to me would be a problem. So in
that time, we wrote an average of about 250kB once every other second.
We are talking about 26MB per second, some 210 times more than your 250kB every other second. Yes, even 250kB/sec
is trivial… if sequential. Not so much if it’s 512-byte writes; that’s 488 random IO/sec, which
will completely saturate 5 disks or so. Your homedir is on 4 disks.
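A quick sketch of where the 488 figure comes from. The per-disk budget of roughly 100 random IO/sec is an assumption added here for illustration (a common rule of thumb for a spinning disk, and what "5 disks or so" implies); it is not a number measured on this system.

```python
# 250 kB/s split into 512-byte writes, as described above.
bytes_per_second = 250 * 1000   # 250 kB/s
write_size = 512                # bytes per write in the worst case
iops_per_disk = 100             # ASSUMPTION: rough random-IO budget of one spinning disk

random_iops = bytes_per_second / write_size
disks_saturated = random_iops / iops_per_disk

print(f"random IO/sec:   {random_iops:.0f}")
print(f"disks saturated: {disks_saturated:.1f}")
```

The same bandwidth delivered as a handful of large sequential writes would cost only a few IO/sec, which is the whole point of the "if sequential" caveat.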
Parallel file systems have write performance in the range of 100s of
MB/s today, so what we write sounds like a rather benign fraction.
I’m all for a parallel file system; so far nobody has wanted to pay for one. Our filesystem can do
100s of MB/sec today, but not with horribly inefficient patterns. For instance:
root@nas-5-0:~/log# cat nas-5-0-11.log | grep rudolph | grep 1048576 | grep 1582570302 | wc -l
48
So during a single random second I picked, your code did 48 1MB writes to a single file, while the
server was of course handling all the rest of the I/O of the cluster. But that 48MB/sec was just so that your
code would append 281 bytes to the end of a file.
So I’m having a cognitive disconnect between what I think is true and
what appears to be a real problem – which usually indicates a learning
opportunity for myself. Can you (or your sysadmin) clarify how that
accounted for 90% of the bandwidth?
Well, this is only a problem because peloton is a big system and you are one of many users, one of
whom is generating quite a few IOPs doing similarly inefficient I/O. Their workload is
writing 100 bytes to a file, but writing to 100s of files per second. My suggested tweak is just to
add a python | buffer -s 512k, which will reduce their I/O load by a factor of 500. As bad as
that is, it’s a factor of 10 more efficient than what you are doing.
We seem to have a factor of 200 or so between your mentioned trivial load and what we are sustaining
just to keep one small file up to date, while keeping only the last of 5500 rewrites of the same file.
So 250kB every other second is great; 200x that for hours with nothing to show for it, not so much.
Or was it that we write those bytes in many small write requests instead
of one big data blob? That should be relatively easy to fix by writing
into an ostringstream first, and then outputting everything as one long
string.
To be several 1000 times more efficient, instead of rewriting a 50MB file to write 281
bytes I’d recommend one of:
- buffer 64KB and append to a file (a single 64KB write instead of 50 1MB writes 233 times).
  That would be 177,000 times more efficient!
- buffer 1MB to a file, so after a run there would be 50 1MB files, and write a script
  that combines all the files into a single file when done. That way, even if the newest file gets
  corrupted, you still have the previous 49 files intact. Something like:
#!/bin/bash
# Concatenate all chunk files into one; the shell expands the glob in lexicographic order.
cat statistics.* > bigstatistics
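One caveat with a plain glob is ordering: statistics.10 sorts before statistics.2 unless the suffixes are zero-padded. Below is a minimal sketch of the same merge in Python that sorts by numeric suffix. The file naming scheme (statistics.<N>), the output name, and the merge_chunks helper are all hypothetical, chosen only for illustration.

```python
import os
import tempfile

def merge_chunks(directory, prefix="statistics.", output="bigstatistics"):
    """Concatenate files named <prefix><N> in numeric order into one output file.

    The naming scheme is an assumption for this sketch, not the real layout.
    """
    names = [n for n in os.listdir(directory)
             if n.startswith(prefix) and n[len(prefix):].isdigit()]
    # Sort by the integer suffix so that chunk 10 comes after chunk 2.
    names.sort(key=lambda n: int(n[len(prefix):]))
    out_path = os.path.join(directory, output)
    with open(out_path, "wb") as out:
        for name in names:
            with open(os.path.join(directory, name), "rb") as chunk:
                out.write(chunk.read())
    return out_path

# Small demonstration with three tiny stand-in chunk files.
with tempfile.TemporaryDirectory() as d:
    for i in (1, 2, 10):
        with open(os.path.join(d, f"statistics.{i}"), "wb") as f:
            f.write(f"chunk {i}\n".encode())
    with open(merge_chunks(d), "rb") as f:
        merged = f.read()

print(merged.decode(), end="")
```

Each chunk is written once and never touched again, so a crash can at worst damage the chunk that was being written at the time.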
So any of seek, append, or new files would reduce the I/O by 1000s of times.
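The buffer-and-append option could look roughly like the sketch below. It is written in Python for brevity; the BufferedAppender class, the 64KB threshold, and the file name are illustrative assumptions, not the actual application code (which, judging by the ostringstream mention, is C++).

```python
import os
import tempfile

class BufferedAppender:
    """Collect small records in memory and append them to a file in large writes."""

    def __init__(self, path, threshold=64 * 1024):
        self.path = path
        self.threshold = threshold   # flush once this many bytes are pending
        self.pending = bytearray()
        self.flushes = 0             # counts actual writes, for illustration only

    def write(self, record: bytes):
        self.pending += record
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # "ab" opens in append mode, so existing contents are never rewritten.
        with open(self.path, "ab") as f:
            f.write(self.pending)
        self.pending.clear()
        self.flushes += 1

# Demonstration: 1000 records of 281 bytes each.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "statistics")
    appender = BufferedAppender(path)
    record = b"x" * 280 + b"\n"
    for _ in range(1000):
        appender.write(record)
    appender.flush()                 # write whatever is left at the end of the run
    size = os.path.getsize(path)
    flushes = appender.flushes

print(f"{size} bytes written in {flushes} writes")
```

The trade-off is that records still sitting in the buffer are lost if the job dies before the final flush, so the threshold is a choice between I/O load and how much recent data you can afford to lose.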