degraded performance with core driver over time

[also sent to help@hdfgroup.org]

I have been heavily using the core driver for serial HDF5, lately on the
Blue Waters supercomputer. I am having some perplexing performance
issues that are causing me to seriously reconsider my use of this approach,
which allows HDF5 files to be buffered in memory and flushed to disk at a
later time (avoiding hitting the filesystem frequently).
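
For context, the core driver setup amounts to something like the following (a rough sketch rather than my actual code; the 100 MB increment is the value I discuss further down):

USE HDF5
IMPLICIT NONE
INTEGER(HID_T)  :: plist_id
INTEGER(SIZE_T) :: blocksize
LOGICAL         :: backing_store
INTEGER         :: ierror

CALL h5open_f(ierror)
CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, ierror)
blocksize = 1024*1024*100      ! grow the in-memory file image in 100 MB steps
backing_store = .TRUE.         ! write the image to disk when the file is closed
CALL h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror)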

I am doing large-scale cloud modeling, saving (primarily) three-dimensional
floating point arrays at equally spaced time intervals (say, every 5
seconds of model integration). My model writes thousands of HDF5 files
concurrently, with one core on each shared-memory node tasked with making
the HDF5 calls. After, say, 50 buffered writes, the model closes the
file (backing store is on) and the files are flushed to disk. I am not
having problems with the actual flush-to-disk performance, but with the
buffered writes themselves.
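
Schematically, the cycle on each I/O core is something like this (again just a sketch, reusing plist_id from above; the real pseudocode is in the document I link to below):

INTEGER(HID_T) :: file_id
INTEGER        :: n

CALL h5fcreate_f('node0000.h5', H5F_ACC_TRUNC_F, file_id, ierror, &
                 access_prp=plist_id)     ! filename is just a placeholder
DO n = 1, 50                              ! ~50 buffered output times
   ! model integrates; the I/O core gathers this node's arrays and makes
   ! the (in-memory) group creation and H5Dwrite calls for this time level
END DO
CALL h5fclose_f(file_id, ierror)          ! backing store on: the flush to disk happens here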

Lately I have run into a perplexing problem. I have been saving data much
more frequently than usual, as I need very high temporal resolution for my
current study. What I am seeing: initially, shortly after the model starts
running, performance in the I/O section of the code (where each I/O core
writes to memory using the core driver) is very good - what you would
expect when doing I/O that is all in memory.

After the model has run for a while and done a couple of flushes to disk, I
have noticed a couple of things. First, the amount of memory being used by
HDF5 increases with time, even though it has ostensibly been freed up after
the files have been written to disk. I keep tabs on /proc/meminfo on each
node and look at things like available memory, active memory, buffered
memory utilization, etc. What I have found is that a whole lot of memory is
never completely freed up after files are written to disk.
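
(The check is nothing elaborate; just a quick scan of a few /proc/meminfo fields, along these lines:)

CHARACTER(LEN=256) :: line
INTEGER :: ios

OPEN(UNIT=77, FILE='/proc/meminfo', STATUS='OLD', ACTION='READ')
DO
   READ(77,'(A)',IOSTAT=ios) line
   IF (ios /= 0) EXIT
   IF (INDEX(line,'MemFree') == 1 .OR. INDEX(line,'Active') == 1 .OR. &
       INDEX(line,'Buffers') == 1 .OR. INDEX(line,'Cached') == 1)     &
      WRITE(*,'(A)') TRIM(line)
END DO
CLOSE(77)
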
Second, there is a huge memory overhead with the core driver. I may be
buffering, say, 4 GB worth of 3D floating point arrays to memory, but
something like 5-10 times that much memory is being used by HDF5 (the model
itself allocates all of its memory up front, so aside from, perhaps, MPI
there is no alloc/dealloc going on in the model other than what HDF5 is
doing). Even though I see some memory recovery after flushing to disk - and
the Linux kernel may be partly at fault here - I have run into OOM issues
where the model is killed because the node has run out of memory (and this
is after buffering only, say, 4 GB of writes to memory on a machine with
64 GB to play with). The only workaround I have found is to buffer a lot
less data to memory than I really want to. That is one major issue I am
having.

Now, on to the performance issue. I have written an unpublished technical
document that describes my I/O strategy; page 4 gives pseudocode for the
I/O cycle I use (see here: http://orf5.com/bw/cm1tools-March2013.pdf).
Essentially, I create a new top-level group (a zero-padded integer
representing the current model time), a subgroup (called 3D - right now
there is only one subgroup at this level), and finally, below that, a
subgroup named after the actual data being stored - there are usually 10
or so floating point arrays. Over time, of course, the number of groups
grows - but to what I think are manageable numbers; we're talking hundreds
of groups per file, not tens of thousands.
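
In terms of the actual HDF5 calls, each output time boils down to something like this (a sketch; the zero-padding width and the variable name are just for illustration):

CHARACTER(LEN=6) :: timename
INTEGER(HID_T)   :: tgroup_id, g3d_id, vargroup_id
INTEGER          :: itime

WRITE(timename,'(I6.6)') itime            ! zero-padded model time, e.g. "000300"
CALL h5gcreate_f(file_id, timename, tgroup_id, ierror)
CALL h5gcreate_f(tgroup_id, '3D', g3d_id, ierror)
CALL h5gcreate_f(g3d_id, 'thpert', vargroup_id, ierror)   ! one of ~10 arrays per time level
! ... chunked, compressed dataset written under vargroup_id (see below) ...
CALL h5gclose_f(vargroup_id, ierror)
CALL h5gclose_f(g3d_id, ierror)
CALL h5gclose_f(tgroup_id, ierror)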

Over time, the time it takes to do the buffered writes (I assume this is
happening in H5Dwrite) dramatically increases. I have not done any
profiling, but I can watch it happen, since I send unbuffered "heartbeat"
information to standard out during the model simulation. As the model
progresses, the I/O takes, say, 4-5 times longer (and remember, this is the
buffered write section! We are not actually writing to disk here!). This is
unacceptable, as time is a precious commodity on a supercomputer!
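
(The heartbeat is nothing more than wall-clock timing around that section, written to standard out; something like:)

INTEGER :: c0, c1, crate

CALL SYSTEM_CLOCK(c0, crate)
! ... buffered h5gcreate_f / h5dcreate_f / h5dwrite_f calls for this time level ...
CALL SYSTEM_CLOCK(c1)
WRITE(*,'(A,F10.3,A)') 'heartbeat: buffered write section took ', &
                       REAL(c1-c0)/REAL(crate), ' s'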

I do have chunking and gzip compression (level 1) turned on for my floating
point arrays, and I am choosing what I think are logical chunking
parameters: I simply use as the chunk dimensions the subdomain dimensions
that each core operates on. So, if I am writing an array on a node that is
160x160x100 and I have 16 cores (in a 4x4 layout), I just collect the data
to the I/O core and set the chunk dimensions to 40x40x100.
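
In property-list terms, the dataset creation looks essentially like this (a sketch continuing the group layout above, with the dimensions from that example):

INTEGER(HID_T) :: dcpl_id, space_id, dset_id
INTEGER(HSIZE_T), DIMENSION(3) :: dims  = (/160, 160, 100/)   ! array gathered on the I/O core
INTEGER(HSIZE_T), DIMENSION(3) :: cdims = (/ 40,  40, 100/)   ! per-core subdomain = chunk size
REAL :: nodearray(160,160,100)

CALL h5screate_simple_f(3, dims, space_id, ierror)
CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierror)
CALL h5pset_chunk_f(dcpl_id, 3, cdims, ierror)
CALL h5pset_deflate_f(dcpl_id, 1, ierror)                     ! gzip, level 1
CALL h5dcreate_f(vargroup_id, 'data', H5T_NATIVE_REAL, space_id, &
                 dset_id, ierror, dcpl_id)
CALL h5dwrite_f(dset_id, H5T_NATIVE_REAL, nodearray, dims, ierror)
CALL h5dclose_f(dset_id, ierror)
CALL h5sclose_f(space_id, ierror)
CALL h5pclose_f(dcpl_id, ierror)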

I am getting to the point where I am seriously considering just doing my
own buffering and not using the core driver. That way, I would allocate all
of the buffer space up front with a regular F95 ALLOCATE call, and then at
I/O time just loop through and blast everything to disk, using exactly the
same group structure that I currently use.
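
The sketch in my head is roughly the following (illustrative dimensions, matching the example above):

REAL, ALLOCATABLE :: iobuf(:,:,:,:)      ! nx, ny, nz, buffered time levels
INTEGER :: itlev

ALLOCATE(iobuf(160,160,100,50))          ! one allocation, up front

! each output time: copy the gathered node array into iobuf(:,:,:,next_level)

! at I/O time, with the file opened using the default (sec2) driver:
DO itlev = 1, 50
   ! same h5gcreate_f / h5dcreate_f / h5dwrite_f sequence as above,
   ! but writing iobuf(:,:,:,itlev) straight to disk
END DO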

Before I go to the trouble of doing this, however, I want to see whether
there is a way around my problems (one that doesn't involve exhaustive
profiling; I just don't have time to figure out how to profile HDF5 right
now), or at least get some confidence that my problems won't continue once
I stop using the core driver. The core driver is really neat and I like it,
but these weird memory-bloat issues, and now strange performance problems
that are a function of how long the model has been running, have me wanting
to try something else.

FYI, I have tried h5garbage_collect_f with no discernible change in
performance. Also, here is how I set up the core driver:

blocksize = 1024*1024*100   ! core driver memory increment: 100 MB
CALL h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror)

I have played with different block sizes (the increment argument to
h5pset_fapl_core_f) before. Because I am using compression, and because
historically I have not always saved the same number of time levels to each
file, I am never quite sure how large my data will be, so I have chosen a
block size of 100 MB, which seems like a good compromise between too large
and too small. But I really don't completely understand the function of
this setting, and perhaps it has something to do with the memory issues
(which is another reason why I am leaning towards doing my own buffering,
since this is not something that needs to be set with the standard driver).
I did notice that if I chose a much larger block size, I ended up with huge
amounts of padding tacked onto the end of the written HDF5 file.

Leigh

--
Leigh Orf
Chair, Department of Earth and Atmospheric Sciences
Central Michigan University