Problems with the core driver and memory utilization

A major part of my I/O strategy for massively parallel supercomputers (such
as the new Blue Waters Cray XE6 machine) is doing buffered file writes. It
turns out that our cloud model only takes up a small fraction of the
available memory on a node, so we can buffer dozens of files in memory
before we have to hit the file system, dramatically reducing the wallclock
time spent on I/O.

I am getting some strange behavior with the core driver, however. On some
machines and with some compilers, it works great. One problem that I am
having consistently on Blue Waters using the Cray compilers is that the
amount of memory being chewed up at every h5dwrite is way, way larger than
the actual size of the data arrays being written. Because I have limited
access to the machine right now, I have not tested it with other compilers.

Specific example of odd behavior:

First, here is how the data is stored in each file. The output below only
covers two time levels (there are many more in the file). Note: the group
00000 is for time = 0 seconds, the group 00020 is for time = 20 seconds,
etc.

h2ologin1:% h5ls -rv cm1out.00000_000000.cm1hdf5 | grep 3d

/00000/3d Group
/00000/3d/dbz Dataset {250/250, 60/60, 60/60}
/00000/3d/dissten Dataset {250/250, 60/60, 60/60}
/00000/3d/khh Dataset {250/250, 60/60, 60/60}
[...]
/00020/3d Group
/00020/3d/dbz Dataset {250/250, 60/60, 60/60}
/00020/3d/dissten Dataset {250/250, 60/60, 60/60}
/00020/3d/khh Dataset {250/250, 60/60, 60/60}
[...]

and so on.

Data is gathered to one of the cores on the 16-core shared-memory module, so
only one core per module is buffering to memory and writing to disk. Time
groups are created, data is written, groups are closed, new groups are
created, and so on. This process goes on until I decide we've used up enough
memory, at which point I close the final groups and then the file with a call
to h5fclose. Backing store is on, so when the file is closed, its contents are
flushed to disk. As I understand it, once this is done, all the memory that
the in-memory file occupied should be freed.
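
For concreteness, here is a minimal sketch of that write pattern for a single
field in a single time level (hypothetical subroutine and variable names,
error checking omitted; the actual model code loops over all 41 fields and
many time levels before the final h5fclose_f flushes everything to disk):

subroutine write_one_field(file_id, timename, dbz)
  use hdf5
  implicit none
  integer(hid_t), intent(in)   :: file_id
  character(len=*), intent(in) :: timename     ! e.g. '00020'
  real, intent(in)             :: dbz(:,:,:)   ! one of the 41 3D fields
  integer(hid_t)               :: tgrp_id, grp_id, space_id, dset_id
  integer(hsize_t)             :: dims(3)
  integer                      :: ierror

  dims = shape(dbz)                                       ! e.g. (/ 250, 60, 60 /)
  call h5gcreate_f(file_id, timename, tgrp_id, ierror)    ! /00020
  call h5gcreate_f(tgrp_id, '3d', grp_id, ierror)         ! /00020/3d
  call h5screate_simple_f(3, dims, space_id, ierror)
  call h5dcreate_f(grp_id, 'dbz', H5T_NATIVE_REAL, space_id, dset_id, ierror)
  call h5dwrite_f(dset_id, H5T_NATIVE_REAL, dbz, dims, ierror)   ! the call that eats memory
  call h5dclose_f(dset_id, ierror)
  call h5sclose_f(space_id, ierror)
  call h5gclose_f(grp_id, ierror)
  call h5gclose_f(tgrp_id, ierror)
end subroutine write_one_field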

The problem: in a recent simulation, I wrote 41 3D fields per time level.
That means each time level should take up the following number of bytes:

250*60*60*41*4 = 147,600,000 bytes, or roughly 150 MB.

As part of my code, I query /proc/meminfo (these machines run Linux) on each
node to see how much memory is being used and how much is available, and I
output the values after each buffer to memory. I keep track of a quantity I
call global_free, which is MemFree + Buffers + Cached, and do an MPI_REDUCE
that picks the smallest value across nodes (realizing there will be small
variations in the memory available on each node; the results would be nearly
identical if I just calculated this on any given node).
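
The bookkeeping looks roughly like the following sketch (simplified, with a
hypothetical unit number and variable names; default integers are fine here
since the values are kB on 64 GB nodes):

subroutine report_global_free(comm, myrank)
  use mpi
  implicit none
  integer, intent(in) :: comm, myrank
  character(len=256)  :: line
  integer :: memfree, buffers, cached, local_free, global_free
  integer :: ios, ierr

  memfree = 0; buffers = 0; cached = 0
  open(unit=27, file='/proc/meminfo', status='old', action='read')
  do
     read(27, '(a)', iostat=ios) line
     if (ios /= 0) exit
     if (index(line, 'MemFree:') == 1) read(line(9:), *) memfree   ! kB
     if (index(line, 'Buffers:') == 1) read(line(9:), *) buffers   ! kB
     if (index(line, 'Cached:')  == 1) read(line(8:), *) cached    ! kB
  end do
  close(27)

  local_free = memfree + buffers + cached
  ! smallest free+buffers+cached across all nodes ends up on rank 0
  call MPI_REDUCE(local_free, global_free, 1, MPI_INTEGER, MPI_MIN, 0, comm, ierr)
  if (myrank == 0) print *, myrank, 'global_free = ', global_free
end subroutine report_global_free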

With no compression and no chunking, I see the following values of
global_free after each buffered write which, remember, should only be
consuming around 150 MB each:

0 global_free = 60268020
0 global_free = 57186776
0 global_free = 53716128
0 global_free = 51117500
0 global_free = 48013960
0 global_free = 44306108

etc. etc.

Those values are in kB - so, for instance, we went from 60.2 GB to 57.1 GB
(chewed up about 3GB) after writing 150 MB of data!

I do not see this behavior on all machines, and I'm not sure it's an HDF5
bug (it could be a Cray bug, and we have submitted a bug report with Cray).
But because I have seen flakiness with the core driver beyond this example,
and there is precious little documentation on it, I wanted to ask whether
anyone had any ideas on how to troubleshoot this problem. Note that this is
with HDF5 1.8.8, which is the latest version installed on the Blue Waters
machine.

Note that once the file is flushed to disk, its size is exactly what it
should be based upon the size of the arrays, and the data is exactly what it
should be.

Finally, when I comment out only the h5dwrite call in the 3D write
subroutine and leave everything else the same, memory usage is essentially
flat, meaning it's not a memory leak on my part. I've experimented with and
without chunking, and with and without compression. Turning gzip
compression on (with chunking, of course) seems to take up a little less
memory per buffered write, but still far more than it should.
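
For reference, the chunked/gzip variant is set up along these lines (a sketch
only; the chunk dimensions shown are illustrative, not necessarily what the
model actually uses):

subroutine create_compressed_dset(grp_id, space_id, dset_id)
  use hdf5
  implicit none
  integer(hid_t), intent(in)  :: grp_id, space_id
  integer(hid_t), intent(out) :: dset_id
  integer(hid_t)   :: dcpl_id
  integer(hsize_t) :: chunkdims(3)
  integer          :: ierror

  chunkdims = (/ 250, 60, 60 /)                 ! illustrative: one chunk per field
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierror)
  call h5pset_chunk_f(dcpl_id, 3, chunkdims, ierror)
  call h5pset_deflate_f(dcpl_id, 1, ierror)     ! gzip level 1
  call h5dcreate_f(grp_id, 'dbz', H5T_NATIVE_REAL, space_id, dset_id, ierror, dcpl_id)
  call h5pclose_f(dcpl_id, ierror)
end subroutine create_compressed_dset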

Here is how I am initializing the files:

backing_store = .true.
blocksize = 4096
call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, ierror); check_err(ierror)
call h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror); check_err(ierror)
call h5fcreate_f(trim(filename), H5F_ACC_TRUNC_F, file_3d_id, ierror, access_prp=plist_id); check_err(ierror)
call h5pclose_f(plist_id, ierror); check_err(ierror)

I am not calling h5pset_alignment and cannot recall why I chose 4096 bytes
for the memory increment size.
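
If I wanted to test a larger increment, the initialization would look like
the following (same declarations as above; 64 MB is just an arbitrary test
value, and I am not claiming this has any bearing on the problem):

backing_store = .true.
blocksize = 64*1024*1024    ! grow the in-core file 64 MB at a time instead of 4 kB
call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, ierror); check_err(ierror)
call h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror); check_err(ierror)
call h5fcreate_f(trim(filename), H5F_ACC_TRUNC_F, file_3d_id, ierror, access_prp=plist_id); check_err(ierror)
call h5pclose_f(plist_id, ierror); check_err(ierror)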

Thanks for any pointers.

Leigh

Hi Leigh,
  Sorry for the additional delay, I'm a little swamped with some contractual stuff and SC-related issues today. I'll get something back to you tomorrow.

  Quincey
