This is somewhat related to my earlier queries about pHDF5 performance
on a lustre filesystem, but different enough to merit a new thread.
The cloud model I am using outputs two main types of 3D data:
snapshots and restart files.
The snapshot files are used for post-processing, visualization,
analysis, etc., and I am using pHDF5 to write a few (maybe even one)
files containing data from 30,000 ranks. I am still working on testing this.
However, restart files (sometimes called checkpoint files), which
contain the minimum amount of data required to start the model up from
a particular state in time, are disposable; hence my main concern with
restart files is that they be written and read as quickly as possible.
I don't care what format they're in.
Because of issues with ghost zones and other factors that make arrays
overlap in model space, it is much simpler to have each rank write its
own restart file than to try to merge them together using pHDF5. I
started down the pHDF5 route and decided it wasn't worth the effort.
Currently, the model defaults to one file per rank, but uses
unformatted Fortran writes, i.e. a plain write statement per array,
where each array is 3D (some 1D arrays are written as well).
With the current configuration I have, each restart file is approximately 12 MB.
After reading through what literature I could find on Lustre, I
decided that I would write no more than 3,000 files at one time, with
3,000 files per unique directory, and that I would set the stripe
count (the number of OSTs to stripe over) to 1. I set the stripe size
to 32 MB, which, in retrospect, was probably not an ideal choice given
that each restart file is only about 12 MB.
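For reference, Lustre striping is typically set per directory with `lfs setstripe`, so the settings above amount to something like this (the directory name is hypothetical):

```shell
# Stripe each file over a single OST (-c 1) with a 32 MB stripe size (-S 32m).
# Files subsequently created under restart_dir inherit these settings.
lfs setstripe -c 1 -S 32m restart_dir
```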
With this configuration, I wrote 353 GB of data, spanning 30,000
files, in about 6 minutes, for an effective write bandwidth of
1.09 GB/s. A second try got better performance for no obvious reason:
2.8 GB/s. However, this is still much lower than the ~10 GB/s that
http://www.nics.tennessee.edu/io-tips gives as the maximum
(presumably aligned) expected performance.
I am assuming I am getting less-than-optimal results primarily because
the writes are unaligned. This brings me to my question: should I
bother to rewrite the checkpoint writing/reading code using HDF5 in
order to increase performance? I understand that with pHDF5 and
collective I/O, writes are automatically aligned, presumably because
it can detect the stripe size on the Lustre filesystem (is this true?).
With serial HDF5, I see there is an H5Pset_alignment call. I also
assume that with serial HDF5 I would need to set the alignment
manually, since it defaults to unaligned writes. Would I benefit from
using H5Pset_alignment to match the stripe size on the Lustre filesystem?
My arrays are roughly 28x21x330, with some slight variation. Sixteen
of these 4-byte floating-point arrays are written, giving
approximately 12 MB per restart file.
So, as a rough guess, I am thinking of trying the following:
1. Set the stripe size to 4 MB (4194304 bytes).
2. Try something like:
     H5Pset_alignment(fapl, 1000, 4194304)
   (I didn't set the second argument to 0 because I really don't want
   to align the 4-byte integers etc. that comprise some of the restart
   data to 4 MB boundaries.)
3. Chunk in Z only, so my chunk dimensions would be something like
   28x21x30 (it's never been clear to me what chunk size to pick for
   best performance).
4. Keep the other parameters the same (stripe count of 1, and 3,000
   files per unique directory).
I guess what I'm mostly looking for is assurance that I will get
faster I/O going down this route than with my current unformatted I/O.
Thanks as always,
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric
Research in Boulder, CO
NCAR office phone: (303) 497-8200