Hi Leigh,
So, I don't use pHDF5 that much and have no experience with pHDF5 at
the scales you are working with here. For the record, we use serial HDF5 and
'Poor Man's Parallel I/O'
(http://visitbugs.ornl.gov/projects/hpc-hdf5/wiki/Poor_Mans_vs_Rich_Mans_Parallel_IO)
to achieve scalable I/O.
That said, I do know of some HDF5 properties you might try fiddling with
to affect the aggregation (coalescing) of small writes. These, of course,
come at the expense of larger buffer allocations in the HDF5 library to
accommodate the aggregation...
First, try calling this...
herr_t H5Pset_small_data_block_size( hid_t fapl_id, hsize_t size )
with a large size, say 1-16 megabytes.
Also, you might try calling this...
herr_t H5Pset_meta_block_size( hid_t fapl_id, hsize_t size )
with a large size. The right value will really depend on how many
objects (e.g. datasets, groups, types, attributes, b-trees) you are creating
in a file. Again, play with the value, but maybe something on the order of
1-4 megabytes would be good to try.
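For concreteness, here is a minimal sketch of setting both properties on a
file-access property list before the file is created. The 4 MiB / 1 MiB
values (and the file name) are just starting points I made up for
illustration, not recommendations -- you'd want to experiment:

```c
/* Sketch: enlarge HDF5's small-data and metadata aggregation blocks.
 * The block sizes below are illustrative guesses to tune from. */
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    /* Aggregate "small" raw-data allocations into 4 MiB blocks. */
    if (H5Pset_small_data_block_size(fapl, 4 * 1024 * 1024) < 0)
        fprintf(stderr, "H5Pset_small_data_block_size failed\n");

    /* Aggregate metadata (object headers, b-trees, ...) into 1 MiB blocks. */
    if (H5Pset_meta_block_size(fapl, 1 * 1024 * 1024) < 0)
        fprintf(stderr, "H5Pset_meta_block_size failed\n");

    /* Pass the tuned fapl when creating (or opening) the file. */
    hid_t file = H5Fcreate("tuned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create datasets and write as usual ... */
    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}
```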
Looks like all the chunk 'caching' features are turned off when using
pHDF5, so those functions can't help you.
Are there any MPI 'hints' that your system understands that maybe need
to be passed down to MPI to affect the aggregation behavior of MPI-IO? I
can't recall specifically, but I think you pass those to HDF5 via the
MPI_Info arg in H5Pset_fapl_mpio.
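Something along these lines, assuming your MPI is ROMIO-based. The hint
names shown ("romio_cb_write", "cb_nodes") are common ROMIO
collective-buffering hints, but which hints exist and what values make
sense is system-specific -- check your site's MPI-IO docs:

```c
/* Sketch: pass MPI-IO hints to HDF5 through the MPI_Info argument
 * of H5Pset_fapl_mpio. Hint names/values are illustrative. */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Ask ROMIO to do collective buffering (write aggregation)... */
    MPI_Info_set(info, "romio_cb_write", "enable");
    /* ...aggregating onto a subset of ranks (value is a guess). */
    MPI_Info_set(info, "cb_nodes", "64");

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    hid_t file = H5Fcreate("hints.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... collective dataset writes ... */
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```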
Have you tried switching between fapl_mpio and fapl_mpiposix? The
difference is that fapl_mpio uses MPI-IO for the actual disk I/O while
fapl_mpiposix uses sec2 (POSIX) I/O routines. In theory, MPI-IO is
designed to do just the kind of I/O aggregation you need. Sometimes
taking MPI-IO out of the equation (by using fapl_mpiposix) can help to
shake out issues in the layers above, or below.
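The nice thing is the switch is just one call on the fapl, so the same
code can be built both ways for comparison. A sketch (the USE_MPIPOSIX
macro is my own invention for illustration; note the MPI-POSIX driver
exists in the 1.8.x series but was later removed from HDF5):

```c
/* Sketch: toggle between the MPI-IO and MPI-POSIX file drivers to
 * see whether MPI-IO itself is part of the problem. */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
#ifdef USE_MPIPOSIX
    /* sec2-style POSIX I/O from every rank; last arg toggles GPFS hints */
    H5Pset_fapl_mpiposix(fapl, MPI_COMM_WORLD, 0);
#else
    /* MPI-IO driver; MPI_INFO_NULL means no hints */
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
#endif
```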
Note, these are all file-access property list settings, documented here...
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html
Good luck. Very interested to hear what you learn from trying any of
these.
Mark
On Sat, 2011-03-05 at 12:56, Leigh Orf wrote:
On Sat, Mar 5, 2011 at 12:14 PM, Tjf (mobile) <tfogal@sci.utah.edu> wrote:
> I'm not up on hdf5 internals, but I can't imagine any API would effectively deal with such small writes, because the os/disks aren't going to cope with them effectively.
>
> If hdf5 can coalesce writes, try enabling that. Otherwise, forward your data to a subset of nodes for writing, such that each write is large. Generally larger is better, but I would say shoot for 16 megs per write.
As I understand from Mark & Quincey, when you write in collective mode,
MPI-IO assigns writers and collects data to those writers so that the
writes are larger, and aligns the data to the underlying file system's
stripe size (at least with Lustre, which is what I'm using). However,
the details of this are a mystery to me.
Leigh
>
> -tom
>
> Am Mar 4, 2011 um 5:03 PM schrieb Leigh Orf <leigh.orf@gmail.com>:
>
>> What is the size of a "write operation" with parallel hdf5? That
>> terminology comes up a lot on my sole source of guidance for lustre on
>> the machine I'm running on ( http://www.nics.tennessee.edu/io-tips )
>>
>> I am trying to choose ideal parameters for the lustre file system.
>>
>> I experienced abysmal performance with my first attempt at writing 1
>> file containing 3D data with 30,000 cores, and I want to choose better
>> parameters. After 11 minutes 62 GB had been written, and I killed the
>> job.
>>
>> Each 3D array that I write from a core is 435,600 bytes. I have my
>> chunk dimensions the same as my array dimension. Does that mean that
>> each core writes a chunk of data 435,600 bytes long? Would I therefore
>> wish to set my stripe size to 435,600 bytes? That is smaller than the
>> default of 1 MB.
>>
>> It seems that lustre performs best when each "write operation" is
>> large (say 32 MB) and the stripe size matches it. However our cores
>> each are writing comparatively much smaller chunks of data.
>>
>> I am going to see if the folks on the kraken machine can help me with
>> optimizing lustre, but want to understand as much as possible about
>> how pHDF5 works before I do.
>>
>> Thanks,
>>
>> Leigh
>>
>> --
>> Leigh Orf
>> Associate Professor of Atmospheric Science
>> Department of Geology and Meteorology
>> Central Michigan University
>> Currently on sabbatical at the National Center for Atmospheric
>> Research in Boulder, CO
>> NCAR office phone: (303) 497-8200
>>
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> Hdf-forum@hdfgroup.org
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511