Performance of PHDF5 writes on a subset of MPI ranks

One more question about HDF5 parallel dataset creation: here
<http://www.hdfgroup.org/hdf5-quest.html#par-nodata> two alternatives are
presented for writing data from a subset of MPI ranks.
I will be writing to a Lustre filesystem and was wondering about the
performance differences between the two techniques. My data is ordered in a
3D array, and the domain decomposition is done over the 2nd and 3rd indices
(Fortran code), so each MPI rank holds a 3D array and the processors are
laid out as a 2D grid. I want to write out 2D planar slices of data with
high frequency. If the slice is taken at j=const or k=const (where j and k
are the 2nd and 3rd indices respectively, i.e. the indices along which the
domain is decomposed), then only a relatively small fraction of the MPI
ranks, but still O(10-100) of them, needs to write data; in general less
than 50% but more than 10% of the ranks will be involved in the write.
Should I do it as a collective write, where I guess there is some overhead
as the MPI ranks are cycled through in a predetermined order, or should I
hammer the Lustre filesystem with uncoordinated, concurrent individual
writes? Any insight here is appreciated.
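For concreteness, here is roughly what the first alternative from that FAQ entry looks like (every rank participates in one collective write; ranks with nothing to contribute select nothing in both dataspaces). This is only a sketch in the HDF5 C API; the Fortran API has one-to-one equivalents, and the file name "slice.h5", dataset name "slice", the sizes, and the have_data test are all made-up placeholders rather than anything from my actual code.

#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every rank opens the file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("slice.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global 2D slice; sizes here are made up. */
    hsize_t dims[2] = {256, 256};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "slice", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Placeholder test for "this rank owns part of the j=const plane":
       pretend the first 4 ranks each own a 256 x 64 block. */
    int have_data = (rank < 4);

    hid_t memspace;
    double dummy = 0.0;
    double *buf = &dummy;   /* valid pointer even for ranks that write nothing */
    if (have_data) {
        hsize_t count[2]  = {256, 64};                    /* local block (made up) */
        hsize_t offset[2] = {0, 64 * (hsize_t)rank};      /* position in the file  */
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
        memspace = H5Screate_simple(2, count, NULL);
        buf = calloc(count[0] * count[1], sizeof(double));
        /* ... fill buf with the plane owned by this rank ... */
    } else {
        /* Ranks with no data still take part in the collective call,
           but select nothing in both the file and memory dataspaces. */
        H5Sselect_none(filespace);
        memspace = H5Screate_simple(2, dims, NULL);
        H5Sselect_none(memspace);
    }

    /* Collective transfer: every rank of the file's communicator calls H5Dwrite. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    if (have_data) free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

The point of the empty selections is that the non-writing ranks still make the (collective) H5Dwrite call without contributing any bytes.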

Many thanks,
Izaak Beekman


===================================
(301)244-9367
UMD-CP Visiting Graduate Student
Aerospace Engineering
ibeekman@umiacs.umd.edu
ibeekman@umd.edu

Hi Izaak,


On Aug 12, 2011, at 11:32 AM, Izaak Beekman wrote:

[... original question quoted in full above; trimmed ...]

  I'm guessing that collective I/O will be a win still, but you may have to experiment a little...

    Quincey
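For experimenting, the second alternative from that FAQ entry (only the ranks that own part of the slice touch the file, through a sub-communicator) can be sketched as below, in the C API again. The helper name write_slice_subcomm, the file name "slice_sub.h5", and the layout arguments are all placeholders for illustration; toggling the transfer property between collective and independent should make it straightforward to time the variants against each other on Lustre.

#include <mpi.h>
#include <hdf5.h>

/* Alternative 2 (sketch): only ranks that own part of the slice open the
   file, through a communicator split off from MPI_COMM_WORLD. */
static void write_slice_subcomm(int rank, int have_data,
                                const double *buf, hsize_t nrows, hsize_t ncols)
{
    MPI_Comm io_comm;
    MPI_Comm_split(MPI_COMM_WORLD, have_data ? 0 : MPI_UNDEFINED, rank, &io_comm);
    if (!have_data)                  /* io_comm is MPI_COMM_NULL on these ranks */
        return;

    int io_rank, nwriters;
    MPI_Comm_rank(io_comm, &io_rank);
    MPI_Comm_size(io_comm, &nwriters);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, io_comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("slice_sub.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global slice assumed to be nrows x (ncols * nwriters); placeholder layout. */
    hsize_t dims[2] = {nrows, ncols * (hsize_t)nwriters};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "slice", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t count[2]  = {nrows, ncols};
    hsize_t offset[2] = {0, ncols * (hsize_t)io_rank};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Switch between collective and independent transfers to compare. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* or H5FD_MPIO_INDEPENDENT */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Comm_free(&io_comm);
}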