H5FD_MPIO_INDEPENDENT vs H5FD_MPIO_COLLECTIVE

Hi,

I was wondering if someone could explain what goes on under the hood with
independent vs. collective I/O with parallel hdf5. Specific questions I
have:

With independent I/O, does each I/O rank open, write, close, and hand the file off
to the next I/O rank, so that only one rank has access to the file at a given
time (no concurrency)?

With collective I/O, are I/O ranks writing concurrently to one file? If so,
can you control the number of concurrent accesses to a single file?

I have found that with collective I/O, only a small subset of writers is
actually writing concurrently (far fewer than the total number of ranks) for
tens of thousands of cores. What controls this number? Also, how is data
collected to the I/O ranks? MPI_GATHER? It seems you could run the risk of
running out of memory if you are collecting large 3D arrays onto only a few
ranks on a distributed-memory machine.

I ask these questions because, contrary to what I have been told should work,
I cannot get even marginally decent performance out of collective I/O on
Lustre for large numbers of cores (30k cores writing to one file), and need
to try new approaches. I am hoping that parallel HDF5 can still be of use to
me, rather than having to do my own MPI calls to collect and write, or just
doing tried-and-true one file per core.

Thanks,

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

With independent I/O, does each I/O rank open, write, close, and hand the file off
to the next I/O rank, so that only one rank has access to the file at a given
time (no concurrency)?

No, there's concurrency at the HDF5 level. Sometimes too much
concurrency...

With collective I/O, are I/O ranks writing concurrently to one file? If so,
can you control the number of concurrent accesses to a single file?

HDF5 passes the collective request down to the MPI-IO library. It's
the underlying MPI-IO library (often, but not always, ROMIO) that selects
how many concurrent readers/writers you can have.
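
For reference, here's a minimal sketch (standard parallel HDF5 C API, not
code from this thread) of where that hand-off happens: the dataset transfer
property list is what tells HDF5 to issue an independent or a collective
MPI-IO call.

#include <hdf5.h>
#include <mpi.h>

/* Sketch only: dset, memspace, filespace, and buf are assumed to be set
 * up elsewhere with a parallel (MPI-IO) file access property list. */
void write_slab(hid_t dset, hid_t memspace, hid_t filespace, const double *buf)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

    /* H5FD_MPIO_COLLECTIVE: every rank calls H5Dwrite together and the
     * MPI-IO layer decides how many aggregators actually touch the file.
     * H5FD_MPIO_INDEPENDENT: each rank issues its own uncoordinated write. */
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
    H5Pclose(dxpl);
}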

I have found that with collective I/O, only a small subset of writers is
actually writing concurrently (far fewer than the total number of ranks) for
tens of thousands of cores. What controls this number? Also, how is data
collected to the I/O ranks? MPI_GATHER? It seems you could run the risk of
running out of memory if you are collecting large 3D arrays onto only a few
ranks on a distributed-memory machine.

What platform are you on? ROMIO will select one processor per compute
node as an "I/O aggregator". So if you somehow have 30k cores on a
single machine, all the I/O goes through one MPI process (by default).

If you want to change that, you can set the hint "cb_config_list".
The full syntax is kind of weird, but you can set it to "*:2" or "*:4"
or however many processes per node you want to use as aggregators.

"cb_nodes" is a higher-level hint that just says "pick N of these". N
is by defualt the number of nodes (not processes), but you can select
lower or higher and ROMIO, in consultation with cb_config_list, will
pick that many.
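
In case it helps, a sketch of passing those hints through HDF5's file
access property list (the hint names are ROMIO's; the values "*:4" and
"64" are only placeholders, not recommendations):

#include <hdf5.h>
#include <mpi.h>

hid_t create_with_hints(const char *filename)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_config_list", "*:4"); /* 4 aggregators per node */
    MPI_Info_set(info, "cb_nodes", "64");        /* cap the total number of aggregators */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}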

I ask these questions because, contrary to what I have been told should work,
I cannot get even marginally decent performance out of collective I/O on
Lustre for large numbers of cores (30k cores writing to one file), and need
to try new approaches. I am hoping that parallel HDF5 can still be of use to
me, rather than having to do my own MPI calls to collect and write, or just
doing tried-and-true one file per core.

Lustre is kind of a pain in the neck with regard to concurrent I/O.
Please let me know the platform and MPI implementation you are using,
and I'll tell you what you need to do to get good performance out of
it.
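
(Not platform-specific advice -- that depends on your answer above -- but
for completeness, the Lustre striping hints that usually matter can be
passed the same way. They only take effect when the file is created, and
whether the MPI-IO driver honors them is implementation-dependent, so
treat the values below as placeholders.)

#include <mpi.h>

/* Placeholder values; the right stripe count and size depend on the
 * filesystem and the MPI-IO (ROMIO) Lustre driver in use. */
MPI_Info lustre_hints(void)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");      /* number of OSTs to stripe over */
    MPI_Info_set(info, "striping_unit",   "4194304"); /* 4 MiB stripe size */
    return info; /* pass to H5Pset_fapl_mpio() as in the sketch above */
}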

==rob

···

On Wed, Mar 30, 2011 at 11:50:48AM -0600, Leigh Orf wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA