collective I/O disabled?

Hi

We've got a workload here where we collectively write a bunch of
double-precision floating-point values in parallel to a checkpoint
file. In the checkpoint file the data is stored as double precision. I
can profile HDF5 and see that the collective I/O optimizations are
kicking in and we are doing great.

This workload also writes out a single precision file, and while these
writes are also collective, the trace strongly suggests that HDF5 is
not actually writing the data collectively.

What are the collective I/O constraints? I think I remember seeing
someone discuss that recently, but am not able to find the message.
Will HDF5, for example, see that the memory type and file type are
different and then take a non-collective path?

Thanks
==rob

···

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
A215 0178 EA2D B059 8CDF  B29D F333 664A 4280 315B

Hi Rob,

A quick sanity check: are all tasks in the MPI communicator
participating in the single-precision write? That is, are they all
making hyperslab selections and H5Dwrite calls? Even if a task isn't
writing to the dataset, it still needs to make an empty hyperslab
selection and participate in the H5Dwrite call. This stumped me a
while back, and actually, you may have been the one who enlightened me
about the empty selections, so apologies if this is stale advice.
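
For reference, here is a rough sketch of the pattern I mean (illustrative
names, assuming a 1-D dataset and a collective transfer property list; not
taken from your code):

#include "hdf5.h"

/* Every rank calls H5Dwrite collectively, even a rank with nothing to write
 * (pass a valid pointer for buf even when count is 0). */
void write_my_part(hid_t dset, hid_t filespace,
                   hsize_t offset, hsize_t count, const float *buf)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* request collective I/O */

    hsize_t dims1 = (count > 0) ? count : 1;
    hid_t memspace = H5Screate_simple(1, &dims1, NULL);

    if (count > 0) {
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                            &count, NULL);
    } else {
        /* nothing to contribute: empty selections, but still call H5Dwrite */
        H5Sselect_none(memspace);
        H5Sselect_none(filespace);
    }

    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    H5Sclose(memspace);
    H5Pclose(dxpl);
}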

Mark

···

On Wed, Sep 9, 2009 at 11:04 AM, Robert Latham <robl@mcs.anl.gov> wrote:

Hi

We've got a workload here where we collectively write a bunch of
double-precision floating-point values in parallel to a checkpoint
file. In the checkpoint file the data is stored as double precision. I
can profile HDF5 and see that the collective I/O optimizations are
kicking in and we are doing great.

This workload also writes out a single precision file, and while these
writes are also collective, the trace strongly suggests that HDF5 is
not actually writing the data collectively.

What are the collective I/O constraints? I think I remember seeing
someone discuss that recently, but am not able to find the message.
Will HDF5, for example, see that the memory type and file type are
different and then take a non-collective path?

Thanks
==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
A215 0178 EA2D B059 8CDF  B29D F333 664A 4280 315B


Hi Mark. Good suggestion, but all processes are indeed part of the
MPI communicator.

In fact, we can narrow things down a bit: if we write this
single-precision data to the dataset as double precision instead, even
though we write 2x more data, the writes are much faster, and I can
confirm with traces that collective I/O does kick in.

It does look to me like HDF5 takes the independent I/O path when the
memory and file types are different, but I haven't been able to find
where in the code that decision is made (or, better still, how I might
be able to convince HDF5 otherwise :> )

==rob

···

On Wed, Sep 09, 2009 at 11:10:27AM -0700, Mark Howison wrote:

Hi Rob,

A quick sanity check: are all tasks in the MPI communicator
participating in the single-precision write? That is, are they all
making hyperslab selections and H5Dwrite calls? Even if a task isn't
writing to the dataset, it still needs to make an empty hyperslab
selection and participate in the H5Dwrite call. This stumped me a
while back, and actually, you may have been the one who enlightened me
about the empty selections, so apologies if this is stale advice.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
A215 0178 EA2D B059 8CDF  B29D F333 664A 4280 315B

Hi Rob,

···

On Sep 9, 2009, at 1:39 PM, Robert Latham wrote:

On Wed, Sep 09, 2009 at 11:10:27AM -0700, Mark Howison wrote:

Hi Rob,

A quick sanity check: are all tasks in the MPI communicator
participating in the single-precision write? That is, are they all
making hyperslab selections and H5Dwrite calls? Even if a task isn't
writing to the dataset, it still needs to make an empty hyperslab
selection and participate in the H5Dwrite call. This stumped me a
while back, and actually, you may have been the one who enlightened me
about the empty selections, so apologies if this is stale advice.

Hi Mark. Good suggestion, but all processes are indeed part of the
MPI communicator.

In fact, we can narrow things down a bit: if we write this
single-precision data to the dataset as double precision instead, even
though we write 2x more data, the writes are much faster, and I can
confirm with traces that collective I/O does kick in.

It does look to me like HDF5 takes the independent I/O path when the
memory and file types are different, but I haven't been able to find
where in the code that decision is made (or, better still, how I might
be able to convince HDF5 otherwise :> )

  Yes, you are correct - HDF5 will perform independent I/O when the file and memory datatypes require conversion. The H5D_ioinfo_init() routine determines whether to use the H5D_select_read/write routines (which can be collective), or to use the H5D_scatgath_read/write routines (which are always independent).
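
  (As an aside for anyone reading this later: HDF5 1.8.10 and newer, i.e. releases after this thread, expose two transfer-property queries that let the application confirm this directly instead of relying on traces. A minimal sketch, using those newer calls:)

#include <stdio.h>
#include <stdint.h>
#include "hdf5.h"

/* After a collective H5Dwrite, ask the transfer property list what actually
 * happened and, if collective I/O was broken, why. */
void report_io_mode(hid_t dxpl)
{
    H5D_mpio_actual_io_mode_t mode;
    uint32_t local_cause, global_cause;

    H5Pget_mpio_actual_io_mode(dxpl, &mode);
    H5Pget_mpio_no_collective_cause(dxpl, &local_cause, &global_cause);

    if (mode == H5D_MPIO_NO_COLLECTIVE &&
        (global_cause & H5D_MPIO_DATATYPE_CONVERSION))
        fprintf(stderr, "collective I/O was broken by a datatype conversion\n");
}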

  Quincey

In this case I'm seeing a lot of 512 KB writes, but I'd like to see
much bigger (at least 4 MB) writes. Do I want to adjust the
"vec_size" property, the "tconv_buf" property, or something else
entirely?

Can I override HDF5's choice here and force it to take a collective
I/O path?

Thanks
==rob

···

On Wed, Sep 09, 2009 at 01:54:58PM -0500, Quincey Koziol wrote:

  Yes, you are correct - HDF5 will perform independent I/O when the file
and memory datatypes require conversion. The H5D_ioinfo_init() routine
determines whether to use the H5D_select_read/write routines (which can
be collective), or to use the H5D_scatgath_read/write routines (which are
always independent).

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
A215 0178 EA2D B059 8CDF  B29D F333 664A 4280 315B

Hi Rob,

  Yes, you are correct - HDF5 will perform independent I/O when the file
and memory datatypes require conversion. The H5D_ioinfo_init() routine
determines whether to use the H5D_select_read/write routines (which can
be collective), or to use the H5D_scatgath_read/write routines (which are
always independent).

In this case I'm seeing a lot of 512 KB writes, but I'd like to see
much bigger (at least 4 MB) writes. Do I want to adjust the
"vec_size" property, the "tconv_buf" property, or something else
entirely?

  You probably want to change the conversion buffer size with H5Pset_buffer(), unless you are generating very complex, disjoint selections, in which case changing the vector size with H5Pset_hyper_vector_size() might help. (Both of these are only applicable to independent I/O, BTW)
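
  A minimal sketch of both knobs on the transfer property list (the sizes below are placeholders, not tuned values):

#include "hdf5.h"

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_buffer(dxpl, 4 * 1024 * 1024, NULL, NULL);  /* 4 MB conversion buffer */
H5Pset_hyper_vector_size(dxpl, 4096);              /* more I/O vectors than the default of 1024 */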

Can I override HDF5's choice here and force it to take a collective
I/O path?

  Unfortunately no, since there's a type conversion that needs to occur. I suppose it's possible that each process's portion of the dataset is small enough that it could type convert into a buffer and then perform a collective I/O operation, but that seems somewhat unusual and we don't attempt to detect that currently.
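
  One thing the application can do on its own side, though, is the conversion itself: convert into a single-precision buffer before calling H5Dwrite, so the memory and file types match and the write stays eligible for the collective path. A sketch with illustrative names (not anything HDF5 will do for you):

#include <stdlib.h>
#include "hdf5.h"

/* Convert to float in our own buffer so the memory type matches the
 * single-precision file type; HDF5 then has no conversion left to do. */
herr_t write_as_float(hid_t dset, hid_t memspace, hid_t filespace,
                      hid_t coll_dxpl, const double *dbuf, size_t nelems)
{
    float *fbuf = malloc(nelems * sizeof(float));
    for (size_t i = 0; i < nelems; i++)
        fbuf[i] = (float)dbuf[i];

    herr_t ret = H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace,
                          coll_dxpl, fbuf);
    free(fbuf);
    return ret;
}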

  Quincey

···

On Sep 9, 2009, at 3:02 PM, Robert Latham wrote:

On Wed, Sep 09, 2009 at 01:54:58PM -0500, Quincey Koziol wrote:

In this case I'm seeing a lot of 512 KB writes, but I'd like to see
much bigger (at least 4 MB) writes. Do I want to adjust the
"vec_size" property, the "tconv_buf" property, or something else
entirely?

  You probably want to change the conversion buffer size with
H5Pset_buffer(), unless you are generating very complex, disjoint
selections, in which case changing the vector size with
H5Pset_hyper_vector_size() might help. (Both of these are only
applicable to independent I/O, BTW)

Thanks for the pointers. I think this data type falls under "complex,
disjoint" when viewed on a per-process basis. The collective I/O
optimizations are so important that even though writing out doubles
means twice as much data, the writes go much, much faster. We'll see
what the tuning parameters can do.

Can I override HDF5's choice here and force it to take a collective
I/O path?

Unfortunately no, since there's a type conversion that needs to
occur. I suppose it's possible that each process's portion of the
dataset is small enough that it could type convert into a buffer and
then perform a collective I/O operation, but that seems somewhat
unusual and we don't attempt to detect that currently.

We solve this in pnetcdf ... with lots of buffer copies :>

If the user-requested datatype is non-contiguous and we are doing a
write, we MPI_Pack() that type into a buffer, then MPI_Unpack() that
buffer into a contiguous buffer. After that, we type-convert into a
third buffer and use that third buffer for the (collective) MPI write.

Reading is the reverse: collective read into a buffer, type convert,
MPI_Pack() into a buffer, MPI_Unpack() into the user's desired type.
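
Roughly, the write path looks like this (a simplified sketch with
illustrative names, not the actual pnetcdf code; error handling omitted):

#include <stdlib.h>
#include <mpi.h>

/* Assumes `usertype` describes `nelems` doubles scattered through `userbuf`. */
void convert_and_write(MPI_File fh, MPI_Offset file_offset,
                       const void *userbuf, MPI_Datatype usertype,
                       int nelems, MPI_Comm comm)
{
    int packsize, pos = 0;
    MPI_Pack_size(1, usertype, comm, &packsize);

    /* 1. flatten the non-contiguous user type into a packed buffer */
    void *packed = malloc(packsize);
    MPI_Pack(userbuf, 1, usertype, packed, packsize, &pos, comm);

    /* 2. unpack into a contiguous array of the element type */
    double *contig = malloc(nelems * sizeof(double));
    pos = 0;
    MPI_Unpack(packed, packsize, &pos, contig, nelems, MPI_DOUBLE, comm);

    /* 3. type-convert into a third buffer in the file's element type */
    float *converted = malloc(nelems * sizeof(float));
    for (int i = 0; i < nelems; i++)
        converted[i] = (float)contig[i];

    /* 4. collective write of the converted, contiguous buffer */
    MPI_File_write_at_all(fh, file_offset, converted, nelems, MPI_FLOAT,
                          MPI_STATUS_IGNORE);

    free(converted);
    free(contig);
    free(packed);
}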

Earlier this year we worked on something a little more sophisticated
and a bit higher performing: Rob Ross extracted the MPI type
processing into a standalone library that things such as
parallel-netcdf, HDF5 and ROMIO can use to rip through MPI types more
efficiently and apply transformations on the data.

http://www.mcs.anl.gov/~robl/papers/ross_datatype_lib.pdf

Now, I'm not suggesting we start re-working HDF5 type processing
tomorrow, but when the time comes, we've got options.

==rob

···

On Wed, Sep 09, 2009 at 04:27:41PM -0500, Quincey Koziol wrote:

On Sep 9, 2009, at 3:02 PM, Robert Latham wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Rob,

In this case I'm seeing a lot of 512 KB writes, but I'd like to see
much bigger (at least 4 MB) writes. Do I want to adjust the
"vec_size" property, the "tconv_buf" property, or something else
entirely?

  You probably want to change the conversion buffer size with
H5Pset_buffer(), unless you are generating very complex, disjoint
selections, in which case changing the vector size with
H5Pset_hyper_vector_size() might help. (Both of these are only
applicable to independent I/O, BTW)

Thanks for the pointers. I think this data type falls under "complex,
disjoint" when viewed on a per-process basis. The collective I/O
optimizations are so important that even though writing out doubles
means twice as much data, the writes go much, much faster. We'll see
what the tuning parameters can do.

  OK, let me know if you find out anything interesting.

Can I override HDF5's choice here and force it to take a collective
I/O path?

Unfortunately no, since there's a type conversion that needs to
occur. I suppose it's possible that each process's portion of the
dataset is small enough that it could type convert into a buffer and
then perform a collective I/O operation, but that seems somewhat
unusual and we don't attempt to detect that currently.

We solve this in pnetcdf ... with lots of buffer copies :>

If the user-requested datatype is non-contiguous and we are doing a
write, we MPI_Pack() that type into a buffer, then MPI_Unpack() that
buffer into a contiguous buffer. After that, we type-convert into a
third buffer and use that third buffer for the (collective) MPI write.

Reading is the reverse: collective read into a buffer, type convert,
MPI_Pack() into a buffer, MPI_Unpack() into the user's desired type.

  Yup, that's similar to what I was thinking. :-)

Earlier this year we worked on something a little more sophisticated
and a bit higher performing: Rob Ross extracted the MPI type
processing into a standalone library that things such as
parallel-netcdf, HDF5 and ROMIO can use to rip through MPI types more
efficiently and apply transformations on the data.

http://www.mcs.anl.gov/~robl/papers/ross_datatype_lib.pdf

Now, I'm not suggesting we start re-working HDF5 type processing
tomorrow, but when the time comes, we've got options.

  Ah, very nice. :-) Yes, that'll be useful if/when we want to go this direction, thanks!

  Quincey

···

On Sep 10, 2009, at 8:18 AM, Rob Latham wrote:

On Wed, Sep 09, 2009 at 04:27:41PM -0500, Quincey Koziol wrote:

On Sep 9, 2009, at 3:02 PM, Robert Latham wrote: