Scattered read and write

Good morning,

I'm working with a parallel Fortran program for unstructured-grid fluid flow modeling. In this program each processor writes data from a contiguous buffer into irregularly scattered locations in the file. For that I use the subroutine h5sselect_elements_f(space_id, operator, rank, num_elements, coord, hdferr) to specify my write pattern, followed by the standard write routine. Unfortunately, this approach is very slow. I have the same problem with scattered reading.
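Schematically, the write looks like this (a simplified, untested sketch rather than my actual code; the file name, dataset name, sizes and the coordinate pattern are placeholders for illustration only):

  program scattered_write_sketch
    use hdf5
    use mpi
    implicit none
    integer, parameter :: npts = 1000            ! elements written by this rank
    integer(hsize_t)   :: dims(1), memdims(1)
    integer(hsize_t)   :: coord(1, npts)         ! target coordinates, rank x npts
    double precision   :: buf(npts)              ! contiguous buffer in memory
    integer(hid_t)     :: fapl, dxpl, file_id, dset_id, filespace, memspace
    integer            :: hdferr, mpierr, myrank, i

    call MPI_Init(mpierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myrank, mpierr)
    call h5open_f(hdferr)

    ! open the file with the MPI-IO driver
    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
    call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
    call h5fcreate_f('scattered.h5', H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl)

    ! 1-D dataset large enough to hold every rank's scattered elements
    dims(1) = 1000000
    call h5screate_simple_f(1, dims, filespace, hdferr)
    call h5dcreate_f(file_id, 'results', H5T_NATIVE_DOUBLE, filespace, dset_id, hdferr)

    ! irregularly scattered target locations (placeholder pattern, disjoint per rank)
    do i = 1, npts
      coord(1, i) = myrank * npts * 17 + (i - 1) * 17 + 1
      buf(i) = dble(i)
    end do

    ! select the scattered elements in the file dataspace
    call h5sselect_elements_f(filespace, H5S_SELECT_SET_F, 1, int(npts, size_t), coord, hdferr)

    ! the memory-side dataspace is just the contiguous buffer
    memdims(1) = npts
    call h5screate_simple_f(1, memdims, memspace, hdferr)

    ! standard (collective) write of the selection
    call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, hdferr)
    call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, hdferr)
    call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, memdims, hdferr, &
                    mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl)

    call h5pclose_f(dxpl, hdferr)
    call h5sclose_f(memspace, hdferr)
    call h5sclose_f(filespace, hdferr)
    call h5dclose_f(dset_id, hdferr)
    call h5pclose_f(fapl, hdferr)
    call h5fclose_f(file_id, hdferr)
    call h5close_f(hdferr)
    call MPI_Finalize(mpierr)
  end program scattered_write_sketch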

Could you please tell me if there is another way to read/write scattered data?

Regards,

Mokhles

···


Do you know where your program is spending the bulk of its time? Is
it spending a lot of time in HDF5 processing the dataset, or is it
spending a lot of time writing to or reading from the file system?

I know answering that question is not entirely straightforward. If
you had a library that would report time spent in HDF5 calls and time
spent in MPI-IO calls, that would tell you where you should spend your
tuning efforts.

If you can put together a small self-contained test program that
demonstrates this slow I/O performance, that would be pretty helpful.

==rob

···

On Tue, May 19, 2009 at 04:04:29PM +0300, Mezghani, Mokhles B wrote:


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Good morning Rob,

The bulk of the time is spent in the reading/writing phase. You will find attached a small Fortran program that writes scattered data. You can use this program to reproduce the problem. Please let me know if you need any additional information or examples.

Regards,

Mokhles

Main.f90 (4.09 KB)

···

-----Original Message-----
From: Rob Latham [mailto:robl@mcs.anl.gov]
Sent: Tuesday, May 19, 2009 6:20 PM
To: Mezghani, Mokhles B
Cc: hdf-forum@hdfgroup.org
Subject: Re: [hdf-forum] Scattered read and write


What I want to determine is whether the overhead is in the MPI-IO layer
or in the HDF5 layer.

Thank you very much for the testcase. It's exactly what I hoped you'd
send. I can confirm that the code you sent is slow. Dirt slow.
Roughly 1 MB per 10 minutes -- I had to cut down the number of points
to 100k just so it would finish in a reasonable amount of time :>

I can see that for me, HDF5 is turning a collective h5dwrite_f into N
individual MPI_File_write_at calls. I don't know anything about HDF5
internals, but you've described all the elements of the dataset you
want with h5sselect_elements_f. I would have expected HDF5 to
construct a monster datatype, feed that into MPI_File_write_at_all ...
and then send me a bug report when that doesn't work :>

I'm testing with HDF5-1.8.0.

HDF5 folks: is it possible I have an improperly-built HDF5? What I
mean is: would you expect h5sselect_elements_f to behave as I
described, making a single (or a few) calls to MPI_File_write_at_all?

==rob

···

On Wed, May 20, 2009 at 09:08:32AM +0300, Mezghani, Mokhles B wrote:


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Hi Rob,

Currently HDF5 doesn't support collective calls for point selections;
it quietly switches to independent I/O.

I hope Quincey or someone else who is involved in the parallel work will
elaborate.

Maybe you can try using hyperslab selections instead of point
selections, even if there is only one element in each hyperslab? And I
would definitely go with 1.8.3.
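
A rough, untested sketch of what I mean (the subroutine and variable
names are placeholders; it assumes a 1-D dataset and 0-based element
offsets):

  subroutine select_scattered_as_hyperslabs(filespace, npts, offsets, hdferr)
    ! Build the file selection as a union of one-element hyperslabs
    ! instead of a point selection; the memory dataspace and the
    ! h5dwrite_f call stay exactly as before.
    use hdf5
    implicit none
    integer(hid_t),   intent(in)  :: filespace      ! dataspace of the 1-D dataset
    integer,          intent(in)  :: npts           ! number of scattered elements
    integer(hsize_t), intent(in)  :: offsets(npts)  ! 0-based positions in the dataset
    integer,          intent(out) :: hdferr
    integer(hsize_t) :: start(1), count(1)
    integer          :: i

    count(1) = 1
    do i = 1, npts
      start(1) = offsets(i)
      if (i == 1) then
        call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, hdferr)
      else
        call h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, start, count, hdferr)
      end if
    end do
  end subroutine select_scattered_as_hyperslabs

Whether building the union element by element like this stays fast for
very large selections is something your test program would have to show.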

Hi Elena. I guess I'm in Quincey territory now.

For this case where a program makes a single HDF5 call, I'd like to
see HDF5 make as few MPI-IO calls as possible. Even if you don't use
collective I/O, you could still create an indexed or hindexed MPI
datatype describing all of the scattered elements and then make a
single MPI_FILE_WRITE call.
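
To make that concrete, here is roughly the pattern I have in mind at the
raw MPI-IO level (a sketch only: it ignores the HDF5 file format
entirely, and the routine name, the double-precision element type and
the 0-based element offsets are just illustrative assumptions):

  subroutine scattered_write_mpiio(filename, npts, offsets, buf, ierr)
    use mpi
    implicit none
    character(len=*), intent(in)  :: filename
    integer,          intent(in)  :: npts
    integer(kind=MPI_OFFSET_KIND), intent(in) :: offsets(npts)  ! 0-based element offsets in the file
    double precision, intent(in)  :: buf(npts)                  ! contiguous buffer in memory
    integer,          intent(out) :: ierr
    integer :: fh, filetype, dblsize, i
    integer(kind=MPI_ADDRESS_KIND) :: displs(npts)
    integer :: blocklens(npts)

    ! one datatype describing every scattered target location in the file
    ! (MPI requires file-view displacements to be monotonically nondecreasing,
    ! so the offsets are assumed to be sorted)
    call MPI_Type_size(MPI_DOUBLE_PRECISION, dblsize, ierr)
    do i = 1, npts
      blocklens(i) = 1
      displs(i) = offsets(i) * dblsize          ! byte displacement of each element
    end do
    call MPI_Type_create_hindexed(npts, blocklens, displs, MPI_DOUBLE_PRECISION, filetype, ierr)
    call MPI_Type_commit(filetype, ierr)

    call MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE + MPI_MODE_WRONLY, &
                       MPI_INFO_NULL, fh, ierr)
    call MPI_File_set_view(fh, 0_MPI_OFFSET_KIND, MPI_DOUBLE_PRECISION, filetype, &
                           'native', MPI_INFO_NULL, ierr)
    ! a single (independent) write scatters the whole contiguous buffer
    call MPI_File_write(fh, buf, npts, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
    call MPI_File_close(fh, ierr)
    call MPI_Type_free(filetype, ierr)
  end subroutine scattered_write_mpiio

MPI_File_write_all there would be the collective version, but even the
independent call above is one I/O operation per process instead of one
per element.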

It is highly likely there is something about the HDF5 file format that
I do not understand and that would preclude this approach.

I just wanted to point out that you could see tremendous performance
gains with this workload and not even have to go all the way down to
full collective I/O.

I know Quincey's next email will contain a phrase along the lines of
"as funding sources permit us to work on this", so you won't hurt my
feelings if you have to shelve this for a while :>

  Elena's right - we don't have a "fast" path in the library for performing I/O on point selections. We felt that it would only be used for 'small' numbers of elements and that it wouldn't require any special support. I'm actually very surprised that someone is using it for a selection with many elements - Elena's suggestion of using a hyperslab selection for that case would almost certainly work better.

  As you say, if someone with funding found this important, we would be happy to optimize this case. We'd also accept a well-written patch which improved the performance.

  Quincey

···

On May 20, 2009, at 2:41 PM, Rob Latham wrote:

On Wed, May 20, 2009 at 01:25:26PM -0500, Elena Pourmal wrote:

Good morning Rob,

First of all, I would like to thank the HDF community for the help and support. As I said, I really need to find a solution to this problem. I have already adopted parallel HDF5 as the file format for my application, and I'm surprised by the performance of scattered read and write. I think the overhead is in the HDF5 layer. In fact, one of my colleagues is using MPI-2 for scattered read and write and the performance is very good. If it would help, I can try to write a simple program using MPI-2.

Please, any suggestion is really appreciated.

Regards,

Mokhles

···

________________________________________
From: Rob Latham [robl@mcs.anl.gov]
Sent: Wednesday, May 20, 2009 7:57 PM
To: Mezghani, Mokhles B
Cc: hdf-forum@hdfgroup.org
Subject: Re: [hdf-forum] Scattered read and write


Hi Quincey,

The code that we are developing is for parallel flow modeling using unstructured grids. In this framework, before starting the simulation you need to partition your grid cells across the different processors. The grid partitioning is done using different criteria (minimizing MPI communication, balancing the computational load, etc.). Because of the unstructured framework, the partitions may not be contiguous, and you need to use scattered writes to output your simulation results. In our application the grid can consist of thousands of millions of cells, and the scattered writing becomes a really big issue. To be honest, I'm surprised that nobody else is using scattered read and write with the parallel HDF5 library. As I said, I will run additional tests on Saturday using the 1.8.3 release.

Thanks

Mokhles

···

________________________________________
From: Quincey Koziol [koziol@hdfgroup.org]
Sent: Thursday, May 21, 2009 12:38 AM
To: Rob Latham
Cc: Mezghani, Mokhles B; hdf-forum forum
Subject: Re: [hdf-forum] Scattered read and write


Hello Mokhles,

While it's possible that others are using HDF5 in the manner you describe (scattered read and write with parallel HDF5), none are providing us with the funding necessary to improve the performance for these scenarios. The HDF Group tries to address performance issues that are brought to our attention, but other things currently have higher priority in our self-funded work queue.

The HDF Group does offer custom development and performance tuning services, and I'd be happy to discuss rates with you (or others) if you find the current behavior is significantly hampering your progress.

-Ruth

···

------------------------------------------------------------
Ruth Aydt
Director of Sponsored Projects and Business Development
The HDF Group
1901 South First Street, Suite C-2
Champaign, IL 61820

aydt@hdfgroup.org (217)265-7837
------------------------------------------------------------

On May 21, 2009, at 4:15 AM, Mezghani, Mokhles B wrote:
