Parallel HDF5 and independent I/O

Hi,

I'm starting to look into HDF5 for structured output files, and I have
seen a lot of slides showing that collective I/O is far better than
independent I/O.
My application is a little bit different and doesn't fit the
collective I/O pattern. I have an unstructured grid split across several
nodes, and when I write a result, I need to reorder the data so that it
follows increasing node indices (a requirement of the viewers and of the
post-processing...).
As the grid is unstructured and partitioned with ParMETIS, I can't write
the data to disk directly, so I do asynchronous gathers (using MPI
non-blocking collectives, NBC) of all the data to a subset of processes
that will actually write. Because of how the gathering is done (all ranks
fill a chunk of indices with the data they have, and the chunks are
reduced at the writing rank), I can't keep buffers for all gathers alive
at the same time, so I cycle through a list of buffers, and as soon as a
gather completes, I call a callback on the corresponding rank to write
everything out.
This works really nicely when I write a piece of data to a raw binary
file with pwrite, and even though the asynchronous gather process is not
optimized yet, I see improvements in the achieved I/O bandwidth as soon
as I use several blades (2 to 3 times better with 2 or 4 blades).
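
In case it helps, here is a minimal sketch of the scheme I described,
reduced to a single writing rank for brevity; the buffer count, chunk
size and the write_chunk callback are placeholder names, not my actual
code:

#include <mpi.h>
#include <stdlib.h>

#define NGATHERS 4          /* number of in-flight gathers / reuse buffers */
#define CHUNK    (1 << 20)  /* number of doubles per gathered chunk        */

/* placeholder: the per-rank write callback (pwrite today, HDF5 tomorrow) */
void write_chunk(int id, const double *buf, long n);

void gather_and_write(MPI_Comm comm, double *contrib[NGATHERS], int writer_rank)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Request req[NGATHERS];
    double     *result[NGATHERS];

    /* Start all gathers: each rank contributes its reordered slice of every
     * chunk, and the slices are reduced onto the writing rank. */
    for (int i = 0; i < NGATHERS; ++i) {
        result[i] = (rank == writer_rank) ? malloc(CHUNK * sizeof(double)) : NULL;
        MPI_Ireduce(contrib[i], result[i], CHUNK, MPI_DOUBLE, MPI_SUM,
                    writer_rank, comm, &req[i]);
    }

    /* As each gather completes, the writing rank writes that chunk out, so
     * communication of later chunks overlaps with I/O of earlier ones. */
    for (int done = 0; done < NGATHERS; ++done) {
        int idx;
        MPI_Waitany(NGATHERS, req, &idx, MPI_STATUS_IGNORE);
        if (rank == writer_rank) {
            write_chunk(idx, result[idx], CHUNK);
            free(result[idx]);
        }
    }
}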

Now I'd like to do the same with HDF5. Each time a gather completes, I
will get a new chunk together with its offset in the file. If the chunks
are properly organized, communication and I/O will overlap, since each
chunk lands at a different offset in the HDF5 file.

Will the HDF5 library behave properly here, i.e. will it write directly
to disk without waiting for the other processes?

I can provide the test case by private email if someone has a clue!

Regards,

Matthieu Brucher

--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Music band: http://liliejay.com/

Hi Matthieu,

You probably know this already, but since HDF5 structures your data, there
is no such thing as accessing a raw offset in the file in HDF5. You
create/open objects in the HDF5 file and access raw data in HDF5 datasets
through dataspace selections.

Now to your question: you can create/open the file using a sub-communicator
containing only the I/O processes. Then each I/O process can write its data,
in independent mode, as soon as its buffer is ready.

One thing to note is that creating the HDF5 objects that organize your
file and hold your data (groups, datasets, etc.) has to be done
collectively (i.e. all processes have to participate in those calls). If you
know that information beforehand, you can do all of it at initialization,
when you create the file. If you don't, then there has to be a
synchronization phase where all I/O processes get together and create those
objects whenever needed. Writing your raw data to the file can then be done
independently or collectively.
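
To make that concrete, here is a minimal sketch of the whole flow,
assuming a 1-D dataset of doubles; the file name, dataset name, sizes and
the io_comm sub-communicator are placeholders, and error checking is
omitted:

#include <hdf5.h>
#include <mpi.h>

void write_results(MPI_Comm io_comm,        /* sub-communicator of the I/O ranks */
                   hsize_t total, hsize_t offset, hsize_t count,
                   const double *buf)
{
    /* Open the file with the MPI-IO driver on the I/O sub-communicator. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, io_comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("results.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Object creation is collective: every rank in io_comm makes this call
     * with the same arguments, typically once at initialization. */
    hid_t filespace = H5Screate_simple(1, &total, NULL);
    hid_t dset = H5Dcreate2(file, "/pressure", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Raw-data writes can be independent: each rank writes its own slab
     * whenever its gathered buffer is ready, with no synchronization. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);

    hid_t memspace = H5Screate_simple(1, &count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Sclose(memspace); H5Sclose(filespace);
    H5Pclose(dxpl);     H5Pclose(fapl);
    H5Dclose(dset);     H5Fclose(file);
}

In your case you would create the file and datasets once at
initialization, and then issue an independent H5Dwrite like the one above
each time a gathered buffer becomes ready.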

Hope this helps,
Mohamad

Hi Mohamad,

Thanks for the tip. I already open the file in parallel and set up the
datasets collectively with the proper dimensions and sizes, so I guess at
least on that part I'm OK!
My only worry is about writing (and, in the future, reading) chunks
independently from the different ranks that opened the file. All the
slides and tutorials tell me that this is not the most efficient scheme,
and yet it is the only one I can actually use reliably.

Regards,

Matthieu

Yes, independent access is not efficient if the data you are writing is
highly noncontiguous. It might be worth doing collective I/O in that case,
even if it delays other ranks from writing to the file; but this all
depends on how loosely coupled your application is.

However, if the data you write consists of large chunks that are stored
contiguously in the file, then you should be OK, because parallel file
systems love that.
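
Note that switching between the two modes is just a property on the
dataset transfer property list, so it is cheap to try both; a small
illustrative helper (the function name is made up):

#include <hdf5.h>

/* Choose the MPI-IO transfer mode at write time; pass the returned property
 * list as the dxpl argument of H5Dwrite and close it when done. */
hid_t make_dxpl(int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);
    return dxpl;
}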

In any case, independent access should give you performance similar to
what you were getting before HDF5 with pwrite.

Thanks,
Mohamad

OK, so I guess I'm good, as I gather chunks so that they are 8 MiB (the
usual size I use on our parallel filesystem with pwrite).
Now I need to check that I'm not making stupid mistakes and not adding
useless overhead (I've seen some strange slowdowns when going from
HDF5 with one file per rank to collective HDF5 on a single shared file,
on the same FS with the same number of OSSes in use).

Thanks for the confirmation!

Regards,

Matthieu
