Merging of .h5 files / parallel access

Hi all,

I've been poking around the HDF5 documentation for a few days trying to find
a solution to a problem that I've encountered, but I haven't been able to
find any articles that point me in the right direction. Currently, the tool
I'm developing will allow a small cluster of machines to contribute analysis
to a central HDF5 file. In terms of merging the results, there are two
potential options: merging separate .h5 files, or piecing together a
framework to utilize the parallel HDF5 interface.

From what I've read, there are two methods of achieving the first option:

   1. Whenever a user needs to access the data, they create a temporary
   .h5 file, mount all the .h5 files generated by each of the cluster
   machines into it, and close everything out when done (see the sketch
   after this list).
   2. Use the H5Ocopy routine to copy the results from each of the
   cluster machines into a central .h5 file.
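
To make the first method concrete, here's roughly what I had in mind (a
minimal C sketch - the file and group names are made up, and error
checking is omitted):

    #include "hdf5.h"

    int main(void)
    {
        /* Temporary "view" file that acts as the mount parent. */
        hid_t view = H5Fcreate("view.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* One mount point (group) per cluster machine. */
        hid_t grp = H5Gcreate2(view, "/node0", H5P_DEFAULT,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* Open one machine's result file and mount it at the group. */
        hid_t child = H5Fopen("node0_results.h5", H5F_ACC_RDONLY,
                              H5P_DEFAULT);
        H5Fmount(view, "/node0", child, H5P_DEFAULT);

        /* ... read through view.h5 as if node0's objects lived here ... */

        H5Funmount(view, "/node0");
        H5Fclose(child);
        H5Gclose(grp);
        H5Fclose(view);
        return 0;
    }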

The first method seems appealing since you can distribute your analysis
easily - if you want to exclude dated analysis from your dataset, simply
move it out of the "current configuration" file and it won't be mounted when
a user accesses the data. However, several members of the team use HDFView
quite a bit, so they won't be able to inspect the full dataset as easily.

I'm not sure about the second method. If you copy a group object that
contains several datasets under it, does H5Ocopy transfer all the
child objects as well?

As for the second option - is it possible to have several machines
contributing data to an HDF5 file at the same time? How does one manage
this?

Please let me know if you've stumbled across this problem at some point
before and if you have any tips and tricks to offer.

Thanks!

···

--
Stefan Novak
Sent from Greenbelt, Maryland, United States

Hi all,

I've been poking around the HDF5 documentation for a few days trying
to find a solution to a problem that I've encountered, but I haven't
been able to find any articles that point me in the right direction.
Currently, the tool I'm developing will allow a small cluster of
machines to contribute analysis to a central HDF5 file. In terms of
merging the results, there are two potential options: merging
separate .h5 files, or piecing together a framework to utilize the
parallel HDF5 interface.

I am confused here by what you mean by merge. On the one hand you say a
cluster of machines 'contributes analysis to a central HDF5 file' (one
file). But in the next sentence you talk about 'merging separate .h5
files'.

By 'merge' here, do you mean that from an input of several datasets from
different machines you create a single, larger dataset? Or, do you only
mean that from an input of many files with many datasets, you create a
single file with many datasets? If the latter, I am not sure what either
mounting the many datasets from many files or copying the many datasets
from many files into a single file buys you that you can't already do
by, for example, putting them all in the same dir on a shared
filesystem. At any rate, I am not sure how parallel HDF5 would help in
this regard unless you intend to read many datasets into memory, merge
them and write one large dataset back out, in parallel.

From what I've read, there are two methods of achieving the first
option:
     1. Whenever a user needs to access the data, they create a
        temporary .h5 file, mount all the .h5 files generated by
        each of the cluster machines into it, and close everything
        out when done.
     2. Use the H5Ocopy routine to copy the results from each of
        the cluster machines into a central .h5 file.
The first method seems appealing since you can distribute your
analysis easily - if you want to exclude dated analysis from your
dataset, simply move it out of the "current configuration" file and it
won't be mounted when a user accesses the data. However, several
members of the team use HDFView quite a bit, so they won't be able to
inspect the full dataset as easily.

I'm not sure about the second method. If you copy a group object
that contains several datasets under it, does H5Ocopy transfer all
the child objects as well?

My read of the documentation for H5Pset_copy_object is that
H5O_COPY_SHALLOW_HIERARCHY_FLAG controls this behavior. The default is
to recurse on 'all objects below' a group. So, my read is that yes, it
will copy the child objects as well.
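
For example, something along these lines should pull a whole group tree
across (a hedged sketch - the file and group names are invented, and
error checks are omitted):

    #include "hdf5.h"

    /* Copy "/results" (and, by default, everything below it) from one
     * machine's file into the central file. */
    void copy_results(const char *src_file, const char *dst_name)
    {
        hid_t src     = H5Fopen(src_file, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t central = H5Fopen("central.h5", H5F_ACC_RDWR, H5P_DEFAULT);

        /* The default object-copy property list gives a deep (recursive)
         * copy; setting H5O_COPY_SHALLOW_HIERARCHY_FLAG here would copy
         * only the group's immediate members instead. */
        hid_t ocpypl = H5Pcreate(H5P_OBJECT_COPY);

        H5Ocopy(src, "/results", central, dst_name, ocpypl, H5P_DEFAULT);

        H5Pclose(ocpypl);
        H5Fclose(central);
        H5Fclose(src);
    }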

As for the second option - is it possible to have several machines
contributing data to an HDF5 file at the same time? How does one
manage this?

Concurrent I/O to the same file via HDF5 is going to REQUIRE the use of
HDF5's parallel driver, and it brings with it various limitations
(collectivity in metadata-changing operations such as H5Dcreate). If
multiple serial HDF5 clients open the SAME HDF5 file and try to write
to it, that'll create problems.
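
The skeleton looks roughly like this (a sketch only - it assumes an
HDF5 library built with parallel support and an MPI environment):

    #include "hdf5.h"
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* File access property list that routes I/O through MPI-IO. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* Every rank opens the same file collectively. */
        hid_t file = H5Fcreate("central.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, fapl);

        /* Metadata-changing calls such as H5Dcreate are collective:
         * all ranks must make them with the same arguments. */
        hsize_t dims[1] = { 1000 };
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* ... each rank then selects its own hyperslab and writes ... */

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }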

···

On Thu, 2009-08-06 at 06:48, Stefan Novak wrote:

Please let me know if you've stumbled across this problem at some
point before and if you have any tips and tricks to offer.

Thanks!



--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

Hi all,

I've been poking around the HDF5 documentation for a few days trying
to find a solution to a problem that I've encountered, but I haven't
been able to find any articles that point me in the right direction.
Currently, the tool I'm developing will allow a small cluster of
machines to contribute analysis to a central HDF5 file. In terms of
merging the results, there are two potential options: merging
separate .h5 files, or piecing together a framework to utilize the
parallel HDF5 interface.

I am confused here by what you mean by merge. On the one hand you say a
cluster of machines 'contributes analysis to a central HDF5 file' (one
file). But in the next sentence you talk about 'merging separate .h5
files'.

By 'merge' here, do you mean that from an input of several datasets from
different machines you create a single, larger dataset? Or, do you only
mean that from an input of many files with many datasets, you create a
single file with many datasets? If the latter, I am not sure what either
mounting the many datasets from many files or copying the many datasets
from many files into a single file buys you that you can't already do
by, for example, putting them all in the same dir on a shared
filesystem. At any rate, I am not sure how parallel HDF5 would help in
this regard unless you intend to read many datasets into memory, merge
them and write one large dataset back out, in parallel.

From what I've read, there are two methods of achieving the first
option:
    1. Whenever a user needs to access the data, they create a
       temporary .h5 file, mount all the .h5 files generated by
       each of the cluster machines into it, and close everything
       out when done.
    2. Use the H5Ocopy routine to copy the results from each of
       the cluster machines into a central .h5 file.

  There's a third option also - use the "external links" feature to create a central file with links to objects stored in other HDF5 files. The latest version of HDFView should handle files with external links correctly, even. :-)
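
  Creating the links is one call per object, along these lines (a small sketch - the file and object names are made up):

    #include "hdf5.h"

    int main(void)
    {
        hid_t master = H5Fcreate("master.h5", H5F_ACC_TRUNC,
                                 H5P_DEFAULT, H5P_DEFAULT);

        /* "/node0" in master.h5 now refers to "/results" inside
         * node0_results.h5; the target file is opened on access. */
        H5Lcreate_external("node0_results.h5", "/results",
                           master, "/node0", H5P_DEFAULT, H5P_DEFAULT);

        H5Fclose(master);
        return 0;
    }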

The first method seems appealing since you can distribute your
analysis easily - if you want to exclude dated analysis from your
dataset, simply move it out of the "current configuration" file and it
won't be mounted when a user accesses the data. However, several
members of the team use HDFView quite a bit, so they won't be able to
inspect the full dataset as easily.

I'm not sure about the second method. If you copy a group object
that contains several datasets under it, does H5Ocopy transfer all
the child objects as well?

My read of the documentation for H5Pset_copy_object is that
H5O_COPY_SHALLOW_HIERARCHY_FLAG controls this behavior. The default is
to recurse on 'all objects below' a group. So, my read is that yes, it
will copy the child objects as well.

  Yup.

As for the second option - is it possible to have several machines
contributing data to an HDF5 file at the same time? How does one
manage this?

Concurrent I/O to the same file via HDF5 is going to REQUIRE the use of
HDF5's parallel driver, and it brings with it various limitations
(collectivity in metadata-changing operations such as H5Dcreate). If
multiple serial HDF5 clients open the SAME HDF5 file and try to write
to it, that'll create problems.

  Yes, this is definitely true for writing, currently. We are moving toward allowing single-writer/multiple-reader (SWMR) access to a file, which is nicer, but still doesn't allow multiple writer processes to update a file concurrently.

  Quincey

···

On Aug 6, 2009, at 1:56 PM, Mark Miller wrote:

On Thu, 2009-08-06 at 06:48, Stefan Novak wrote:

Please let me know if you've stumbled across this problem at some
point before and if you have any tips and tricks to offer.

Thanks!



