Parallel dataset resizing strategies

Hi,

I am relatively new to HDF5 and parallel HDF5, and although I have experience with MPI, it is not extensive. We are exploring ways of saving data in parallel using HDF5 in a field where it has been practically unknown up to now.

Our paradigm is "parallel modular event processing:"

  * A typical job processes many "events."
  * An event contains all of the interesting data (raw and processed)
    associated with some time interval.
  * Each event can be processed independently of all other events.
  * Each event's data can be subdivided into internal components, "data
    products."
  * "Modules" are processing subunits which read or generate one or more
    data products for each event.
  * One can calculate a data dependency graph specifying the allowed
    ordering and/or parallelism of modules processing one or more events
    simultaneously for a given job configuration and event structure.

We have been using h5py with HDF5 and OpenMPI to explore different strategies for parallel I/O in a future parallel event-processing framework. One of the approaches we have come up with so far is to have one HDF5 dataset per unique data product / writer module combination, keeping track of the relevant sections of each dataset via (for now) an external database.

This works well in serial tests, but in parallel tests we are running up against the constraint that dataset resizing is a collective operation, meaning that all ranks, including non-writers, have to become aware of and duplicate the dataset resizing operations required by other writers. The problem seems to get even worse if two or more instances of a module might need to extend and write to the same dataset at the same time (while processing different events, say), since they will have to coordinate and agree on the new size of the dataset and their respective sections thereof.
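For concreteness, here is a stripped-down sketch of the pattern we have been testing with h5py's MPI driver (the file, dataset, and size names are purely illustrative). It shows the constraint I mean: every rank has to take part in creating and resizing the dataset, even though only one of them actually writes:

    # Illustrative sketch only: requires h5py built against parallel HDF5,
    # plus mpi4py. Run with e.g. `mpirun -n 4 python resize_sketch.py`.
    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    with h5py.File("events.h5", "w", driver="mpio", comm=comm) as f:
        # Dataset creation is collective: every rank must make this call.
        dset = f.create_dataset("product_A", shape=(0,), maxshape=(None,),
                                dtype="f8", chunks=(1024,))

        # Suppose only rank 0 has new data for this product...
        n_new = 1000 if rank == 0 else 0

        # ...but resizing is collective too, so all ranks must agree on the
        # new total size and call resize() with the same arguments.
        total_new = comm.allreduce(n_new, op=MPI.SUM)
        old_size = dset.shape[0]
        dset.resize((old_size + total_new,))

        # The write itself can be independent; only the owning rank writes.
        if rank == 0:
            dset[old_size:old_size + n_new] = np.arange(n_new, dtype="f8")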

Are we misunderstanding the problem, or is it really this hard? Has anyone else hit upon a reasonable strategy for handling this or something like it?

Any pointers appreciated.

Thanks,

Chris Green.

···

--
Chris Green <greenc@fnal.gov>, FNAL CS/SCD/ADSS/SSI/TAC;
'phone (630) 840-2167; Skype: chris.h.green;
IM: greenc@jabber.fnal.gov, chissgreen (AIM),
chris.h.green (Google Talk).

If you can move to HDF5 1.10, I would recommend writing independent files for each MPI rank, and then creating a master file (built independently, perhaps by rank 0) with Virtual Datasets linking in the data from each rank in the format you need. Virtual Datasets can be created with file-matching patterns for dynamically growing datasets, so you might look into using that feature.

I found this approach much faster than creating a collective file (~5-10x speedup on a Lustre filesystem). You don’t need to do any collective reads or writes, and I think we could even bypass using parallel HDF5 altogether. Note that this will only work if you only ever need to open the Virtual Dataset in parallel (i.e. from more than one process) as non-collective read-only. If you need read-write access to the master file, you can’t access a Virtual Dataset using collective operations. You can, however, have as many processes as you like read from a virtual dataset in a file opened read-only.
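To give a rough idea of what the master-file step can look like from h5py (this assumes an h5py built against HDF5 1.10 with Virtual Dataset support; all file and dataset names below are made up), rank 0 might stitch the per-rank files together along these lines:

    # Hypothetical sketch: rank 0 builds a master file whose "all_events"
    # dataset is a Virtual Dataset mapping onto files written independently
    # by each rank. The source files need not exist when the mapping is made.
    import h5py

    n_ranks = 4
    rows_per_rank = 1000          # assumed known here to keep the sketch short
    ncols = 16

    layout = h5py.VirtualLayout(shape=(n_ranks * rows_per_rank, ncols),
                                dtype="f8")
    for r in range(n_ranks):
        src = h5py.VirtualSource("rank%d.h5" % r, "data",
                                 shape=(rows_per_rank, ncols))
        layout[r * rows_per_rank:(r + 1) * rows_per_rank, :] = src

    with h5py.File("master.h5", "w") as f:
        f.create_virtual_dataset("all_events", layout, fillvalue=0)

(The file-matching-pattern feature mentioned above covers the case where the per-rank datasets grow dynamically; the fixed rows_per_rank here is just for brevity.)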

If you have other tools that use your data but can’t move to HDF5 1.10, you can h5repack a file with Virtual Datasets to remove the Virtual Datasets, and the result should be compatible with HDF5 1.8 (use h5repack from HDF5 1.10 patch 1 or later). This also worked well for us, and I was able to load a repacked file in IDL under a 1.8 HDF5 library. However, h5repack is not a parallel application, so it can be slow to repack a very large file, on the order of minutes per GB.

Jarom

···


Hi,

Thanks for this; comments on your points below.

We don't have existing tools relying on a particular version, so we are nominally free to move to HDF5 1.10.x. However, it won't be completely straightforward, because I have so far been relying on the Homebrew version of HDF5, which is currently 1.8.16. I'd have to tweak the recipe to use 1.10.x, but that is not a showstopper.

After having thought a little more about likely parallel models, I think now we can arrange that:

  * Only one rank will write to a particular dataset.
  * A dataset will not be read from in the same job in which it was written.
  * A dataset may be read by one or more ranks.

I *think* if that's the case, we could use a hierarchical multi-file format without resorting to virtual datasets, no? I still have some reading and experimenting to do, but if you have particular information that would speak to the likely success of this approach, I'd be happy to hear it.

Thanks,

Chris.

···

If you want to have multiple ranks write to the same file, you’ll need to open the file in read-write mode and use parallel HDF5, with the associated overhead and complexity of the collective calls. I think the only way to avoid that overhead is to open separate files for each rank.

If you are going to have a multi-file approach and read from files that are open in write mode by another process, you’ll need some way to get the metadata updated in the reading processes. It sounds like you might try another 1.10.x addition, single-writer/multiple-reader (SWMR). If each rank can open its own output file read-write, and all the other ranks’ files read-only, you can avoid the parallel overhead. I haven’t tried this approach, and you’ll have to be careful of race conditions and keep the file metadata correct in all the ranks, but it sounds like it might fit your parallel I/O model: https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html
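For what it’s worth, here is a minimal sketch of how that SWMR handshake looks in h5py (h5py 2.6 or later against HDF5 1.10; the file and dataset names are placeholders, and as I said I haven’t exercised this myself in your setting):

    # Hypothetical SWMR sketch. The writer appends to its own file; readers
    # in other processes/ranks poll it without any parallel HDF5 involved.
    import h5py
    import numpy as np

    # --- writer process (owns "rank0.h5") ---
    f = h5py.File("rank0.h5", "w", libver="latest")   # SWMR needs the 1.10 file format
    dset = f.create_dataset("data", shape=(0,), maxshape=(None,),
                            chunks=(1024,), dtype="f8")
    f.swmr_mode = True              # no new datasets/groups/attributes after this
    for _ in range(10):
        n = dset.shape[0]
        dset.resize((n + 100,))     # appending to a chunked dataset is allowed
        dset[n:n + 100] = np.random.random(100)
        dset.flush()                # make the appended data visible to readers

    # --- reader process (possibly another rank) ---
    r = h5py.File("rank0.h5", "r", libver="latest", swmr=True)
    rdset = r["data"]
    rdset.refresh()                 # pick up the writer's latest extent
    print(rdset.shape)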

Jarom

···


Hi,

Thanks for continuing the conversation. I believe files will be *either* read from, *or* written to, but not both simultaneously, at least in the scenario I'm working on right now. I'd like to be able to write to the same file from different ranks simultaneously, but only to different datasets. If that's not possible without propagating dataset extension operations collectively to ranks not writing to that dataset, then I will start looking at the virtual dataset solution you suggested in your first reply.

Thanks again,
Chris.

···


Even if you are using independent datasets or groups from each process, the file metadata will need to be updated collectively for the HDF5 library to work correctly. Dataset identifiers, chunk locations, group information, attributes, etc. all need to be coordinated among all processes writing to the file for the file to have the correct data.

This page has a list of all the calls that need to be used collectively for the file to be written correctly.
https://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html
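Concretely, even a one-dataset-per-rank layout in a single shared file ends up looking something like the following sketch (illustrative names only): every rank executes the metadata-changing calls, and only the raw data writes are independent.

    # Illustrative sketch: one dataset per rank in a single shared file,
    # using h5py's MPI driver. Creation (and any later resize) is collective;
    # only the data writes are independent.
    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    with h5py.File("shared.h5", "w", driver="mpio", comm=comm) as f:
        dsets = []
        for r in range(size):
            # Every rank participates in creating every dataset, including
            # the ones it will never write to.
            dsets.append(f.create_dataset("rank_%03d/data" % r,
                                          shape=(1000,), dtype="f8"))
        # The actual write is independent: each rank touches only its own dataset.
        dsets[rank][:] = np.full(1000, rank, dtype="f8")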

Jarom

···


Thanks for this, Jarom,

Based on some subsequent experiments, I think we're going with multiple files plus a "master" file containing external links (not virtual datasets, since each dataset is confined to a single file), and (possibly) an "extents" dataset describing where the data can be found for each event in each data product. Life would be easier if I could send region references to the master file and have them "just work" with respect to the external links, but I haven't done that experiment yet (although I do have a forum post asking about this; see '"Re-seating" region references for external links?').
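For the record, the external-link part of that plan looks roughly like this in h5py (the file and group names below are hypothetical, and I've left out the "extents" bookkeeping):

    # Hypothetical sketch: each per-rank file holds an /events group; a master
    # file links them in via external links, so readers only open master.h5.
    import h5py

    # Stand-ins for the files the writer ranks would have produced independently.
    for r in range(4):
        with h5py.File("rank%d.h5" % r, "w") as f:
            f.create_group("events").create_dataset("data", data=range(10))

    with h5py.File("master.h5", "w") as master:
        for r in range(4):
            master["rank%d" % r] = h5py.ExternalLink("rank%d.h5" % r, "/events")

    # The links resolve when accessed, provided the per-rank files can be
    # found (by default HDF5 searches relative to the master file and the
    # working directory), so readers just traverse master.h5.
    with h5py.File("master.h5", "r") as master:
        print(master["rank2/data"][:])   # the data actually lives in rank2.h5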

Thanks again for your posts -- they have helped us home in on a promising track.

Best,

Chris.
