Writing Chunked Data in Parallel with Compression

Hi,

I apologize if this is a repeat question. I saw on the HDF5 website that writing chunked data in parallel with compression is not supported in version 1.6.3 (https://www.hdfgroup.org/hdf5-quest.html#p5comp). Has support been added since then?

To give some background, I'll briefly describe our data layout and needs. We have a 3D Cartesian domain decomposed over a 2D MPI process layout. Each process owns an independent hyperslab of the 3D dataset, and all hyperslabs have the same dimensions. We would like to write the data collectively to a single HDF5 file using a chunked layout, with compression applied to each chunk. The website mentions that compression is difficult to do with independent IO. Would it be possible in this case, where the IO is collective and all hyperslabs have equal dimensions?
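For concreteness, a minimal sketch of the collective, chunked write we have in mind is below (file and dataset names and grid sizes are placeholders, and it assumes the global dimensions divide evenly across the process grid); the commented-out H5Pset_deflate call is the part we are asking about:

#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs, grid[2] = {0, 0};
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, grid);             /* px x py process layout */
    int px = grid[0], py = grid[1];
    int ix = rank / py, iy = rank % py;           /* this rank's position in the grid */

    hsize_t global[3] = {NX, NY, NZ};
    hsize_t local[3]  = {NX / px, NY / py, NZ};   /* equal-sized hyperslabs */
    hsize_t start[3]  = {ix * local[0], iy * local[1], 0};

    double *buf = malloc(local[0] * local[1] * local[2] * sizeof *buf);
    for (hsize_t i = 0; i < local[0] * local[1] * local[2]; i++)
        buf[i] = (double)rank;

    /* One shared file, opened with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("field.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked layout, one chunk per process slab. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, local);
    /* H5Pset_deflate(dcpl, 6);   <-- the compression we would like to enable */

    hid_t filespace = H5Screate_simple(3, global, NULL);
    hid_t dset = H5Dcreate2(file, "u", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Select this rank's hyperslab and write collectively. */
    hid_t memspace = H5Screate_simple(3, local, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, local, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset); H5Sclose(filespace);
    H5Pclose(dcpl); H5Fclose(file); H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}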

Thanks,
Matthew

I don't think this is yet possible with HDF5. You can only do compression in non-parallel settings.

I think there is some work afoot in HDF5 to start supporting certain types of compression (fixed rate but variable loss for example) in parallel.

The challenge is that, with compression (in general), each chunk winds up being an unpredictable size, so predicting where chunks land in the file when they are packed contiguously next to each other requires additional communication that doesn't easily fit within the current library's design constraints. Each processor winds up needing to know the sizes of the chunks written by all other processors.
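To make that concrete, here is a rough sketch (illustrative only, plain MPI, not anything the HDF5 library exposes) of the minimum coordination involved: each rank needs a prefix sum of everyone's compressed chunk sizes just to learn where its own chunk would start in the file.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend each rank compressed its chunk to a different, unpredictable size. */
    MPI_Offset my_bytes = 1000 + 100 * rank;
    MPI_Offset offset = 0;

    /* Prefix sum of compressed sizes over lower ranks = start of my chunk. */
    MPI_Exscan(&my_bytes, &offset, 1, MPI_OFFSET, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) offset = 0;   /* MPI_Exscan leaves rank 0's result undefined */

    printf("rank %d: my chunk would start at byte %lld\n", rank, (long long)offset);
    MPI_Finalize();
    return 0;
}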

I've long argued for support for a 'target rate', though, where the library is told that a compression filter must hit a target compression ratio, say 2:1. It then does all of its work assuming each chunk is half the size of the original. If some chunks compress better than 2:1, that's fine: they get pad bytes added so they come out at exactly 2:1 (so you don't get any *more* advantage for those lucky chunks). If any chunk fails to compress 2:1, the whole write operation fails. But even that latter bit of logic is a little hard to handle in the current parallel library.
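A rough sketch of that budgeting logic with a hard-coded 2:1 target (purely illustrative; budget_chunk is a made-up helper, not an existing HDF5 filter or API):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pad an already-compressed chunk out to a fixed 2:1 budget, or report failure. */
static unsigned char *budget_chunk(const unsigned char *comp, size_t comp_bytes,
                                   size_t raw_bytes)
{
    size_t budget = raw_bytes / 2;           /* the fixed 2:1 target */
    if (comp_bytes > budget)
        return NULL;                         /* missed 2:1: the whole write would fail */
    unsigned char *out = calloc(budget, 1);  /* pad bytes are zeros */
    if (out)
        memcpy(out, comp, comp_bytes);       /* lucky chunks gain no extra advantage */
    return out;
}

int main(void)
{
    unsigned char fake_comp[300] = {0};      /* pretend a 1000-byte chunk shrank to 300 */
    unsigned char *padded = budget_chunk(fake_comp, sizeof fake_comp, 1000);
    printf("%s\n", padded ? "chunk fits the 2:1 budget (padded to 500 bytes)"
                          : "chunk missed 2:1: write fails");
    free(padded);
    return 0;
}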

Mark


Mark,

Thank you very much for your response and explanation. We will continue to use parallel IO when we need it, and perhaps use process-based (serial) writing with compression enabled when we are worried about the overall storage size.
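For reference, one way the compressed serial path could look is the sketch below (a single writer, no MPI; dataset name, sizes, chunk shape, and gzip level are just placeholders):

#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    hsize_t dims[3]  = {64, 64, 64};
    hsize_t chunk[3] = {16, 16, 16};
    double *buf = calloc(64 * 64 * 64, sizeof *buf);

    hid_t file  = H5Fcreate("field_serial.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);

    /* Chunking plus gzip is fine when a single process does the writing. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(file, "u", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    free(buf);
    return 0;
}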

Thanks,
Matthew
