hdf5 file organization questions?

My application saves a good-sized HDF5 dataset.

It takes a fairly long time to read back in (around 4 minutes).

If I do a simple repack on the file (h5repack infile outfile), reading it back in takes only around 1 minute.

What's h5repack doing that speeds up the reads so much, and how do I implement that in my application?

(I was using h5repack to test different chunking sizes, and everything I did in h5repack gave a similar time, including repacking to the same chunking scheme as the original dataset).

Thanks,
  Peter Steinberg

Peter,

There are a few optimizations that h5repack does. For example, when rewriting chunked data, h5repack uses H5Ocopy if the applied filters and chunk sizes stay the same. It also uses hyperslab selections that coincide with the chunk boundaries, and it avoids datatype conversion where possible.

The comparison with h5repack may not be fair. For example, when an application reads compressed data, time will be spent decoding it, while h5repack avoids the decoding step completely. Have you profiled your application to see where the time is spent?
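One quick way to see whether the reads themselves dominate is wall-clock timing around the application's read loop (a generic sketch in plain Python; `read_slab` is a hypothetical stand-in for whatever wraps the H5Dread calls in the application, not real application code):

```python
import time

def read_slab(i):
    # Hypothetical stand-in for the application's H5Dread-based slab read.
    time.sleep(0.001)

t0 = time.perf_counter()
for i in range(10):
    read_slab(i)
elapsed = time.perf_counter() - t0
print(f"10 slab reads took {elapsed:.3f} s")
```

Timing the same loop against the original and the repacked file shows whether the difference really lives inside the read calls.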

Elena

···

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Sep 6, 2013, at 2:28 PM, Steinberg, Peter wrote:

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

I'm not comparing directly with h5repack but with my application reading the dataset before and after running h5repack.

The quick profiling I've done showed most of the time being spent in the H5Dread calls; I didn't profile the internals of the HDF5 library.

The dataset (both before and after repacking) is H5T_IEEE_F32LE, 3 dimensional, 800 x 800 x 1796, with a chunk size of 1 x 1 x 1796 and compressed at deflate level 6.
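For scale, that chunk shape splits the dataset into an enormous number of tiny chunks, each of which needs its own chunk-index entry and its own gzip stream (a rough sketch of the arithmetic, plain Python):

```python
# Dataset: 800 x 800 x 1796 float32, chunked as 1 x 1 x 1796.
dims = (800, 800, 1796)
chunk = (1, 1, 1796)

# Number of chunks along each axis, multiplied into the total chunk count.
n_chunks = 1
for d, c in zip(dims, chunk):
    n_chunks *= -(-d // c)  # ceiling division

chunk_bytes = 4 * chunk[0] * chunk[1] * chunk[2]  # float32 = 4 bytes/value

print(n_chunks)     # 640000 chunks for the whole dataset
print(chunk_bytes)  # 7184 bytes per chunk before compression
```

640,000 chunks is a lot of B-tree index to traverse, which is consistent with the tens of megabytes of file metadata seen later in the thread.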

Also, running a simple h5repack on the output from the first h5repack shows a similar speed increase (h5repack outfile outfile2).

Thanks,
  Peter

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Elena Pourmal
Sent: Sunday, September 08, 2013 5:39 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?


Hi Peter,

My apologies. I think I misunderstood your question.

h5repack does rearrange the data in the HDF5 file. For example, some pieces of HDF5 metadata that were "scattered" in the original file end up in one block in the repacked file, and chunks are allocated one after another, which may help with disk access. But it is hard to say where the gain is without profiling the application against the original and then the repacked file.

Is the size of the repacked file much smaller?
Would it be possible to describe how the file was written? How does the application access the file?

If you are not changing the layout and compression parameters with h5repack, it is really surprising that h5repack helps so much. It is good to know that this may be an option :-)

Elena

···


On Sep 9, 2013, at 8:55 AM, "Steinberg, Peter" <peter.steinberg@thermofisher.com> wrote:


Original file: 4,411,531,805 bytes
Repacked file: 4,286,032,869 bytes

The dataset is 800 x 800 x 1796 floats (H5T_IEEE_F32LE).

The application writes out chunks of size 1 x 1 x 1796 as the data is acquired.

The data is read back in chunks of 1 x 800 x 1796 (earlier testing showed this gave the best performance while still allowing me to give useful read progress updates).

I had tried writing out the data in chunks of size 1 x 25 x 1796 (and various other values for the second index), but that was slower, as I try to flush the data to disk after each 1 x 1 x 1796 data block to minimize data loss in case of hardware/software issues.
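With the shapes above, every 1 x 800 x 1796 read window crosses one chunk per row of the second axis, so each slab read decodes hundreds of separate gzip streams (a plain-Python sketch of the arithmetic):

```python
# Read window vs. chunk shape: how many chunks does one read touch?
read_shape = (1, 800, 1796)
chunk = (1, 1, 1796)

chunks_per_read = 1
for r, c in zip(read_shape, chunk):
    chunks_per_read *= -(-r // c)  # ceiling division per axis

total_reads = 800  # one 1 x 800 x 1796 slab per plane of the first axis

print(chunks_per_read)                # 800 chunks (gzip streams) per slab read
print(chunks_per_read * total_reads)  # 640000 chunk reads for the full dataset
```

If the chunks were written to disk in acquisition order rather than read order, each slab read becomes 800 scattered disk accesses, which is plausibly what repacking (reallocating chunks one after another) improves.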

I can try rebuilding the HDF5 library and look for some way to profile it.

Peter

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Elena Pourmal
Sent: Tuesday, September 10, 2013 5:24 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?


Please try the h5stat tool to see why there is a difference in file sizes.

Elena

···


On Sep 10, 2013, at 5:34 PM, "Steinberg, Peter" <peter.steinberg@thermofisher.com> wrote:


Old file:
  File space information for file metadata (in bytes):
    Chunked datasets:
      Index: 49788198
  Summary of file space information:
    File metadata: 49795768 bytes
    Raw data: 4248174749 bytes
    Unaccounted space: 113561288 bytes
    Total space: 4411531805 bytes

New file:
  File space information for file metadata (in bytes):
    Chunked datasets:
      Index: 35848672
  Summary of file space information:
    File metadata: 35856248 bytes
    Raw data: 4248174749 bytes
    Unaccounted space: 1872 bytes
    Total space: 4284032869 bytes
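Reading the two summaries side by side: the raw (compressed) data is byte-identical, and what shrinks is the chunk-index metadata and, above all, the unaccounted free space (plain-Python arithmetic on the numbers above):

```python
# Differences between the two h5stat summaries.
old = {"metadata": 49795768, "raw": 4248174749, "unaccounted": 113561288}
new = {"metadata": 35856248, "raw": 4248174749, "unaccounted": 1872}

print(old["metadata"] - new["metadata"])        # 13939520: ~13.9 MB less chunk-index metadata
print(old["unaccounted"] - new["unaccounted"])  # 113559416: ~113.6 MB of free space reclaimed
print(old["raw"] == new["raw"])                 # True: compressed chunk data is unchanged
```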

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Elena Pourmal
Sent: Wednesday, September 11, 2013 8:29 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?


Peter,

Thank you for the info. Do you know if the application writes the file all at once, or does it close and reopen it periodically?

I don't have an explanation at this point for why you get a 4x performance boost. We will need to reproduce the problem here to understand the issue. Any detailed information on how the original file was written will help.

Elena

···


On Sep 11, 2013, at 8:40 AM, "Steinberg, Peter" <peter.steinberg@thermofisher.com> wrote:


The file is opened and closed periodically as it is being created.

The thread that writes the data to the file receives an indeterminate number of 1 x 1 x 1796 data blocks, which it writes out via H5Sselect_hyperslab and H5Dwrite; it then closes the file and waits for more data.
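For illustration, that write pattern amounts to selecting one chunk-sized hyperslab per acquired block (a plain-Python sketch of the offset arithmetic only; `hyperslab_for_block` is a hypothetical helper, and the actual I/O calls are the HDF5 C API ones named above):

```python
# Each acquired 1 x 1 x 1796 block lands at a distinct (i, j) offset
# in the 800 x 800 x 1796 dataset; the selection for block (i, j) is:
def hyperslab_for_block(i, j, depth=1796):
    start = (i, j, 0)      # offset passed to H5Sselect_hyperslab
    count = (1, 1, depth)  # one chunk-sized block per write
    return start, count

start, count = hyperslab_for_block(3, 7)
print(start, count)  # (3, 7, 0) (1, 1, 1796)
```

Because blocks arrive in acquisition order, the corresponding chunks end up on disk in that order too, interleaved with any metadata written between file open/close cycles.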

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Elena Pourmal
Sent: Thursday, September 12, 2013 11:31 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?

Peter,

Thank you for the info. Do you know if application writes the file at once or it closes and opens it periodically?

I don't have any explanation at this point why you get 4 time performance boost. We will need to reproduce the problem here to understand the issue. Any detailed information on how the original file was written will help.

Elena

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Sep 11, 2013, at 8:40 AM, "Steinberg, Peter" <peter.steinberg@thermofisher.com<mailto:peter.steinberg@thermofisher.com>> wrote:

Old File:
  File space information for file metadata (in bytes):
    Chunked datasets:
      Index: 49788198
  Summary of file space information:
    File metadata: 49795768 bytes
    Raw data: 4248174749 bytes
    Unaccounted space: 113561288 bytes
    Total space: 4411531805 bytes

New File:
  File space information for file metadata (in bytes):
    Chunked datasets:
      Index: 35848672
  Summary of file space information:
    File metadata: 35856248 bytes
    Raw data: 4248174749 bytes
    Unaccounted space: 1872 bytes
    Total space: 4284032869 bytes

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org<mailto:forum-bounces@lists.hdfgroup.org>] On Behalf Of Elena Pourmal
Sent: Wednesday, September 11, 2013 8:29 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?

Please try the h5stat tool to see why there is a difference in file sizes.

Elena

On Sep 10, 2013, at 5:34 PM, "Steinberg, Peter" <peter.steinberg@thermofisher.com<mailto:peter.steinberg@thermofisher.com>> wrote:

Original file: 4,411,531,805 bytes
Repacked file: 4,286,032,869 bytes

The dataset is 800 x 800 x 1796 floats (H5T_IEEE_F32LE).

The application writes out chunks of size 1 x 1 x 1796 as the data is acquired.

The data is read back in chunks of 1 x 800 x 1796 (earlier testing showed this gave the best performance while still allowing me to give useful read progress updates).
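
[Editor's note: that read pattern, sketched in Python/h5py; the application itself uses the C API, and the names here are made up.]

```python
import h5py

def read_by_planes(path, dset_name):
    """Yield the dataset one (1, NY, NZ) plane at a time.

    Each dset[i, :, :] read covers NY of the 1 x 1 x NZ chunks, and the
    per-plane loop gives a natural point to report read progress.
    """
    with h5py.File(path, "r") as f:
        dset = f[dset_name]
        for i in range(dset.shape[0]):
            yield dset[i, :, :]
```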

I had tried writing out the data in chunks of size 1 x 25 x 1796 (and various other values for the second index), but that was slower, as I try to flush the data to disk after each 1 x 1 x 1796 data block to minimize data loss in case of hardware/software issues.

I can try rebuilding the HDF5 library and look for some way to profile it.

Peter

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org<mailto:forum-bounces@lists.hdfgroup.org>] On Behalf Of Elena Pourmal
Sent: Tuesday, September 10, 2013 5:24 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?

Hi Peter,

My apologies. I think I misunderstood your question.

h5repack does rearrange the data in the HDF5 file. For example, pieces of HDF5 metadata that were "scattered" in the original file end up in one block in the repacked file, and chunks are allocated one after another, which may help with disk access. But it is hard to say where the gain is without profiling the application against the original and then the repacked file.

Is the repacked file much smaller?
Would it be possible to describe how the file was written? How does the application access the file?

If you are not changing the layout and compression parameters with h5repack, it is really surprising that h5repack helps so much. It is good to know that this may be an option :-)

Elena


On Sep 9, 2013, at 8:55 AM, "Steinberg, Peter" <peter.steinberg@thermofisher.com<mailto:peter.steinberg@thermofisher.com>> wrote:

I'm not comparing directly with h5repack but with my application reading the dataset before and after running h5repack.

The quick profiling I've done showed the big time use in the H5Dread calls; I didn't profile the internals of the HDF5 library.

The dataset (both before and after repacking) is H5T_IEEE_F32LE, 3 dimensional, 800 x 800 x 1796, with a chunk size of 1 x 1 x 1796 and compressed at deflate level 6.
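
[Editor's note: for reference, a dataset with exactly those properties can be created as below, sketched in Python/h5py rather than the C API; the dataset name and the shape parameter are illustrative.]

```python
import h5py

def create_acquisition_dataset(path, shape=(800, 800, 1796)):
    """Create a float32 (H5T_IEEE_F32LE) dataset, chunked 1 x 1 x NZ,
    with the deflate (gzip) filter at level 6 -- the layout described above."""
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "data",                      # illustrative name
            shape=shape,
            dtype="<f4",                 # H5T_IEEE_F32LE
            chunks=(1, 1, shape[2]),     # one acquired block per chunk
            compression="gzip",          # the deflate filter
            compression_opts=6,
        )
```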

Also, running a simple h5repack on the output from the first h5repack shows a similar speed increase (h5repack outfile outfile2).

Thanks,
  Peter

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org<mailto:forum-bounces@lists.hdfgroup.org>] On Behalf Of Elena Pourmal
Sent: Sunday, September 08, 2013 5:39 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf5 file organization questions?

Peter,

There are a few optimizations that h5repack does. For example, when rewriting chunked data h5repack uses H5Ocopy if the applied filters and chunk sizes stay the same. It also uses hyperslab selections that coincide with chunk boundaries, and it avoids datatype conversion when possible.
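
[Editor's note: the H5Ocopy path is also reachable from user code; h5py's Group.copy wraps it, so a minimal "repack" that moves chunks without decompressing them could look like the sketch below. This is an illustration of the mechanism, not h5repack's actual implementation.]

```python
import h5py

def copy_repack(src_path, dst_path):
    """Object-copy every top-level object into a fresh file.

    Because this goes through H5Ocopy, chunked data is copied without
    being decompressed, and the new file gets freshly laid-out metadata.
    """
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for name in src:
            src.copy(name, dst)
```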

The comparison with h5repack may not be fair. For example, when an application reads compressed data, time will be spent decoding, while h5repack avoids the decoding step completely. Have you profiled your application to see where the time is spent?

Elena

On Sep 6, 2013, at 2:28 PM, Steinberg, Peter wrote:

My applications save a good-sized hdf5 dataset.

It takes a fairly long time to read back in (around 4 minutes).

If I do a simple repack on the file (h5repack infile outfile) reading it back in only takes around 1 minute.

What's h5repack doing that speeds up the reads so much, and how do I implement that in my application?

(I was using h5repack to test different chunking sizes, and everything I did in h5repack gave a similar time, including repacking to the same chunking scheme as the original dataset).

Thanks,
  Peter Steinberg

Thank you! We will investigate.

Elena

···

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Sep 12, 2013, at 11:41 AM, "Steinberg, Peter" <peter.steinberg@thermofisher.com> wrote:

The file is opened and closed periodically as it is being created.

The thread that writes the data to the file gets an indeterminate number of 1 x 1 x 1796 data chunks which it writes out via H5Sselect_hyperslab and H5Dwrite, then it closes the file and waits for more data.
