file.h5.tar.gz is 20 times smaller. what's wrong?

Hi everyone,

I am using HDF5 via h5py to store simulation data. The data are hierarchical,
and I store them in a nested tree of HDF5 Groups. Each Group holds about 3
small Datasets, 3 Attributes, and fewer than 10 child Groups.

My problem is that writing is rather slow and the files are large. They also
seem very redundant: compressing the whole file with gzip gives an almost 20x
compression ratio, while turning on gzip compression for the individual
datasets has almost no effect on file size. I also tried the new
compact/indexed Group storage format, which reduces file size only a little.
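
For context, the way I write the file looks roughly like this (a made-up
sketch, not my actual code; all names and sizes are invented):

import h5py
import numpy as np

with h5py.File("br_0.h5", "w") as f:

    def write_node(parent, name, depth):
        grp = parent.create_group(name)
        # ~3 attributes per group
        grp.attrs["kind"] = "node"
        grp.attrs["depth"] = depth
        grp.attrs["label"] = name
        # ~3 small datasets per group, gzip requested per dataset
        for dname in ("a", "b", "c"):
            grp.create_dataset(dname, data=np.random.rand(5),
                               compression="gzip")
        # fewer than 10 child groups per node
        if depth < 3:
            for i in range(4):
                write_node(grp, "child_%d" % i, depth + 1)

    write_node(f, "root", 0)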

Am I doing something wrong in the layout of the file? The actual data
hierarchy cannot be changed, but maybe I can arrange the data within the file
differently?

Here is a link to an example file if anyone would like to have a look:
http://dl.dropbox.com/u/5077634/br_0.h5.tar.gz (760k compressed, 3500
Groups, 7000 Datasets)

thx for any hints!

···


You can check to see if the datasets are compressed using the h5dump command with the -p option:
h5dump -p -H filename.h5

Compressed datasets will have something like the following listed:
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 5 }
   }

Since you're not seeing a size decrease, you may want to double-check that the datasets are actually being compressed. I've made this mistake in the past.
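
For example, with h5py the filter has to be requested for each dataset
individually; something like this (the file and dataset names are just
placeholders) lets you confirm it took effect:

import h5py
import numpy as np

with h5py.File("filename.h5", "w") as f:
    # compression is per-dataset; it is not inherited from the file or group
    f.create_dataset("data", data=np.arange(1000),
                     compression="gzip", compression_opts=5)

with h5py.File("filename.h5", "r") as f:
    print(f["data"].compression)       # 'gzip' if the filter is active, else None
    print(f["data"].compression_opts)  # 5
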
Cheers,
-Corey

···


Hi nls,


Perhaps you've got a problem similar to the one I asked about
here in October last year? I noticed that when creating a lot
of groups with only small data sets, the files got rather large
compared to what I was expecting. The result of the discussion
was that creating a group isn't inexpensive and takes on the
order of one kilobyte. The friendly answer by Quincey Koziol can
be found here:

http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2010-October/003801.html

I wouldn't be too surprised if the information stored for the
groups has more common patterns than the data and thus is easier
to compress. Of course, I don't know if this has any relevance to
your problem; your description just rang a bell ;-)

                         Best regards, Jens

--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

···

Jens,

From running h5stat on the file, it looks like about 2/3 of the file space (~8 MB) is taken up by dataset chunk indexes. Since the datasets are so small, it would probably be a good idea to store them as contiguous (or even compact) instead of chunked. The rest of the space is mostly object headers (~1.5 MB for groups, ~2 MB for datasets); the new file format (H5Pset_libver_bounds(..., H5F_LIBVER_LATEST, H5F_LIBVER_LATEST)) should help with that, if you aren't already using it. I also noticed that half of the dataset object header space is unused, so repacking the file may help there as well.
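
For illustration, the difference looks roughly like this in h5py (the file,
group and dataset names are made up; recent h5py versions also accept the
latest file format directly as a File keyword):

import h5py
import numpy as np

# libver="latest" corresponds to
# H5Pset_libver_bounds(..., H5F_LIBVER_LATEST, H5F_LIBVER_LATEST)
with h5py.File("example.h5", "w", libver="latest") as f:
    grp = f.create_group("node_0001")
    # no 'chunks' argument -> contiguous layout, no chunk index to store
    grp.create_dataset("small", data=np.random.rand(10))
    # chunks=... (or compression/maxshape) -> chunked layout plus a chunk index
    grp.create_dataset("large", data=np.random.rand(100000), chunks=(4096,))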

Thanks,
-Neil

···


Hi Neil,


Sorry, I'm a bit confused here. Is this about the file from the
OP ("nls") or about the file created by the test program I posted
back in October? I don't remember using chunked data sets when I
was asking that question. But it's quite a bit of time ago and my
memory may be playing tricks on me...

                             Best regards, Jens
--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

···

Hi everyone,

thanks for the helpful comments.

I did check that
1. compression is actually on
2. I am using the new 1.8 group format
(this actually required me to write my first nontrivial Cython wrapper, since
h5py does not provide access to LIBVER_LATEST)

Following the helpful advice on the chunk index overhead, I tried to use
contiguous storage. Unfortunately, I again ran into an unsupported feature:
h5py only supports resizable Datasets when they're chunked, even when using
the "low-level" functions which wrap the HDF5 C API:

"h5py._stub.NotImplementedError: Extendible contiguous non-external dataset
(Dataset: Feature is unsupported)"

Since I do need resizing, I guess I am stuck with chunked Datasets for now.
I tried different chunk sizes but that did not make a noticeable difference.

In conclusion, I see no way to get less than about 15x file size overhead
when using HDF5 with h5py for my data....

cheers, Nils

···


Jens,

Oops, sorry, I meant to address it to Nils. I should have paid closer attention.

-Neil

···


Not that this is very helpful, but I thought the requirement that
'resizable' datasets be chunked came from the HDF5 library itself, not
from anything above it such as h5py.


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

···

Hi Nils,


Hmmm, a 15x file size overhead doesn't make sense. I think we can do better.

Would it be possible for you to write a C program that does the same thing as your Python script?
Can you send us the output of "h5dump -H -p" on your file? Also, could you please run h5stat on the file and post that output too?

Thank you!

Elena

···


Yes, absolutely true: the requirement comes from the HDF5 library itself,
not from h5py. Chunking is required whenever filters and/or the "resizable"
feature are used for storing datasets.
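
In h5py terms it looks like this (a small sketch with invented names; either
feature makes the dataset chunked, whether or not you pass chunks yourself):

import h5py
import numpy as np

with h5py.File("demo.h5", "w") as f:
    # requesting a filter or an unlimited dimension forces chunked storage;
    # h5py picks a chunk shape automatically if none is given
    a = f.create_dataset("compressed", data=np.random.rand(100),
                         compression="gzip")
    b = f.create_dataset("resizable", shape=(10,), maxshape=(None,),
                         dtype="f8")
    print(a.chunks, b.chunks)   # both non-None, i.e. chunked

    # a plain fixed-size dataset without filters stays contiguous
    c = f.create_dataset("plain", data=np.random.rand(100))
    print(c.chunks)             # None -> contiguous layout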

Elena

···


Hi Elena,

as was pointed out, I was unfair in blaming h5py. It turns out that HDF5
supports unlimited dimensions only in chunked Datasets.

> Would it be possible for you to write a C program that does the same thing
> as your Python script?

No, unfortunately. I don't know enough C to do that and don't have the time
to learn it now.

> Can you send us output of "h5dump -H -p" on your file? Also, could you
> please run h5stat on the file and post that output too?

Neil Fortner was already kind enough to do that by following my link to the
file in the OP; see his post:
http://hdf-forum.184993.n3.nabble.com/file-h5-tar-gz-is-20-times-smaller-what-s-wrong-tp2509949p2512318.html

Another idea I had was to make only one big Dataset which contains all of
the actual data, and then use the hierarchy of Groups in the file only to
store region references into the big Dataset. That way, I could possibly
avoid the overhead of chunk indexes for the small portions of data that are
currently spread all over the nested Groups. Does that sound like a good
idea?
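
Roughly what I have in mind, as a sketch with made-up names (assuming I
understand correctly how region references are stored and dereferenced in
h5py):

import h5py
import numpy as np

ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)

with h5py.File("pool.h5", "w") as f:
    # one big dataset holding all the actual numbers
    pool = f.create_dataset("pool", data=np.random.rand(10000))

    # each node of the hierarchy only stores a region reference into the pool
    node = f.create_group("tree/node_0001")
    ref_ds = node.create_dataset("data_ref", shape=(1,), dtype=ref_dtype)
    ref_ds[0] = pool.regionref[100:110]

with h5py.File("pool.h5", "r") as f:
    ref = f["tree/node_0001/data_ref"][0]
    target = f[ref]        # dereference -> the pool dataset
    values = target[ref]   # apply the stored selection -> the 10 values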

Nils

···


Hi,

just to report some partial progress: I have changed my program to write
contiguous instead of chunked Datasets at each of the many Group nodes; the
ability to resize turned out not to be necessary once I introduced an
in-memory cache for the data to be written.
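
In h5py terms the change amounts to something like this (a sketch; the paths
and values are invented):

import h5py
import numpy as np

# collect each node's values in memory first ...
buffered = {
    "tree/node_0001/data": np.random.rand(10),
    "tree/node_0002/data": np.random.rand(7),
}

# ... then write every array once, at its final size and without a 'chunks'
# argument, so each dataset is stored contiguously (no chunk index)
with h5py.File("br_0.h5", "w") as f:
    for path, values in buffered.items():
        f.create_dataset(path, data=values)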

Result: file size decreased to 2/3, so only about 5x overhead now. Much better.
Incidentally, there seems to be no speed difference between writing chunked
and contiguous datasets, which probably means that my bottleneck for writing
is not the raw disk speed.

cheers, Nils

···


Sorry, I meant the file size is down _by_ 2/3.
