HDF5 file 4x larger than ASCII?

Now this is interesting: I have an HDF5 file which is 4x larger than
its text representation produced by "h5ls -rvd" or "h5dump".

The HDF5 file is 6MB, and available here:

http://sciviz.cct.lsu.edu/data/h5path/path1.f5

Its output from "h5ls -rvd" is 1.4MB:

http://sciviz.cct.lsu.edu/data/h5path/path1.h5ls

And "h5dump" on same file brings it to 1.5MB:

http://sciviz.cct.lsu.edu/data/h5path/path1.h5dump

I'm aware that this kind of data layout is inefficient for the
data stored here; it consists of a time series of just three points
at each time step, each of them stored in some subgroups.

However, I did not expect it to be *that* inefficient, with the
ASCII dump being 4x smaller than the corresponding binary HDF5 file
(using HDF5 1.8.2-post13).

It's not really a performance issue here, since the data file is
still small, and the layout is intended for really large data where
this metadata overhead becomes negligible. Still, I'm wondering whether
there is a "sufficiently easy" way to reduce the file size
significantly? Maybe there is some "pack all metadata together" property
setting or similar?

Cheers,
  Werner


--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362


Hi Werner,

Sorry, but I am not surprised, nor do I find it interesting. The
programs h5ls and h5dump simply do not show the incredible amount of
duplicated metadata that you are storing in the HDF5 file. You wrote:
"I did not expect it (= HDF5) to be *that* inefficient"; HDF5 is not
inefficient, it simply writes to the file what you have asked it to write.

To improve your layout:
* Ask yourself: "Do I need the definition of the compound "point"?" If
so, write the definition of "point" to the root of the HDF5 file, or
even store the whole "Chart" group at the root, only once.
* Likely you do not need all the metadata overhead if you switch to
the Packet Table API (H5PT) and write date and point into a table;
a rough sketch follows below. Alternatively, you could use the older
Table API (H5TB). There are nice examples available on the HDF5 website.
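For what it's worth, here is a rough, untested sketch of the packet-table
variant, assuming one record per time step consisting of a time stamp and a
single point (the struct, field and file names are made up for illustration):

    #include "hdf5.h"
    #include "hdf5_hl.h"   /* high-level library: H5PT packet tables */

    /* One packet per time step: a time stamp plus one 3D point. */
    typedef struct {
        double time;
        float  x, y, z;
    } path_record_t;

    int main(void)
    {
        hid_t file, rec_type, table;
        path_record_t rec = { 0.0, 1.0f, 2.0f, 3.0f };

        /* In-file compound type matching the C struct. */
        rec_type = H5Tcreate(H5T_COMPOUND, sizeof(path_record_t));
        H5Tinsert(rec_type, "time", HOFFSET(path_record_t, time), H5T_NATIVE_DOUBLE);
        H5Tinsert(rec_type, "x",    HOFFSET(path_record_t, x),    H5T_NATIVE_FLOAT);
        H5Tinsert(rec_type, "y",    HOFFSET(path_record_t, y),    H5T_NATIVE_FLOAT);
        H5Tinsert(rec_type, "z",    HOFFSET(path_record_t, z),    H5T_NATIVE_FLOAT);

        file = H5Fcreate("path1_table.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* One packet table instead of one group tree per time step;
           chunk size 512 records, no compression (-1). */
        table = H5PTcreate_fl(file, "/path", rec_type, (hsize_t)512, -1);

        /* Append one record per time step as they are produced. */
        H5PTappend(table, 1, &rec);

        H5PTclose(table);
        H5Tclose(rec_type);
        H5Fclose(file);
        return 0;
    }

Your three points per time step could then go into one such table (or three
of them), instead of a group tree per step.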

Good luck.

Richard


Hi Richard,

the definition of the compound type "point" resides in /Charts/Cartesian3D -
would you indeed expect it to make a significant difference if that type
were instead defined at the root of the HDF5 file? It occurs only once; the
datasets created later don't repeat this definition, but use it as a shared
datatype. I would expect it to be merely h5ls that displays its definition
repeatedly; as h5dump shows, the datasets rather link to the original definition.
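
Just to illustrate what I mean by "shared datatype", the writing side does
essentially the following - a simplified sketch, not my actual code, and with
made-up field and dataset names:

    #include "hdf5.h"

    /* Sketch: commit the compound "point" type once, then let every dataset
       reference the committed type instead of carrying its own copy. */
    void write_points(hid_t file, const float *xyz, hsize_t npoints)
    {
        hid_t point_t = H5Tcreate(H5T_COMPOUND, 3 * sizeof(float));
        H5Tinsert(point_t, "x", 0 * sizeof(float), H5T_NATIVE_FLOAT);
        H5Tinsert(point_t, "y", 1 * sizeof(float), H5T_NATIVE_FLOAT);
        H5Tinsert(point_t, "z", 2 * sizeof(float), H5T_NATIVE_FLOAT);

        /* Create intermediate groups automatically when creating by full path. */
        hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
        H5Pset_create_intermediate_group(lcpl, 1);

        /* Commit (share) the type at one well-known location. */
        H5Tcommit2(file, "/Charts/Cartesian3D/point", point_t,
                   lcpl, H5P_DEFAULT, H5P_DEFAULT);

        /* Datasets created with the committed type only store a reference to it. */
        hid_t space = H5Screate_simple(1, &npoints, NULL);
        hid_t dset  = H5Dcreate2(file, "/T000000/Positions", point_t, space,
                                 lcpl, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, point_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, xyz);

        H5Dclose(dset);
        H5Sclose(space);
        H5Pclose(lcpl);
        H5Tclose(point_t);
    }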

The local "Charts" group at each timestep contains not more than a subgroup
with a softlink, but is otherwise empty. Admitted, this could be optimized
by linking all the Charts to be exactly the same, like the first one, if
that is indeed the culprit.
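
If I went that route, it would be something like this (again only a sketch,
with made-up group names):

    #include <stdio.h>
    #include "hdf5.h"

    /* Sketch: reuse the "Charts" group of the first time step for all later
       steps via hard links, instead of creating a new, identical group each time. */
    void link_charts(hid_t file, const char *timestep_group)
    {
        char name[256];
        snprintf(name, sizeof(name), "%s/Charts", timestep_group);
        H5Lcreate_hard(file, "/T000000/Charts", file, name,
                       H5P_DEFAULT, H5P_DEFAULT);
    }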

It might be that just creating a subgroup adds something like a 1kB block
to the HDF5 file; that is what I did not expect, as I had rather expected
subgroup definitions to be compacted with one another, such that e.g.
10 subgroup definitions would share the same space as one, as long as they
all fit into one such 1kB block on disk. This of course might have been
wishful thinking, as I don't know the internal HDF5 file layout well enough.

This file is just a particular case, where three points are stored as a time
series. Indeed, that could fit well into a table. The purpose of this layout
is, however, that it should be compatible with more general cases, where the
number of points may change over time and there are many more points. I
have files of hundreds of time series with millions of points at each time
step, several hundred GB in size, where this layout is most efficient and the
metadata overhead is negligible. That's the really important case; storing the
data as a table would probably be inefficient there - though I have not
tried storing 300GB with variable-length rows in a table.

I still have the impression that HDF5 writes more to the file than I asked
it to, such as creating (maybe?) a 1kB block for each subgroup, which
by itself would just be a short string. Therefore I'm just wondering whether
there is a way to request some "metadata compacting mode" that allows storing
multiple group definitions in the same block, instead of creating a new one
for each group. This is just speculation that this might be the reason for
the file size here.

Btw., the definition of the type "point" is required to distinguish
among different kinds of "3-tuples" of floats. In this special case,
there are only points, but there could also be tangential vectors,
co-vectors, matrix rows, matrix columns, etc. All of them consist
of just three floats in memory, but have algebraically different
properties, which should be detectable when reading the file. That
is why there is this type definition in the file and not just three
floats. But it would still surprise me if such a type definition
led to a blow-up in file size.
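
For illustration, the reading side can find out which named type a dataset
uses roughly like this (once more only a sketch, with a made-up dataset path):

    #include <stdio.h>
    #include "hdf5.h"

    /* Sketch: determine which committed (named) type a dataset uses, e.g. to
       tell a "point" from a "covector" that has the identical memory layout. */
    void print_named_type(hid_t file, const char *dset_path)
    {
        hid_t dset = H5Dopen2(file, dset_path, H5P_DEFAULT);
        hid_t type = H5Dget_type(dset);

        if (H5Tcommitted(type) > 0) {   /* the type is a named type in the file */
            char name[256];
            H5Iget_name(type, name, sizeof(name));
            printf("%s uses the named type %s\n", dset_path, name);
        } else {
            printf("%s uses an anonymous (transient) type\n", dset_path);
        }

        H5Tclose(type);
        H5Dclose(dset);
    }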

If h5ls/h5dump don't show duplicated metadata, is there another tool
that would allow identifying it?

  Werner

Hi Werner,

Werner Benger wrote:

I still have the impression that HDF5 writes more to the file than I asked
it to, such as creating (maybe?) a 1kB block for each subgroup, which
by itself would just be a short string. Therefore I'm just wondering whether
there is a way to request some "metadata compacting mode" that allows storing
multiple group definitions in the same block, instead of creating a new one
for each group. This is just speculation that this might be the reason for
the file size here.

  Hmm, try using the "latest" file format, with the H5Pset_libver_bounds(fapl, <LATEST>, <LATEST>) call. The "classic" format is less space efficient for many use cases. (Of course, tools that can't be updated to use the 1.8.x library won't be able to understand the file.)
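
  A minimal sketch of what that looks like when creating the file:

    #include "hdf5.h"

    /* Sketch: create the file with the 1.8 ("latest") format so group and
       object metadata is stored more compactly than with the classic format. */
    hid_t create_latest_format_file(const char *name)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);
        return file;   /* caller closes it with H5Fclose() */
    }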

If h5ls/h5dump don't show duplicated metadata, is there another tool
that would allow identifying it?

  You could try the 'h5stat' tool, although I don't think it's going to help much here.
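  (For what it's worth, running plain "h5stat path1.f5" without options should
  print the object counts and metadata/storage summaries, if I remember the
  tool's default output correctly.)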

  Quincey
