Vlen compression

marcialieec · October 5, 2011, 3:22pm

Hi!

I'm using HDF5 to create an intermediate storage stage for an ESA project.
Currently our HDF5 file structure is the following:

/Dset1 (with a compound datatype)
/Dset2 (vlen datatype)
/Dset3 (vlen datatype)

Most of the data is in datasets 2 and 3.

We are interested in using the HDF5 chunking and compression features.
However, although we have enabled GZIP compression, the file is barely
compressed at all. I've been searching and found some posts saying that
compression on vlen datasets only acts on the "references" of the vlen
dataset.

My questions are:

Is this still the case? Is there a plan to have this
"solved/modified" in the near future?
What are our other options if vlen datasets cannot be compressed? Assuming
our data is inherently variable-length, of course.

Thanks!
Marcial

--
View this message in context: http://hdf-forum.184993.n3.nabble.com/Vlen-compression-tp3396836p3396836.html
Sent from the hdf-forum mailing list archive at Nabble.com.

derobins · October 5, 2011, 3:50pm

Hi Marcial,

Yes, this is still the case. I'm pretty sure this is not going to be
fixed anytime soon since it involves major library changes, though
it's something we'd like to do.

You have two other options:

1) If your data is only a little variable, you can create a regular
chunked and compressed dataset that is guaranteed to fit any
reasonable data, and store your data with some trailing empty space.
The compression will usually efficiently handle the trailing space,
even if it's quite large. The downside is that guessing a good size
can be difficult and you'll have to come up with a scheme for handling
data that exceed the pre-guessed fixed size. If you have a situation
where you can pre-scan the input, you can always compute the fixed
sizes to use.
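
As a rough illustration of option 1, here is a minimal Python sketch using only the standard library, not HDF5 itself. The records, the `pad_records` helper and the zero fill value are all made up for the example. The fixed width comes from a pre-scan, and `zlib` stands in for the GZIP (deflate) filter, which is built on the same algorithm. In a real file the padded rows would go into an ordinary chunked, compressed 2D dataset, and the true lengths would need to be stored somewhere (for instance in a small companion dataset) so the padding can be stripped on read.

```python
import zlib

def pad_records(records, width, fill=0):
    """Pad each variable-length byte record to a fixed width."""
    padded = []
    for rec in records:
        if len(rec) > width:
            # A real scheme needs a policy for oversized records.
            raise ValueError(f"record of {len(rec)} bytes exceeds width {width}")
        padded.append(rec + bytes([fill]) * (width - len(rec)))
    return padded

# Hypothetical variable-length records (3 to 40 bytes long).
records = [bytes(range(n)) for n in (3, 17, 40, 8)]

# Pre-scan the input to choose the fixed width.
width = max(len(r) for r in records)

rows = pad_records(records, width)
lengths = [len(r) for r in records]  # kept so padding can be stripped later

# Compress the fixed-size rows as one block, roughly what a chunk filter does.
raw = b"".join(rows)
compressed = zlib.compress(raw, 6)

# Reading back: decompress, split into rows, trim each to its true length.
restored_raw = zlib.decompress(compressed)
restored = [restored_raw[i * width:i * width + n] for i, n in enumerate(lengths)]
```

Storing the lengths separately avoids having to guess where the data ends when the fill value can also occur inside a record.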

2) Store your data concatenated in a 1D dataset and use a second
dataset of start/end indexes to get the data for each element. This
is basically how the HDF5 variable-length datatype works; you are just
implementing it at a higher level using the public API. This will
probably result in a larger file than the first scheme that I mentioned
and slower access, but your mileage may vary, so you'll want to try
both on realistic data. It does have the advantage of handling data of
any size.
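
Option 2 can be sketched in a few lines of plain Python, again without HDF5 and with made-up names (`pack`, `get`) and sample values. The `data` list plays the role of the concatenated 1D dataset and `index` the role of the start/end dataset. Using an exclusive end index is my assumption; any consistent convention works. In a real file both would be ordinary chunked, compressed datasets, so the compression filter sees the actual values rather than the vlen references.

```python
def pack(records):
    """Concatenate records and build a (start, end) index, end exclusive."""
    data = []
    index = []
    for rec in records:
        start = len(data)
        data.extend(rec)
        index.append((start, len(data)))
    return data, index

def get(data, index, i):
    """Return element i by slicing the concatenated data."""
    start, end = index[i]
    return data[start:end]

# Hypothetical variable-length records, including an empty one.
records = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9]]

data, index = pack(records)
third = get(data, index, 2)  # the record [4, 5]
```

Reading one element costs two accesses, the index lookup and then the data slice, which is one reason this layout may be slower than the fixed-size one.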

I've also kicked around the idea of a hybrid scheme that uses a
fixed-size dataset where the first bytes are interpreted as indexes
into a secondary concatenated dataset if a magic byte, bit in a bitmap
index, etc. is set, but this would be harder to implement in a nice
way using the public API.

Cheers,

Dana

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

marcialieec · November 17, 2011, 11:55am

Hi Dana,

Thank you for your answer. For now we have managed to split the data into
datasets with the same data length. This has added some complexity to the
file structure and has some side effects, but it's bearable for now.

Sorry for the delay, but we are building quite a complex system to run on a
supercomputer, so it took some time to assess the best approach.

Thanks again!
Marcial
