Compressing small datasets

Hi All,

I'm constructing a file which contains many small datasets (2D arrays of
floats, approximately 148 floats per array).

I'd like to compress this data in the HDF5 file to save space. I've
tried running it through h5repack (h5repack -v -f GZIP=1 file1 file2),
but I get no compression. I'm assuming this is because the datasets
are too small to compress.

Compressing the whole file with gzip gives 10x compression.

Does anybody know of a workaround for compressing small datasets? Or a
method of compressing the entire file and being able to access it via
HDF5 in its compressed form (a la zlib)?

Suggestions welcome.

Cheers,

N



Hi Nava,


On Jan 9, 2009, at 7:02 AM, new@sgenomics.org wrote:

> I'd like to compress this data in the HDF5 file to save space. I've
> tried running it through h5repack (h5repack -v -f GZIP=1 file1 file2),
> but I get no compression. I'm assuming this is because the datasets
> are too small to compress.
>
> Does anybody know of a workaround for compressing small datasets? Or a
> method of compressing the entire file and being able to access it via
> HDF5 in its compressed form (a la zlib)?

  Using "GZIP=1" gives almost no compression with gzip, try using "GZIP=9" and see what the results are like. If you are using the 1.8.x versions of HDF5, try using the "-L" flag for h5repack also, to use the more efficient storage options for groups available in the 1.8.x releases.

    Quincey

> I'd like to compress this data in the HDF5 file to save space. I've
> tried running it through h5repack (h5repack -v -f GZIP=1 file1 file2),
> but I get no compression. I'm assuming this is because the datasets
> are too small to compress.

By default, h5repack does not compress datasets smaller than 1024 bytes (your case). You can change that default with the -m option:

-m M, --minimum=M Do not apply the filter to datasets smaller than M

Try running in verbose mode first:

h5repack -v -f GZIP=9 big.hdf5 big.hdf5.comp

It gives you a list of the datasets that were and were not compressed, and the reason why each was not compressed (in this case, because it is smaller than 1024 bytes).

Then you can do:

h5repack -v -m X -f GZIP=9 big.hdf5 big.hdf5.comp

where X is the size in bytes at which you want h5repack to start applying compression.

Running

h5repack -h

gives you the usage.

Pedro


--------------------------------------------------------------
Pedro Vicente (T) 217.265-0311
pvn@hdfgroup.org
The HDF Group. 1901 S. First. Champaign, IL 61820


  Using "GZIP=1" gives almost no compression with gzip, try using
"GZIP=9" and see what the results are like. If you are using the 1.8.x
versions of HDF5, try using the "-L" flag for h5repack also, to use the
more efficient storage options for groups available in the 1.8.x
releases.

I'm using h5repack version 1.8.1.

With:

h5repack -L -f GZIP=9 big.hdf5 big.hdf5.comp

I get:

-rw-r--r-- 1 new new 105M Jan 9 13:08 big.hdf5
-rw-r--r-- 1 new new 99M Jan 9 14:02 big.hdf5.comp

compared with:

gzip big.fastprb.hdf5
-rw-r--r-- 1 new new 9.6M Jan 9 13:08 big.fastprb.hdf5.gz

So I'm guessing no compression is actually happening with h5repack. In
fact, if I remove the -L option:

-rw-r--r-- 1 new new 109078560 Jan 9 13:08 big.hdf5
-rw-r--r-- 1 new new 109640800 Jan 9 14:10 big.hdf5.comp

The files are almost identical in size.

N


> By default, h5repack does not compress datasets smaller than 1024
> bytes (your case). You can change that default with the -m option:
>
> h5repack -v -m X -f GZIP=9 big.hdf5 big.hdf5.comp
>
> where X is the size in bytes at which you want h5repack to start
> applying compression.

That seems to apply the compression, but:

h5repack -m 10 -L -f GZIP=9 big.hdf5 big.hdf5.comp

-rw-r--r-- 1 new new 105M Jan 9 13:08 big.hdf5
-rw-r--r-- 1 new new 331M Jan 9 15:56 big.hdf5.comp

I guess this is predictable, as the gzip overhead is significant. Are
there any other options, other than gzipping the whole file and
un-gzipping it when I want to use it?

N


> That seems to apply the compression, but:
>
> h5repack -m 10 -L -f GZIP=9 big.hdf5 big.hdf5.comp
>
> -rw-r--r-- 1 new new 105M Jan 9 13:08 big.hdf5
> -rw-r--r-- 1 new new 331M Jan 9 15:56 big.hdf5.comp
>
> I guess this is predictable, as the gzip overhead is significant. Are
> there any other options, other than gzipping the whole file and
> un-gzipping it when I want to use it?

No, h5repack doesn't have any other compression options. To achieve that 10x compression ratio, I believe the only way is to gzip the file.

Pedro


> No, h5repack doesn't have any other compression options. To achieve
> that 10x compression ratio, I believe the only way is to gzip the file.

Are any available to me if I restructure my HDF5 file?

One option, I guess, would be to create one large dataset and store indexes into it?

I guess the other option is to add another compression method using the
H5Z interface?

Or are there other methods I'm missing?
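
For context, registering a custom filter through H5Z looks roughly like
the sketch below. This is a minimal sketch, not working code: the filter
ID, name, and callback body are all illustrative (IDs 256-511 are set
aside for testing new filters), and "dcpl" is assumed to be a chunked
dataset creation property list.

#include "hdf5.h"

#define MY_FILTER_ID 306  /* hypothetical ID in the 256-511 test range */

/* Compresses on write; decompresses when H5Z_FLAG_REVERSE is set.
   The actual (de)compression of *buf is left as a stub here. */
static size_t my_filter(unsigned flags, size_t cd_nelmts,
                        const unsigned cd_values[], size_t nbytes,
                        size_t *buf_size, void **buf)
{
    /* ... (de)compress *buf in place, return the new size, 0 on error ... */
    return nbytes;
}

static const H5Z_class2_t my_filter_class = {
    H5Z_CLASS_T_VERS,           /* version of the class struct */
    (H5Z_filter_t)MY_FILTER_ID, /* filter identifier */
    1, 1,                       /* encoder present, decoder present */
    "my filter",                /* filter name for error messages */
    NULL, NULL,                 /* optional can_apply / set_local callbacks */
    my_filter                   /* the filter function itself */
};

H5Zregister(&my_filter_class);
H5Pset_filter(dcpl, MY_FILTER_ID, H5Z_FLAG_MANDATORY, 0, NULL);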

Thanks for your help,

N


On Friday, 09 January 2009, new@sgenomics.org wrote:

> > No, h5repack doesn't have any other compression options. To achieve
> > that 10x compression ratio, I believe the only way is to gzip the
> > file.
>
> Are any available to me if I restructure my HDF5 file?

You could try szip, which is supported by HDF5. However, my impression
is that you won't get much compression either.
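
If you do want to try it, it is a one-line change on the dataset
creation property list. A sketch, assuming a chunked property list
"dcpl" and an szip-enabled HDF5 build; the option mask and block size
below are just typical values:

/* nearest-neighbor coding, 16 values per block */
H5Pset_szip(dcpl, H5_SZIP_NN_OPTION_MASK, 16);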

> One option, I guess, would be to create one large dataset and store
> indexes into it?

That would help enormously indeed (see later).

> I guess the other option is to add another compression method using
> the H5Z interface?

Well, one of the niceties of the HDF5 format is that it stores
information about the data types. Knowing that, the shuffle filter can
pre-condition (reorder) the data so that compressors work better.
Normally, I'm used to seeing around a 2x improvement in compression
ratio when using it. Give it a try and see if it helps you.
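
Wiring shuffle in front of gzip takes just two property-list calls. A
minimal sketch, assuming your 148-float arrays are stored as 2 x 74
datasets (the name "dcpl" and the chunk shape are illustrative):

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk[2] = {2, 74};   /* assumed shape of one small array */
H5Pset_chunk(dcpl, 2, chunk); /* filters require chunked storage */
H5Pset_shuffle(dcpl);         /* reorder bytes before compression */
H5Pset_deflate(dcpl, 9);      /* then gzip at level 9 */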

> Or are there other methods I'm missing?

Well, there are many methods that you could use, but you must help HDF5
a bit. For example, can you sort your datasets before saving them?
With this, shuffle+zlib can do a very good job.

But frankly, given your data file size compared with such small
datasets, I suspect that most of the space in the file goes to keeping
the HDF5 metadata for every dataset. I'm afraid that, if you really
want compression to have any positive effect on your file size, you
should drastically reduce the number of datasets and make larger ones.
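
A sketch of that restructuring, assuming each small array can be
flattened to one row of 148 floats (all names here, such as "file",
"narrays", "i", and "array_i", are illustrative): put everything in one
chunked, compressed dataset and write each array as a hyperslab:

hsize_t dims[2]  = {narrays, 148};  /* one row per small array */
hsize_t chunk[2] = {256, 148};      /* many arrays per chunk, so gzip sees real data */

hid_t space = H5Screate_simple(2, dims, NULL);
hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 2, chunk);
H5Pset_shuffle(dcpl);
H5Pset_deflate(dcpl, 9);
hid_t dset  = H5Dcreate2(file, "/all_arrays", H5T_NATIVE_FLOAT, space,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);

/* write the i-th array into row i */
hsize_t start[2] = {i, 0}, count[2] = {1, 148};
hid_t mspace = H5Screate_simple(2, count, NULL);
H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, space, H5P_DEFAULT, array_i);

Reading one array back is the same hyperslab selection with H5Dread, so
the per-array access pattern is preserved.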

Hope that helps,


--
Francesc Alted

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.