Best way to compress a bunch of floats?

Hi all,

I need to save a matrix of 40000x145 single-precision floating point
numbers and I’m trying to figure out the best way to store them
compressed. By “best” I mostly mean the way that saves the most storage
space without a terrible performance loss. I’m guessing some storage
savings are possible since those numbers all fall into the range
[-100.0, +100.0].

I’m not having much luck. These are the file sizes I see with GZIP
and SZIP:

No compression -> 22802KB
GZIP (deflate, level 6) -> 21247KB
SZIP -> 22812KB

Chunk dimensions are set to (40000,1); I’ve tried different combinations
without much difference. For SZIP I’m using H5Pset_szip(
HDF5Constants.H5_SZIP_NN_OPTION_MASK, 8). I’m calling the HDF5 library
from Java.
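
In case it’s useful, this is roughly what my setup looks like (a trimmed
sketch rather than my exact code; the package and constant names below
assume the ncsa.hdf.hdf5lib Java bindings and may differ in other
versions, and error handling is omitted):

    import ncsa.hdf.hdf5lib.H5;
    import ncsa.hdf.hdf5lib.HDF5Constants;

    public class SzipSetup {
        public static void main(String[] args) throws Exception {
            // Dataset creation property list: chunked layout + SZIP filter.
            int dcpl = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
            H5.H5Pset_chunk(dcpl, 2, new long[] { 40000, 1 });
            H5.H5Pset_szip(dcpl, HDF5Constants.H5_SZIP_NN_OPTION_MASK, 8);
            // ...dcpl is then passed to H5.H5Dcreate() when creating the
            // 40000x145 float dataset...
            H5.H5Pclose(dcpl);
        }
    }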

Are these results common? I’m somewhat disappointed; I was expecting a
greater size reduction. I’m not an expert on HDF5 or compression
algorithms, so if anyone can give me ideas on how to improve these
numbers, I’d appreciate it.

Kind regards,

  Ed

No compression -> 22802KB
GZIP (deflate, level 6) -> 21247KB
SZIP -> 22812KB

If you haven't already, you can try using the shuffle filter in
conjunction with deflate. This can significantly improve the compression
ratio you get from gzip-like compressors, especially for floating-point
data. It works by rearranging the data by byte significance; for
example, an array of three 2-byte ints, xyxyxy, becomes xxxyyy. Be sure
to enable the shuffle filter first, and then the deflate filter.
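
In the Java bindings that would look something like the following
(untested sketch; I'm going from the C API names, so double-check
against the javadoc for your version):

    // Order matters: register shuffle before deflate on the dataset
    // creation property list, so bytes are rearranged before gzip runs.
    int dcpl = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
    H5.H5Pset_chunk(dcpl, 2, new long[] { 40000, 1 });
    H5.H5Pset_shuffle(dcpl);
    H5.H5Pset_deflate(dcpl, 6);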

Are these results common?

While the exact results will depend on your particular data, I can
tell you that I routinely see size reductions of 25-50% for
experiment-derived floating point data with deflate, and sometimes
more for integer and string data. However, your data may simply be
very hard to compress, even if it's in a fixed numerical range.

Andrew

Ed,

Is your data by chance oscillatory (like an audio file), with little or
no redundancy between adjacent values? That kind of data will not
compress well with gzip/szip-style algorithms; it's the problem FLAC was
designed for. I am just guessing here, but your data range makes me
think of sin/cos waves.

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Room 130G Engineering and Technology
Department of Geology
Central Michigan University
Mount Pleasant, MI 48859
(989)774-1923
Amateur radio callsign: KG4ULP

Yes, thank you for the advice. I had tried shuffle before, but I was
enabling it _after_ deflate. Apparently it doesn't work that way.

With shuffle before deflate:

GZIP (level 6) -> 19465KB
SZIP returns similar results with or without shuffle.

So I'm getting about a 15% saving. This data is probably just hard to
compress. Even though the values fall into a fixed range, they aren't a
sine/cosine wave. Actually, the fact that the data fits into [-100.0,
+100.0] means that instead of exponents spanning the full
single-precision range of [-126, 127], I would only need exponents up to
6 (since 2^7 = 128 > 100). That saves roughly 1 bit of exponent per
float without losing precision, which is a reduction of 1/32, or ~3%.
GZIP is already doing much better than that.
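
As a quick sanity check (a throwaway snippet, not part of my real code),
the exponent field for a value at the top of my range is indeed only 6:

    // 100.0f = 1.5625 * 2^6, so the biased exponent field holds 133 and
    // the unbiased exponent is 6. Values in [-100, 100] never need a
    // larger exponent, but the field still occupies a full 8 bits.
    int bits = Float.floatToIntBits(100.0f);
    int biasedExp = (bits >>> 23) & 0xFF;                   // 133
    System.out.println("exponent = " + (biasedExp - 127));  // exponent = 6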

Appreciate your help.

Kind regards,

    Ed

If you don't need lossless compression, consider the scale-offset
filter. I use scale-offset + gzip for cloud model data (which admittedly
has lots of zeroes) and get a 10x reduction relative to uncompressed
floats.
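
Something along these lines (just a sketch from memory, so treat the
names as approximate; H5Z_SO_FLOAT_DSCALE with a scale factor of 3 keeps
about 3 decimal digits, so it is lossy):

    // Lossy scale-offset (D-scaling, ~3 decimal digits) followed by gzip.
    int dcpl = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
    H5.H5Pset_chunk(dcpl, 2, new long[] { 40000, 1 });
    H5.H5Pset_scaleoffset(dcpl, HDF5Constants.H5Z_SO_FLOAT_DSCALE, 3);
    H5.H5Pset_deflate(dcpl, 6);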

Leigh
