optimizing compression of compound type data

I'm trying to find the optimal compression for 2- and 3-dimensional matrices of recorded data. These data sets contain data that doesn't change much over time (and time is being used as the first axis). I thought that by using shuffle, I might get better compression, but instead the resulting files were larger than without shuffle.

Is shuffle meant to work with compound types? Are there things I need to be considering in the organization of the axes of the data set in order to better encourage compression?

I would think the szip compressor might do well on this, at least for the 2D
stuff. Not sure about 3D. Have you tried gzip? I always try that to see
where it leads.

I saw a recent email posted to hdf-forum announcing a new compressor
named 'Blosc 1.0'. But the announcement came in relation to PyTables, so
I don't know if it is available as a separate HDF5 filter that you can
just grab and use. And I don't know whether the 'shuffle' you mention is
that compressor.

Regarding data organization to 'encourage' compression, I would think
that if things don't '...change much over time...', then making that
axis the 'slowest varying dimension' of the dataset in storage would be
best.
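
For what it's worth, here is a minimal sketch (mine, with made-up names and
sizes, and a plain double dataset rather than a compound one) of that
arrangement in the C API. In HDF5's default C storage order the first
dimension is the slowest varying, so with time as dims[0] you only need
chunking plus a compression filter on the dataset creation property list.
I use gzip here; szip is worth trying too, though it cannot be applied to
compound datatypes.

/* Sketch only: names and sizes are illustrative, not from the thread. */
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("recorded.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* dims[0] is time, the slowest-varying dimension in C storage order. */
    hsize_t dims[2]  = {100000, 128};
    hsize_t chunk[2] = {1024, 128};

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);                  /* gzip, compression level 6 */

    hid_t dset = H5Dcreate2(file, "/recorded", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}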

That's all I can think of. Hope it helps. Good luck.

Mark

--
Mark C. Miller, Lawrence Livermore National Laboratory

On Tuesday 13 July 2010 17:06:03, John Knutson wrote:

> I'm trying to find the optimal compression for 2- and 3-dimensional
> matrices of recorded data. These data sets contain data that doesn't
> change much over time (and time is being used as the first axis). I
> thought that by using shuffle, I might get better compression, but
> instead the resulting files were larger than without shuffle.

In my experience, shuffle generally does help in reducing your compressed data
sizes, except when it does not ;-) I mean, experimentation is the best way to
check whether shuffle is going to help you or not.

> Is shuffle meant to work with compound types? Are there things I need
> to be considering in the organization of the axes of the data set in
> order to better encourage compression?

Yes, shuffle is designed to work with compound types too. And it works at the
chunk level, so depending on the shape of your chunks and how the data changes
along each dimension of that shape, it *could* have a measurable effect indeed.

Out of curiosity, what is the size of your compound type and what is your chunk size?
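
To make the chunk-level point above concrete, here is a rough sketch of my own
(the record layout, names and sizes are invented): shuffle is just another
filter on the dataset creation property list, applied to each chunk before the
compressor sees it.

/* Sketch only: a toy compound type stands in for the real records. */
#include "hdf5.h"

typedef struct {
    double time;
    double value;
    int    flags;
} record_t;

hid_t make_compressed_dataset(hid_t file)
{
    /* Describe the compound type field by field. */
    hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(rtype, "time",  HOFFSET(record_t, time),  H5T_NATIVE_DOUBLE);
    H5Tinsert(rtype, "value", HOFFSET(record_t, value), H5T_NATIVE_DOUBLE);
    H5Tinsert(rtype, "flags", HOFFSET(record_t, flags), H5T_NATIVE_INT);

    hsize_t dims[1]  = {100000};
    hsize_t chunk[1] = {4096};
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_shuffle(dcpl);    /* byte-shuffle each chunk before it is compressed */
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(file, "/records", rtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(rtype);
    return dset;
}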


--
Francesc Alted

On Tuesday 13 July 2010 17:19:35, Mark Miller wrote:

> I saw a recent email posted to hdf-forum announcing a new compressor
> named 'Blosc 1.0'. But the announcement came in relation to PyTables,
> so I don't know if it is available as a separate HDF5 filter that you
> can just grab and use.

Yup, you can use Blosc in plain HDF5 too. The HDF Group has registered it, so
you can safely use it with your HDF5 files and still be able to retrieve that
data with any other HDF5 tool that includes support for Blosc.

> And I don't know whether the 'shuffle' you mention is that compressor.

No, shuffle is an independent filter that comes integrated with HDF5 itself.
Blosc comes with an optimized version of shuffle too, but it works similarly
to the one in HDF5 (it just runs faster).
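
In case it helps, this is how I would try the registered Blosc filter from
plain C. It is written from memory of the filter's example code, so treat it
as a sketch and double-check the cd_values layout and the register_blosc()
helper against the filter sources that ship with Blosc.

/* Sketch only: verify against the Blosc HDF5 filter sources before use. */
#include "hdf5.h"
#include "blosc_filter.h"   /* ships with the Blosc HDF5 filter; defines
                               FILTER_BLOSC (32001, the code registered with
                               The HDF Group) and register_blosc() */

hid_t make_blosc_dataset(hid_t file, hid_t rtype, hid_t space,
                         int rank, const hsize_t *chunk)
{
    char *version, *date;
    register_blosc(&version, &date);  /* make the filter known to this process */

    /* cd_values[0..3] are reserved for the filter itself;
       [4] = compression level, [5] = whether Blosc's internal shuffle is used. */
    unsigned int cd_values[6] = {0, 0, 0, 0, 5, 1};

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, rank, chunk);
    H5Pset_filter(dcpl, FILTER_BLOSC, H5Z_FLAG_OPTIONAL, 6, cd_values);

    hid_t dset = H5Dcreate2(file, "/records_blosc", rtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    return dset;
}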


--
Francesc Alted

Francesc Alted wrote:

> On Tuesday 13 July 2010 17:06:03, John Knutson wrote:
>
>> Is shuffle meant to work with compound types? Are there things I need
>> to be considering in the organization of the axes of the data set in
>> order to better encourage compression?
>
> Yes, shuffle is designed to work with compound types too. And it works at
> the chunk level, so depending on the shape of your chunks and how the data
> changes along each dimension of that shape, it *could* have a measurable
> effect indeed.
>
> Out of curiosity, what is the size of your compound type and what is your
> chunk size?

The compound types (there are several) are around 100-200 bytes each. The
chunks generally contain between 2K and 4K compound records; that seemed to be
the optimal chunk size based on earlier performance testing, at least as far
as reading and writing performance is concerned. In more detail, a chunk size
might be something like 16 x 2 x 128 in a data set of dimensions 403200 x 2 x 128.

The above chunk size makes sense given the way data is being written into the file, but it might not make as much sense for compression.

Actually, I just thought about this for a bit and realized that I've been (probably needlessly) tying my chunk sizes to the read and write dataspaces. If, as I suspect, they're only loosely coupled, I can keep reading and writing a 16 x 2 x 128 dataspace while using a chunk shape that is more in tune with compression, e.g. 4096 x 1 x 1. I'll have to experiment with that and see what happens.
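
For anyone following along, here is a quick sketch of that decoupling (mine,
with invented names, and plain doubles standing in for the compound type).
The chunk shape lives on the dataset creation property list, while the shape
of each write is just a hyperslab selection on the file dataspace, so the two
can be chosen independently.

/* Sketch only: 4096 x 1 x 1 chunks for compression, 16 x 2 x 128 writes. */
#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    hsize_t dims[3]  = {403200, 2, 128};
    hsize_t chunk[3] = {4096, 1, 1};     /* compression-friendly chunk shape */
    hsize_t count[3] = {16, 2, 128};     /* shape of each write              */
    hsize_t start[3] = {0, 0, 0};        /* first time-block                 */

    hid_t file = H5Fcreate("decoupled.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_shuffle(dcpl);
    H5Pset_deflate(dcpl, 6);

    hid_t fspace = H5Screate_simple(3, dims, NULL);
    hid_t dset   = H5Dcreate2(file, "/recorded", H5T_NATIVE_DOUBLE, fspace,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Select the 16 x 2 x 128 region starting at time index 0 and write it
       from a memory buffer of the same shape. */
    double *buf = calloc(16 * 2 * 128, sizeof(double));
    hid_t mspace = H5Screate_simple(3, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    free(buf);
    H5Sclose(mspace);
    H5Dclose(dset);
    H5Sclose(fspace);
    H5Pclose(dcpl);
    H5Fclose(file);
    return 0;
}

One thing to keep an eye on with the 4096 x 1 x 1 layout: each 16 x 2 x 128
write then touches 2 x 128 = 256 different chunks, so the chunk cache settings
become part of the experiment as well.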