Can a compound dataset be compressed?


#1

Hi guys

I am new to HDF5, and I am using MATLAB to write a structure array into an HDF5 file using a compound datatype.

My structure has 20 fields; most fields are 1-D double arrays (20 x 1), and in some fields each element is itself a 1-D array (e.g. 20 x 1000).

When I tried to define my compound dataset properties with compression = 5:

dcpl_id = H5P.create('H5P_DATASET_CREATE'); % dataset creation property list
H5P.set_deflate(dcpl_id, 5);
dset = H5D.create(H5File, DatasetName, filetype, space, dcpl_id);

I keep getting errors asking me to initialize the chunk size, but I have tried

H5P.set_chunk(dcpl_id, fliplr([20, 1]));
H5P.set_shuffle(dcpl_id); % set the SHUFFLE filter

and it still errors out.

This makes me think a compound dataset cannot be chunked, which would mean it cannot be compressed either.

Is it correct that a compound dataset cannot be compressed, or am I defining it wrong?


#2

Problem resolved.
I am guessing it is because my dataspace is created with
space = H5S.create_simple(1, 20, []);
so the chunk size must be rank 1, e.g. [1]

With the code below, I am able to compress my compound dataset:

dcpl_id = H5P.create('H5P_DATASET_CREATE'); % dataset creation property list
H5P.set_chunk(dcpl_id, [1]);
H5P.set_deflate(dcpl_id, 5);


#3

HDF5 datasets have the following layouts: H5D_COMPACT | H5D_CONTIGUOUS | H5D_CHUNKED | H5D_VIRTUAL, where H5D_CHUNKED stands for a tiled or chunked layout and supports filtering. As it turns out, some filters rearrange data, while others compress it.
Ideally, data compression is a bijective map such that applying the transfer function yields an asymmetry in size: the output is smaller than the input. Practical implementations don't give you this guarantee, though; in fact, the deflate algorithm may end up with the same or a bigger size than you began with. To support this property of compression algorithms, HDF5 filtering provides a mechanism to handle cases where no actual compression was achieved.
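This expansion behavior is easy to see with Python's standard zlib module, which implements the same Deflate algorithm HDF5's deflate filter uses (a standalone sketch, not HDF5 itself):

```python
import os
import zlib

# Deflate can expand incompressible input: for random data it falls back
# to "stored" blocks, so the output is the input plus framing overhead.
random_bytes = os.urandom(100_000)         # essentially incompressible
expanded = zlib.compress(random_bytes, 5)  # level 5, as in the MATLAB post

# Repetitive input, by contrast, shrinks dramatically.
repetitive = b"\x00" * 100_000
shrunk = zlib.compress(repetitive, 5)

print(len(expanded) > len(random_bytes))   # output grew
print(len(shrunk) < len(repetitive))       # output shrank
```

HDF5 handles the first case for you: if a filter fails to shrink a chunk, the chunk can be stored unfiltered.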

To write an effective compressor, one needs to be familiar with the HDF5 API calls, then choose between the two available strategies:

  1. the traditional approach using the built-in filter chain, where all the convenience functions are provided for you, so you can focus on the filter/compressor itself
  2. using the recently provided H5Dwrite_chunk and H5Dread_chunk calls to decouple from the built-in filtering pipeline and use your own.

As with any choice, you are moving on a Pareto front between implementation cost, runtime performance, etc.; all in all, option 2 is riskier, more expensive, and harder to implement. If there is interest, please ask the question: how to implement a custom filter chain based on direct chunk read/write with HDF5.

Returning from that short side trip to chunked datasets: within each chunk there is binary data, and how this data is interpreted by software is encoded in its datatype. The type information indicates the length of each element and some other properties such as byte order, the operations defined over the type, etc. This type information is often opaque to compression algorithms. To give an example, Phil Katz's Deflate doesn't care whether the input is a set of integers, floats, or compound datatypes, whereas fpzip/zfp by LLNL targets single- or double-precision floats in up to 4-D arrays, exploiting their structure.
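The type-opaqueness is worth seeing concretely. In the sketch below (again plain Python zlib, not HDF5), a hypothetical compound record of one int plus one double is serialized and deflated; the compressor sees only a byte stream and neither knows nor cares about the field boundaries:

```python
import struct
import zlib

# Hypothetical compound record: a 4-byte int plus an 8-byte double,
# packed without padding ("<id" = 12 bytes per record).
records = b"".join(struct.pack("<id", i, i * 0.5) for i in range(1000))

# Deflate treats the records as an opaque byte stream.
compressed = zlib.compress(records, 5)
assert zlib.decompress(compressed) == records  # lossless round trip
print(len(records), len(compressed))
```

The compression achieved here comes entirely from byte-level redundancy (e.g. the zero high bytes of small ints), not from any understanding of the record structure.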

To answer the question of whether a compound dataset can be compressed: yes, but the result is a function of the compression algorithm used on the given dataset.
Can you daisy-chain multiple compressors? Maybe you could, but in general you have to examine whether it is justified.
How about multiple filters and a compressor? Yes: reshuffling (pre-processing the data) before applying a final compression often leads to better performance and a higher compression ratio.
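The shuffle-then-compress effect mentioned above can be sketched in plain Python (the `shuffle` helper below is an illustrative reimplementation of the byte-shuffle idea behind H5P.set_shuffle, not HDF5's own code):

```python
import struct
import zlib

def shuffle(buf: bytes, width: int) -> bytes:
    """Byte-shuffle: group byte 0 of every element, then byte 1, and so on,
    so that slowly varying bytes (e.g. float exponents) form long runs."""
    return b"".join(buf[i::width] for i in range(width))

# 10,000 slowly varying doubles: the sign/exponent bytes are nearly
# constant but are interleaved with fast-changing mantissa bytes.
values = struct.pack("<10000d", *[i * 0.001 for i in range(10000)])

plain    = zlib.compress(values, 5)
shuffled = zlib.compress(shuffle(values, 8), 5)
print(len(plain), len(shuffled))
```

On data like this, the shuffled stream typically deflates noticeably tighter than the raw interleaved layout, which is exactly why the shuffle filter is usually placed before deflate in the HDF5 pipeline.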