I am a newbie to HDF5, and I am using MATLAB to write a structure array into an HDF5 file using a compound datatype.
My structure has 20 fields; most fields are 1-D double arrays (20 x 1), and in some fields each element consists of a 1-D array (so the field is 20 x 1000).
When I tried to define my compound dataset properties with deflate compression level 5:
dcpl_id = H5P.create('H5P_DATASET_CREATE'); % define PROPERTY LIST
H5P.set_deflate(dcpl_id, 5);
dset = H5D.create(H5File, DatasetName, filetype, space, dcpl_id);
I keep getting errors asking me to initialize the chunk size.
But I have tried
H5P.set_chunk(dcpl_id, fliplr([20 1]));
H5P.set_shuffle(dcpl_id); % set SHUFFLE FILTER
and it still errors out.
To me, it seems a compound dataset cannot be chunked, which would mean a compound dataset cannot be compressed either.
Is it correct that a compound dataset cannot be compressed, or am I defining it wrong?
There are the following layouts for HDF5 datasets: H5D_COMPACT | H5D_CONTIGUOUS | H5D_CHUNKED | H5D_VIRTUAL. H5D_CHUNKED stands for a tiled or chunked layout and is the one that supports filtering; as it turns out, some of the filters rearrange data, while others compress it.
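To make the layout requirement concrete, here is a minimal MATLAB sketch (a 1-D dataspace of 20 records is assumed, and dcpl_id is just a placeholder name): setting the chunk size switches the property list to the chunked layout, and only then is the deflate filter accepted, which is exactly what the error in the question is asking for.

```matlab
dcpl_id = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl_id, 20);    % chunk rank must match the dataspace rank (here 1-D, 20 records)
H5P.set_deflate(dcpl_id, 5);   % gzip level 5 is now accepted

% H5P.set_chunk has implicitly switched the layout to H5D_CHUNKED
assert(H5P.get_layout(dcpl_id) == H5ML.get_constant_value('H5D_CHUNKED'));
```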
Data compression is a bijective map: after applying the transfer function we hope for an asymmetry in size, with the output smaller than the input. Practical implementations don't give you this guarantee; in fact the deflate algorithm may end up with the same or a bigger size than you began with. To support this property of compression algorithms, HDF5 filtering provides a mechanism to handle the case where no actual compression was achieved.
In order to write an effective compressor one needs to be familiar with the HDF5 API calls, then choose between the two available strategies:
1. the traditional approach using the built-in filter chain, where all the convenience functions are provided to you, so you can focus on the filter/compressor itself (see the sketch after this list);
2. using the recently provided H5Dwrite_chunk and H5Dread_chunk calls to decouple from the built-in filtering pipeline and use your own.
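As a rough sketch of option 1 in MATLAB's low-level interface (option 2's H5Dwrite_chunk/H5Dread_chunk are C API calls and are not shown here), you can check that the built-in filters are available and simply stack them on the dataset creation property list from the sketch above.

```matlab
% Option 1: rely on the built-in filter pipeline (deflate ships with most HDF5 builds)
assert(H5Z.filter_avail('H5Z_FILTER_SHUFFLE'));
assert(H5Z.filter_avail('H5Z_FILTER_DEFLATE'));

% Confirm the deflate filter can both encode (write) and decode (read)
cfg = H5Z.get_filter_info('H5Z_FILTER_DEFLATE');
enc = H5ML.get_constant_value('H5Z_FILTER_CONFIG_ENCODE_ENABLED');
dec = H5ML.get_constant_value('H5Z_FILTER_CONFIG_DECODE_ENABLED');
assert(bitand(double(cfg), double(enc)) > 0 && bitand(double(cfg), double(dec)) > 0);

% Filters are applied in the order they are added to the dcpl
H5P.set_shuffle(dcpl_id);      % byte shuffle first ...
H5P.set_deflate(dcpl_id, 5);   % ... then deflate
```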
As with any choice, you are moving on a Pareto front between implementation cost, runtime performance, etc.; all in all, option 2 is riskier, more expensive and harder to implement. If there is interest, please ask the question: how to implement a custom filter chain based on direct chunk read/write with HDF5.
From this short side trip, let's return to chunked datasets. Within each chunk there is binary data; how this data is interpreted by software is encoded in its datatype, where the type information indicates the length of each element and some other properties such as byte order, the operations defined over the type, etc. This type information is often opaque to compression algorithms. To give you an example, Phil Katz's deflate doesn't care whether the input is a set of integers, floats or compound datatypes, whereas fpzip/zfp by LLNL targets single- or double-precision floats in arrays of up to 4 dimensions, exploiting their structure.
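For a struct array like the one in the question, that type information is just declared member by member. Below is a rough MATLAB sketch (the member names scalar_field and array_field are made up) with one scalar double member and one fixed-size 1000-element array member, exactly the kind of structure deflate happily ignores.

```matlab
% Member types: a plain double and a fixed-size array of 1000 doubles per record
dbl_t = H5T.copy('H5T_NATIVE_DOUBLE');
arr_t = H5T.array_create(dbl_t, 1000);    % 1-D array member of 1000 doubles

% Compound type large enough to hold both members, laid out back to back
sz     = H5T.get_size(dbl_t) + H5T.get_size(arr_t);
cmpd_t = H5T.create('H5T_COMPOUND', sz);
H5T.insert(cmpd_t, 'scalar_field', 0,                   dbl_t);
H5T.insert(cmpd_t, 'array_field',  H5T.get_size(dbl_t), arr_t);
```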
To answer the question whether a compound dataset can be compressed: yes, but the result is a function of the compression algorithm used and of the dataset itself. Can you daisy-chain multiple compressors? Maybe you could, but in general you have to examine whether it is justified. How about multiple filters plus a compressor? Yes, reshuffling (pre-processing the data) before applying a final compressor often leads to better performance and a higher compression ratio.
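Putting it together, here is a minimal end-to-end sketch in MATLAB (the file name, dataset name and member names are placeholders) that creates a chunked compound dataset with shuffle plus deflate and writes 20 records; array members would be added with H5T.array_create as sketched above.

```matlab
nrec = 20;

% File and a 1-D dataspace of 20 records
fid   = H5F.create('compound_demo.h5', 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
space = H5S.create_simple(1, nrec, []);

% Compound type with two scalar double members
dbl_t  = H5T.copy('H5T_NATIVE_DOUBLE');
cmpd_t = H5T.create('H5T_COMPOUND', 2 * H5T.get_size(dbl_t));
H5T.insert(cmpd_t, 'time',  0,                   dbl_t);
H5T.insert(cmpd_t, 'value', H5T.get_size(dbl_t), dbl_t);

% Chunked layout is mandatory for filters; then shuffle + deflate
dcpl = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl, nrec);    % one chunk of 20 records (rank matches the dataspace)
H5P.set_shuffle(dcpl);
H5P.set_deflate(dcpl, 5);

dset = H5D.create(fid, '/measurements', cmpd_t, space, dcpl);

% MATLAB writes compound data from a struct whose fields match the member names
data.time  = (1:nrec)';
data.value = rand(nrec, 1);
H5D.write(dset, cmpd_t, 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', data);

H5D.close(dset); H5P.close(dcpl); H5T.close(cmpd_t);
H5S.close(space); H5F.close(fid);
```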