Writing already compressed data to dataset

Hi HDF5 users,
Does anyone have a clue about the issue below?

Background:
My custom plugin only supports decompression, i.e. the encoder_present flag is set to 0 in struct H5Z_class2_t.
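For reference, this is roughly how the decode-only filter class looks when it is registered (a minimal sketch; the filter ID 1234, the name, and my_decode_filter are placeholders, not the actual plugin code):

static size_t my_decode_filter(unsigned int flags, size_t cd_nelmts,
                               const unsigned int cd_values[], size_t nbytes,
                               size_t *buf_size, void **buf);  // decode-only callback

const H5Z_class2_t MY_FILTER_CLASS = {
    H5Z_CLASS_T_VERS,        // version of this struct
    (H5Z_filter_t)1234,      // filter ID
    0,                       // encoder_present: no encoder
    1,                       // decoder_present: decoder available
    "my decode-only filter", // filter name
    NULL,                    // can_apply callback (optional)
    NULL,                    // set_local callback (optional)
    my_decode_filter         // filter callback
};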

Use case:
I want to create a dataset and specify that the filter plugin with ID 1234 (for example) should be used ONLY for decompression, e.g. by tools like HDFView or the HDF5 command-line tools. On the other hand, while writing the data, I want to write already compressed data (whose size differs from the decompressed size) without invoking the encoder of the custom filter plugin.

Issue:
I end up with the error message “Filter present but encoding is disabled”. The error message itself is clear to me, but I don’t know how to avoid it, because this is exactly my use case.

Below is the sample code:

H5::H5File hdf_file_handle("compression_not_invoked.hdf5", H5F_ACC_TRUNC);
H5::Group group = hdf_file_handle.createGroup("/compressed_group");

hsize_t decompressed_data_dims[2] = {10, 10};
H5::DataSpace dataspace(2, decompressed_data_dims);

unsigned int flags = 0;
H5::DSetCreatPropList cparms;
cparms.setChunk(2, decompressed_data_dims);
cparms.setFilter(1234, flags, 0, NULL);

H5::DataSet dataset = group.createDataSet(
	"test_data", H5::PredType::NATIVE_UINT16, dataspace, cparms);
	
// To be filled with real values later
// data size is less than 10x10, compressed data is smaller than decompressed
unsigned short compressed_data[80] = {};
dataset.write(compressed_data, H5::PredType::NATIVE_UINT16, dataspace);                

hdf_file_handle.close();

Any idea which part of the code is going wrong? I am pretty sure it has something to do with the way dataset.write is invoked, but I am currently clueless.

PS:

  1. In Python, dset.id.write_direct_chunk does avoid invoking the encoder, but the filter must still support encoding. Unfortunately, this does not fit the use case.
  2. For users wondering why I want this approach: I need to validate some metrics on the compressed data, and only then write it into the file.

Have you tried setting flags to H5Z_FLAG_OPTIONAL?

https://support.hdfgroup.org/releases/hdf5/v1_14/v1_14_5/documentation/doxygen/group___o_c_p_l.html#ga191c567ee50b2063979cdef156a768c5

Yes, I already tried with H5Z_FLAG_OPTIONAL, but I end up with the same message “Filter present but encoding is disabled”.
As per the documentation, H5Z_FLAG_OPTIONAL comes in handy when encoding is supported and the filter returns 0 (indicating that no compression should be applied, or that an error occurred and the compression step is skipped). In my use case, I want to disable encoding completely.

I’m not sure your use case fits the HDF5 data model, which is designed to support structured data. Compression needs to be done in a way that is compatible with the structure, so that requests for a small portion of the data can be satisfied efficiently (in both time and space) without having to decompress the full dataset.

I started working on the HDF5 topic a few weeks ago, and at least according to the documentation, HDF5: H5Z_class2_t Struct Reference (hdfgroup.org), it is possible to create a filter that only supports decoding, i.e. the encoded (compressed) data can be written without invoking the filter. But how to do this isn’t clear, and unfortunately it isn’t documented in a straightforward way.

On the other hand, using H5Dwrite_chunk achieves the same functionality, i.e. it skips the encoder, but for that the filter’s encoder property must be enabled. However, my aim is to prevent users from using the plugin for compression at all, so this doesn’t fit the use case either.

Hi @lkshukla,

I believe that this is a use case that wasn’t necessarily thought about, though it makes perfect sense to me to be able to do it and I think it would be a good future expansion of functionality. The encoder_present flag is intended more for the library to be able to properly return errors when an application tries to create/write to a dataset with a filter that doesn’t have encoding enabled. IIRC, this feature came about due to licensing issues with the SZIP filter where users could freely use the filter for decoding data, but may have had to obtain a commercial license for encoding data. It’s an older document, but see Szip Copyright and License Statement in HDF. The use case was that a user could have a version of SZIP on their system that was built without encoding enabled and they could still read compressed data written somewhere else, while the library would throw an error for writes.

In your case, you may be able to simply specify that your filter has encoding enabled and then, in your filter callback, return the nbytes parameter unchanged in the encode case. It’s a bit of a hacky workaround and doesn’t skip invoking the encoder, but it should generally do until we can think more about this use case.
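A minimal sketch of such a pass-through callback (decode_buffer is a placeholder for the actual decompression routine, not a real API):

static size_t pass_through_filter(unsigned int flags, size_t cd_nelmts,
                                  const unsigned int cd_values[], size_t nbytes,
                                  size_t *buf_size, void **buf)
{
    if (flags & H5Z_FLAG_REVERSE) {
        // Decode path: real decompression goes here (placeholder helper).
        return decode_buffer(buf, buf_size, nbytes);
    }
    // Encode path: the buffer already holds compressed bytes, leave it untouched.
    return nbytes;
}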

This use case is closely related to one that @miller86 and I have discussed in the context of zfp compression, and in particular serialization of zfp’s compressed-array classes. In our case, the array data already is stored in memory in compressed form. To write it as an HDF5 file with zfp compression enabled, one has to first decompress the entire array, then feed it through the zfp HDF5 compression filter (H5Z-ZFP) to re-compress it. Not only does this incur a performance overhead and potential generation loss, but additional memory is needed to hold the uncompressed data. The primary reason for using zfp arrays for computations is usually lack of memory. It would be nice if there was a mechanism that allowed bypassing the filter and marking the data as compressed already.

I agree with @pl1’s observations. That said, we do in fact demonstrate and test such an approach in H5Z-ZFP using HDF5’s direct chunk write methods. This approach does burden the data producer somewhat, because there needs to be logic to use H5Dwrite() with the filter normally for uncompressed in-memory data and H5Dwrite_chunk() for the case where it is writing a compressed array. Our tests do confirm, however, that once the data is written, consumers are agnostic to this aspect of things.
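For illustration, a minimal sketch of the direct chunk write path (dset_id, chunk_buf, and compressed_size are assumed to be set up by the caller; H5Dwrite_chunk is the C API entry point):

// Assume dset_id is an open chunked dataset whose creation property list
// registered the filter, and chunk_buf holds one chunk's worth of data
// that is already compressed (compressed_size bytes).
hsize_t  offset[2]   = {0, 0};   // chunk origin in dataset coordinates
uint32_t filter_mask = 0;        // 0 = record all filters as applied

herr_t status = H5Dwrite_chunk(dset_id, H5P_DEFAULT, filter_mask,
                               offset, compressed_size, chunk_buf);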

It might be cool to find a way to make H5Z-ZFP just “do the correct thing” by detecting when it is being handed ZFP compressed-array data and then delegating the work to H5Dwrite_chunk(). Because the H5Z-ZFP filter would be receiving the data on a chunk-by-chunk basis (those are HDF5 chunks), I think this might be possible.

@pl1…is there anything in the ZFP compressed array data that could be used to automatically detect this situation for any arbitrary HDF5 chunk the filter might be handed?

If the data is just the payload of compressed zfp blocks, then I’m afraid the answer is no. However, when serializing a zfp compressed-array object, you also get a short header that contains a magic word, array metadata (scalar type, dimensions), and compression parameters; see this discussion and the code snippet therein. That same header is written by H5Z-ZFP, though I believe the HDF5 output contains additional information written by the filter.
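For illustration, a rough sketch of probing a buffer for such a header using zfp’s C API (assuming zfp_read_header with ZFP_HEADER_FULL; any extra metadata H5Z-ZFP prepends is not accounted for here):

// Returns nonzero if buf (bytes long) starts with a full zfp stream header.
static int looks_like_zfp_stream(void *buf, size_t bytes)
{
    bitstream  *bs    = stream_open(buf, bytes);
    zfp_stream *zfp   = zfp_stream_open(bs);
    zfp_field  *field = zfp_field_alloc();

    // zfp_read_header returns the number of header bits read, 0 on failure.
    int ok = zfp_read_header(zfp, field, ZFP_HEADER_FULL) != 0;

    zfp_field_free(field);
    zfp_stream_close(zfp);
    stream_close(bs);
    return ok;
}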

Then there’s the issue of chunking. Still, we could build the smarts into H5Z-ZFP to “do the right thing” by assembling/writing the set of zfp blocks associated with a chunk. IIRC, each chunk is written with a separate zfp header.

Things get more complicated with zfp’s variable-rate array classes (e.g., zfp::const_array), where the compressed size of blocks varies. zfp indexes those blocks using an efficient coding scheme that could be used to extract the compressed data associated with each block of a chunk.

This is a bit different from just asking the filter to pass through the compressed data verbatim, though that also assumes the application knows how to form a compressed stream identical to the one that would result from writing the data through the filter in uncompressed form. The application would have to know exactly what metadata the filter inserts in addition to the compressed payload.

Hi @jhenderson ,
Thanks for the feedback. I have done exactly that for the time being.
Thanks again for the hint on building only the decode version of the filter.