Zstd filter plugin & dictionary training

werner · April 12, 2021, 4:50pm

I am looking into using @paramon’s zstd filter plugin and am wondering how to make use of zstd’s dictionary feature with it. According to the documentation, dictionary compression gives a large gain when compressing small data:

https://facebook.github.io/zstd/#small-data

Is that mode supported via the HDF5 filter plugin? It would be cool to train a dictionary on data that we already use and thereby improve the filter’s performance. Such a dictionary could be provided as part of the HDF5 file itself, maybe as an attribute?

epourmal · April 28, 2021, 5:00am

Hi Werner,

Thank you for the pointer.

It is my understanding that Zstd compression is supported in BLOSC2. We do test the filter; see BLOSC filter, but we haven’t played with the Zstd compressor.

I agree, it would be interesting to explore how the dictionary can be stored. User block can be an option too.

Elena

werner · April 28, 2021, 7:07am

Hi Elena,

I tried the BLOSC filter a while ago, but its multithreading turned out to be unstable under Windows/Mingw64, so I cannot really use it. It would have been great. We tried to investigate the issue, but it was never resolved.

In the meantime, I got the zstd filter to work. It required replacing free() with H5free_memory(), and likewise malloc() with its HDF5 counterpart; otherwise the zstd filter crashes. (This issue touches another discussion, because it introduces a dependency of the filter on the HDF5 library that is otherwise not needed.)

For the dictionary - it sounds like a very interesting option, but apparently it only gives a boost for small data. Nevertheless, the question remains how to deal with filters that require additional information beyond those few “unsigned int cd_values[]” that are passed to a filter. Pre-training (compression) filters on types of datasets may be quite useful. Do you have a link to documentation on how to access such a user block from a filter?

Werner

byrn · April 28, 2021, 11:55am

If you need to use H5free_memory(), then the HDF5 library must have been built with the internal memory allocation sanity checks enabled. The CMake option “HDF5_MEMORY_ALLOC_SANITY_CHECK” or the autoconf option “--enable-memory-alloc-sanity-check” [default=yes if debug build] must have been used.

werner · April 28, 2021, 1:54pm

Ah, OK. I probably had all debug options switched on, yes. So it’s even worse: the behavior also depends on how HDF5 was compiled, not just on the HDF5 version.

Some time ago I submitted a proposal to have HDF5 provide function pointers for the appropriate free/malloc functions that filter plugins should use. The current situation, which requires different filter binaries for different compilation modes and versions of HDF5, is unsatisfactory.
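
To make the proposal concrete, here is a minimal sketch in plain C of what such a function-pointer table could look like. All names (filter_mem_funcs, default_mem_funcs, toy_filter) are hypothetical; nothing like this is part of the current HDF5 API, and the table is filled with plain malloc()/free() only so that the example runs on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: allocation callbacks that the library would hand to a filter
 * plugin. This is NOT an existing HDF5 interface. */
typedef struct {
    void *(*alloc)(size_t size);
    void  (*release)(void *ptr);
} filter_mem_funcs;

/* Stand-in for what the library would supply; here simply malloc/free. */
const filter_mem_funcs default_mem_funcs = { malloc, free };

/* Toy "filter": copies the input into a buffer obtained from the supplied
 * allocator and releases the old buffer with the matching release function,
 * so the plugin never calls malloc()/free() directly.
 * Returns the number of bytes in the new buffer, or 0 on failure. */
size_t toy_filter(const filter_mem_funcs *mem, size_t nbytes, void **buf)
{
    void *out;

    assert(mem != NULL && buf != NULL);
    if (*buf == NULL || nbytes == 0)
        return 0;

    out = mem->alloc(nbytes);
    if (out == NULL)
        return 0;               /* leave the caller's buffer untouched */

    memcpy(out, *buf, nbytes);
    mem->release(*buf);         /* freed by the same side that allocated it */
    *buf = out;
    return nbytes;
}
```

With such a table, one filter binary would work regardless of whether the library was built with the sanity checks, because allocation and release always go through the same pair of functions.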

epourmal · April 30, 2021, 4:28am

Agreed. We need to fix this issue.

epourmal · April 30, 2021, 4:33am

Since one can put any type of data into the user block, there are no HDF5 APIs for it other than setting and getting the user block size; see H5Pset_userblock and H5Pget_userblock.

BTW, one can use h5jam/h5unjam to add/remove a user block.
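
As a small illustration of the size constraint: HDF5 accepts a user block size of 0 or a power of two that is at least 512 bytes. The sketch below, in plain C without the HDF5 headers, picks the smallest valid size for a payload such as a dictionary. The helper name userblock_size_for is hypothetical; the result would be what an application passes to H5Pset_userblock on the file creation property list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest valid user block size that can hold `payload` bytes.
 * Valid sizes are 0 or a power of two >= 512.
 * Returns 0 for an empty payload, and also 0 if no 64-bit power of two
 * is large enough (hypothetical helper, not an HDF5 function). */
uint64_t userblock_size_for(uint64_t payload)
{
    uint64_t size = 512;

    if (payload == 0)
        return 0;

    while (size < payload) {
        if (size > UINT64_MAX / 2)
            return 0;                    /* doubling would overflow */
        size *= 2;
    }

    assert((size & (size - 1)) == 0);    /* still a power of two */
    return size;
}
```

For example, a 112640-byte dictionary would need a 131072-byte user block under this rule.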

Elena

werner · April 30, 2021, 7:06am

OK, you meant the user block at the beginning of the file. Yes, I would know how to access that (I used it before to embed an HDF5 file into an HTTP stream, allowing the same file to be read from disk or from an internet socket via a web browser).

However, for a compression filter such a filter-specific dictionary should be “closer” to the datasets, because it would depend on the category of data. For instance, I would imagine using a different compression dictionary for RGB data than for floating-point data. Maybe even different dictionaries for RGB data describing images of lakes than for RGB data describing images of mountains. So a better place for such a dictionary would be an attribute on the datasets or, even better, an attribute on the named type that is used for a dataset, such that multiple datasets can share a dictionary without all datasets having to use the same one. The question is just how such information can be passed to a filter.

For the current issue of the zstd filter, such a feature does not really seem to be needed, since the dictionary is apparently only useful for small data. We want to think in big data anyway. But it could be useful for AI-based filters that require some training data. I don’t have experience with those yet, but it seems there are some in active development.

epourmal · April 30, 2021, 1:44pm

Good question! I don’t know. The simplest thing would be to come up with some convention (like CF), or maybe it is time to rework HDF5 filters, or both? Let me find out what other folks at THG think.

lucasvr · May 25, 2021, 2:21pm

The approach I have taken before was to allocate a block of data and provide a pointer to that data through the cd_values[] array. However, the use case I had at hand also required the application to provide data to the filter at read time, but there are no interfaces that currently enable that. My final approach was to simply use setenv() and getenv(). It worked fine to validate my prototype, but that’s not an ideal solution in the long run.
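
The pointer-through-cd_values[] idea can be sketched in plain C roughly as follows. The two-slot layout and the helper names pack_pointer and unpack_pointer are assumptions made for illustration, not part of HDF5 or of any existing filter.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#if UINT_MAX < 0xFFFFFFFF
#error "this sketch assumes unsigned int has at least 32 bits"
#endif

/* Hypothetical layout: slot 0 = low 32 bits, slot 1 = high 32 bits. */
#define PTR_SLOTS 2

/* Application side: split a pointer into two 32-bit halves that fit into
 * the unsigned int cd_values[] array handed to the filter. */
void pack_pointer(const void *ptr, unsigned int cd_values[PTR_SLOTS])
{
    uint64_t bits = (uint64_t)(uintptr_t)ptr;

    cd_values[0] = (unsigned int)(bits & 0xFFFFFFFFu);
    cd_values[1] = (unsigned int)(bits >> 32);
}

/* Filter side: rebuild the pointer from the first two cd_values[] entries.
 * Returns NULL if fewer than PTR_SLOTS values were supplied. */
void *unpack_pointer(size_t cd_nelmts, const unsigned int cd_values[])
{
    uint64_t bits;

    if (cd_nelmts < PTR_SLOTS)
        return NULL;
    assert(cd_values != NULL);

    bits = ((uint64_t)cd_values[1] << 32) | (uint64_t)cd_values[0];
    return (void *)(uintptr_t)bits;
}
```

Note that cd_values[] are saved with the dataset's filter settings, so a pointer packed this way is only meaningful inside the process that wrote it. A reader in another process would get a stale address, which is why a different channel is needed at read time.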
