Zstd filter plugin & dictionary training

I am looking into using @paramon’s HDF5 filter plugin for zstd and am wondering how to make use of zstd’s dictionary feature with it. According to the zstd documentation, dictionary compression gives a huge gain on small data:

https://facebook.github.io/zstd/#small-data

Is that mode supported via the HDF5 filter plugin? It would be cool to train a dictionary on data that we already use and improve the filter’s performance that way. Such a dictionary could be provided as part of the HDF5 file itself, maybe as an attribute?

Hi Werner,

Thank you for the pointer.

It is my understanding that Zstd compression is supported in BLOSC2. We do test that filter (see BLOSC filter), but we haven’t tried its Zstd compressor.

I agree, it would be interesting to explore how the dictionary can be stored. The user block could be an option too.

Elena

Hi Elena,

I tried the BLOSC filter a while ago, but its multithreading turned out to be unstable under Windows/Mingw64, so I cannot really use it. It would have been great. We tried to investigate the issue, but it was never resolved.

In the meantime, I got the zstd filter to work. It required a modification: using H5free_memory() instead of free(), and likewise replacing malloc(); otherwise the zstd filter crashes. (This issue touches another discussion, because it introduces a dependency of the filter on the HDF5 library, which is otherwise not needed.)

As for the dictionary, it sounds like a very interesting option, but apparently it is a boost only for small data. Nevertheless, the question remains how to deal with filters that require additional information beyond the few “unsigned int cd_values[]” that are passed to a filter. Pre-training (compression) filters on types of datasets may be quite useful. Do you have a link to documentation on how to access such a user block from a filter?

Werner

If you need to use H5free_memory(), then the HDF5 library must have been built with the internal memory allocation sanity checks enabled. The CMake option “HDF5_MEMORY_ALLOC_SANITY_CHECK” or the autoconf option “--enable-memory-alloc-sanity-check” [default=yes if debug build] must have been used.

Ah, OK. I probably had all debug options switched on, yes. So it’s even worse: the behavior depends not just on the HDF5 version but also on how HDF5 was compiled.

Some time ago I submitted a proposal to have HDF5 provide function pointers for the appropriate free/malloc functions that filter plugins should use. The current situation, which requires different filter binaries for different compilation modes and versions of HDF5, is unsatisfactory.
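To illustrate the idea of handing allocator function pointers to a filter, here is a minimal sketch in plain C. It is not the submitted proposal and not an existing HDF5 interface: the type filter_allocator_t and the functions filter_copy, host_alloc and host_release are hypothetical names made up for this example. The point is only that the filter never calls malloc()/free() directly, so one filter binary would work with whatever allocator the host library was built with.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: allocator callbacks the host library would hand to a filter. */
typedef struct {
    void *(*alloc)(size_t size);
    void  (*release)(void *ptr);
} filter_allocator_t;

/* Hypothetical filter body: produces its output buffer using only the
 * callbacks it was given, never malloc()/free() directly. */
static void *filter_copy(const filter_allocator_t *mem,
                         const void *src, size_t nbytes)
{
    void *out = mem->alloc(nbytes);
    if (out != NULL)
        memcpy(out, src, nbytes);
    return out;
}

/* Stand-in for the host library's allocator; it counts live blocks so that
 * a mismatched alloc/release pair would be noticed. */
static int live_blocks = 0;

static void *host_alloc(size_t size)
{
    void *ptr = malloc(size);
    if (ptr != NULL)
        live_blocks++;
    return ptr;
}

static void host_release(void *ptr)
{
    if (ptr != NULL) {
        live_blocks--;
        free(ptr);
    }
}
```

A caller would fill a filter_allocator_t with host_alloc and host_release, pass it to filter_copy, and later free the returned buffer through the same table, so allocation and release always come from the same side.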

Agreed. We need to fix this issue.

Since the user block can hold any type of data, there are no HDF5 APIs for it other than setting and getting the user block size; see H5Pset_userblock and H5Pget_userblock.

BTW, one can use h5jam/h5unjam to add/remove a user block.

Elena

OK, you meant the user block at the beginning of the file. Yes, I know how to access that (I used it before to embed an HDF5 file into an HTTP stream, allowing the same file to be read from disk or from an internet socket via a web browser).

However, for a compression filter, such a filter-specific dictionary should be “closer” to the datasets, because it would depend on the category of data. For instance, I would imagine using a different compression dictionary for RGB data than for floating-point data, and maybe even different dictionaries for RGB images of lakes than for RGB images of mountains. So a better place for such a dictionary would be an attribute on the dataset or, even better, an attribute on the named type used for a dataset, so that multiple datasets can share one dictionary without all datasets having to use the same one. But how can such information be passed to a filter?

For the zstd filter itself, such a feature does not seem really needed, since the dictionary is apparently only useful for small data, and we want to think in terms of big data anyway. But it could be useful for AI-based filters that require some training data. I don’t have experience with those yet, but it seems there are some in active development.

Good question! I don’t know :) The simplest thing would be to come up with some convention (like CF), or maybe it is time to rework HDF5 filters, or both? Let me find out what other folks at THG think.

The approach I have taken before was to allocate a block of data and provide a pointer to that data through the cd_values[] array. However, the use case I had at hand also required the application to provide data to the filter at read time, but there are no interfaces that currently enable that. My final approach was to simply use setenv() and getenv(). It worked fine to validate my prototype, but that’s not an ideal solution in the long run.
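The two workarounds described above can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions, not the code from that prototype: the helpers pack_pointer, unpack_pointer and dict_path_from_env and the environment variable name ZSTD_FILTER_DICT are hypothetical. Note also that cd_values[] are stored with the dataset's filter pipeline in the file, so a pointer packed this way is only meaningful inside the process that wrote it, which is presumably why the read side needed a different channel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Number of unsigned int slots needed to hold one pointer
 * (typically 2 on a 64-bit platform with 32-bit unsigned int). */
#define PTR_SLOTS \
    ((sizeof(void *) + sizeof(unsigned int) - 1) / sizeof(unsigned int))

/* Copy the bytes of a pointer into cd_values[offset .. offset+PTR_SLOTS-1]. */
static void pack_pointer(const void *ptr, unsigned int *cd_values,
                         size_t offset)
{
    memset(&cd_values[offset], 0, PTR_SLOTS * sizeof(unsigned int));
    memcpy(&cd_values[offset], &ptr, sizeof ptr);
}

/* Recover the pointer inside the filter. Only valid in the same process
 * that packed it; a value read back from a file would be stale. */
static const void *unpack_pointer(const unsigned int *cd_values,
                                  size_t offset)
{
    const void *ptr = NULL;
    memcpy(&ptr, &cd_values[offset], sizeof ptr);
    return ptr;
}

/* Read-side fallback: take the dictionary location from the environment.
 * ZSTD_FILTER_DICT is a made-up variable name for this sketch. */
static const char *dict_path_from_env(const char *fallback)
{
    const char *path = getenv("ZSTD_FILTER_DICT");
    return (path != NULL && path[0] != '\0') ? path : fallback;
}
```

On the write side, the application would reserve PTR_SLOTS extra entries in its cd_values[] array, call pack_pointer() before setting the filter on the dataset creation property list, and the filter would call unpack_pointer() with the same offset.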