Image compression and ZSTD trained dictionary question

I’ve only been studying HDF5 for a few days, so sorry if these questions are basic.
I’m experimenting with how well HDF5 can compress images compared to other containers.
Using OpenCV, I take the raw pixel data (RGB or RGBA) from many images (thousands) and save it as uncompressed datasets in an H5 file with the H5IMmake_image_24bit() function.
The resulting file looks like this:

h5dump file.h5:

HDF5 "file.h5" {
GROUP "/" {
DATASET "image1.png" {
DATASPACE SIMPLE { ( 256, 256, 3 ) / ( 256, 256, 3 ) }
(0,0,0): 223, 211, 170,
(0,1,0): 223, 211, 170,
(0,2,0): 223, 211, 170,

DATASET "image2.png" {

Then I try to compress it using the h5repack utility and the ZSTD filter as shown below:
h5repack -l CHUNK=60x60x3 -f UD=32015,10 file.h5 file_zstd.h5

As a result, I get worse compression than if I pack the same raw RGB data files into a TAR+ZSTD archive:
TAR.ZSTD Size / H5 Size ~ 0.72

I understand that most likely I will not get the same compression ratio that tar+zstd gives,
but perhaps there are ways to improve compression in H5?

Perhaps the best compression would come from using ZSTD with a trained dictionary.
Is it possible to use a trained dictionary with the zstd plugin, or is such work perhaps already underway?
I would be glad of any ideas that help me compress my data better inside HDF5.

@victor.ustynov, chunking is most efficient when the chunk size divides the full array evenly, without leaving partially filled chunks along the edges. Also, larger chunks are more efficient than smaller ones. Try CHUNK=256x256x3 for your test case.

Also, it may be important to use one of the most recent HDF5 versions. In particular, HDF5 1.8 versions did not handle edge chunks efficiently.
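
To make the chunking advice above concrete, here is a small Python sketch of the arithmetic for the 256x256x3 test image. The helper name chunk_layout is made up for illustration and is not part of HDF5; it only counts chunks and assumes that partial edge chunks are allocated at full chunk size before the filter runs.

```python
import math

def chunk_layout(shape, chunk):
    """Return (number of chunks, elements allocated, elements of real data).

    Hypothetical helper for illustration only; it assumes every chunk,
    including partial edge chunks, is allocated at the full chunk size.
    """
    counts = [math.ceil(s / c) for s, c in zip(shape, chunk)]
    n_chunks = math.prod(counts)
    allocated = n_chunks * math.prod(chunk)
    return n_chunks, allocated, math.prod(shape)

for chunk in [(60, 60, 3), (256, 256, 3)]:
    n, allocated, real = chunk_layout((256, 256, 3), chunk)
    print(chunk, n, allocated, real, round(allocated / real, 2))
```

With CHUNK=60x60x3 the image is split into 25 chunks covering about 1.37 times as many elements as the image holds, and each chunk is compressed separately. With CHUNK=256x256x3 there is exactly one chunk and no padding. The padding itself usually compresses well, so the real cost is likely to come more from the number of small, separately compressed chunks than from the padded bytes.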

@dave.allured Thank you for the answer.
My images have different sizes. When I created a group of images of the same size and set the chunk size equal to the image size, it did help.
However, the total HDF5 file size is still larger than when I compress the data with tar+zstd.
I think a trained zstd dictionary could make a difference, but it looks like the HDF5 API doesn’t support one, or I don’t know about it.

Okay. Setting the chunk size equal to the full image size, for each image individually, is a good strategy. This will avoid some chunk overhead and give you the best result in terms of file structure.

There will always be some overhead for HDF5 internal structure. In return, you get advantages such as random access, the ability to add attributes, and group structuring if you would like that. In contrast, the tar format is maximally squeezed, allowing room only for the image file name and the compressed data block.

For a trained dictionary, perhaps you could modify the existing zstd filter into a custom filter. Others who know more about zstd may have better advice.
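
As a rough illustration of why a dictionary might help, here is a minimal Python sketch. It rests on several assumptions: zlib with a preset dictionary stands in for zstd (only so that the example runs with the standard library), the data is synthetic rather than real images, and the dictionary is simply the shared pattern itself rather than one trained from samples (for example with zstd --train). It compares three cases: blocks compressed independently, as an HDF5 filter does per chunk; all blocks in one stream, as tar+zstd does; and blocks compressed independently with a shared dictionary.

```python
import random
import zlib

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic stand-in for "many similar images": every block is the same
# 1 KiB random pattern with 16 bytes overwritten. This is not real image data.
base = bytes(random.randrange(256) for _ in range(1024))

def make_block():
    block = bytearray(base)
    for _ in range(16):
        block[random.randrange(len(block))] = random.randrange(256)
    return bytes(block)

blocks = [make_block() for _ in range(50)]

# 1) Each block compressed on its own, as a per-chunk filter would do.
independent = sum(len(zlib.compress(b, 9)) for b in blocks)

# 2) All blocks compressed as one stream, as tar plus a compressor would do.
solid = len(zlib.compress(b"".join(blocks), 9))

# 3) Each block compressed on its own, but with a shared preset dictionary.
#    Assumption: the dictionary is just the common pattern, not a trained one.
def compress_with_dict(data, dictionary):
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()

def decompress_with_dict(blob, dictionary):
    # The reader needs exactly the same dictionary to decode the data.
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()

with_dict = sum(len(compress_with_dict(b, base)) for b in blocks)

print("independent:", independent)
print("one stream: ", solid)
print("with dict:  ", with_dict)
```

On this artificial data, the independent case comes out far larger than the other two, because each block on its own has nothing to match against. Real images will behave less dramatically, so treat this only as a picture of the mechanism. It also shows the practical catch for a custom HDF5 filter: the same dictionary has to be available at read time, so it would have to be stored with the file or shipped alongside the filter.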