Optimize compression performance

I have a very simple prototype that is mostly based on examples from the “Learning the Basics” chapter, in particular the C++ version of “Create a chunked and compressed dataset”. Right now, this prototype is severely bottlenecked by chunk compression.

With the setup I have at hand, a 50 GB sample dataset can be written uncompressed in around 40 seconds, but writing the same dataset with byteshuffle and level-1 deflate takes almost 4 minutes. The difference between the built-in shuffle/deflate and the blosc2 filter plugin is measurable, but nowhere near an order of magnitude.

I found the document HDF5ImprovingIOPerformanceCompressedDatasets.pdf, which mostly suggests optimizing chunk sizes. While I concede that my chunk sizes aren’t particularly well tuned, that doesn’t appear to be the main issue. The main issue is more likely that the chunks are processed sequentially (compress/write 1 → compress/write 2 → compress/write 3 …), so the available CPU cores are not being utilized.

For writing a single chunk I currently use H5::DataSpace::selectHyperslab followed by H5::DataSet::write.
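
In simplified form, that per-chunk path looks roughly like the sketch below; the dataset name, dimensions, and chunk shape are placeholders for illustration, not my actual values.

```cpp
#include <H5Cpp.h>
#include <vector>

// Simplified sketch of the current sequential write path.
// Dataset name, extents and chunk shape are placeholders.
int main() {
    const hsize_t ROWS = 1 << 20, COLS = 256;   // full dataset extent (placeholder)
    const hsize_t CHUNK_ROWS = 4096;            // rows per chunk (placeholder)

    H5::H5File file("sample.h5", H5F_ACC_TRUNC);

    hsize_t dims[2]  = {ROWS, COLS};
    hsize_t chunk[2] = {CHUNK_ROWS, COLS};

    H5::DSetCreatPropList dcpl;
    dcpl.setChunk(2, chunk);
    dcpl.setShuffle();      // byteshuffle
    dcpl.setDeflate(1);     // level-1 deflate

    H5::DataSpace file_space(2, dims);
    H5::DataSet dset = file.createDataSet(
        "data", H5::PredType::NATIVE_FLOAT, file_space, dcpl);

    std::vector<float> buffer(CHUNK_ROWS * COLS, 0.0f);
    H5::DataSpace mem_space(2, chunk);

    // Chunks are compressed and written strictly one after another.
    for (hsize_t row = 0; row < ROWS; row += CHUNK_ROWS) {
        hsize_t offset[2] = {row, 0};
        file_space.selectHyperslab(H5S_SELECT_SET, chunk, offset);
        dset.write(buffer.data(), H5::PredType::NATIVE_FLOAT,
                   mem_space, file_space);
    }
    return 0;
}
```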

Some ideas:

  1. Do the compression/decompression manually and use direct chunk access via H5Dwrite_chunk / H5Dread_chunk. This approach was suggested by someone from The HDF Group staff ten years ago, see queuing-chunks-for-compression-and-writing. I am not sure whether that advice is still current.
  2. Call selectHyperslab and write from multiple threads. Intuitively this feels suspect: I would expect the library to enforce thread safety in a way that probably prevents this approach from delivering a large performance benefit.
  3. Use the Parallel HDF5 approach and feed into it from multiple threads. This seems to be targeted primarily at multi-process applications and as such looks like overkill for a single-process application, but it appears feasible in principle. The extent of the potential performance benefit is again unclear.

If anyone can share experiences or recommendations, it would be greatly appreciated.

Idea 1 is probably your best bet. This way, as long as your compression library is multithreaded, you can get full concurrency on compression. The actual I/O will be serialized, but that generally happens anyway unless you’re on a parallel filesystem. Of course, as with any use of H5Dwrite_chunk(), you need to take care to compress the data and set the cd_values correctly so the data remains readable later.
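
A minimal, single-threaded sketch of what that could look like, assuming a deflate-only pipeline and plain zlib (names, dimensions, and the deflate-only choice are illustrative assumptions): in a real program the compress2() calls are what you would spread across worker threads, while the H5Dwrite_chunk() calls stay on one thread.

```cpp
#include <H5Cpp.h>
#include <hdf5.h>
#include <zlib.h>
#include <vector>

// Sketch of idea 1: compress each chunk with zlib ourselves and hand the
// already-compressed bytes to HDF5 via H5Dwrite_chunk(). The dataset is
// created with a deflate(1) pipeline so the file stays readable through
// the normal read path later. Names and dimensions are placeholders.
int main() {
    const hsize_t ROWS = 1 << 20, COLS = 256;
    const hsize_t CHUNK_ROWS = 4096;

    H5::H5File file("sample.h5", H5F_ACC_TRUNC);

    hsize_t dims[2]  = {ROWS, COLS};
    hsize_t chunk[2] = {CHUNK_ROWS, COLS};

    H5::DSetCreatPropList dcpl;
    dcpl.setChunk(2, chunk);
    dcpl.setDeflate(1);   // registers the filter (cd_values); we compress ourselves

    H5::DataSpace file_space(2, dims);
    H5::DataSet dset = file.createDataSet(
        "data", H5::PredType::NATIVE_FLOAT, file_space, dcpl);

    std::vector<float> raw(CHUNK_ROWS * COLS, 0.0f);
    const uLong raw_bytes = static_cast<uLong>(raw.size() * sizeof(float));
    std::vector<Bytef> compressed(compressBound(raw_bytes));

    for (hsize_t row = 0; row < ROWS; row += CHUNK_ROWS) {
        // This compress2() call is the part you would farm out to worker
        // threads, one chunk per task.
        uLongf comp_bytes = compressed.size();
        compress2(compressed.data(), &comp_bytes,
                  reinterpret_cast<const Bytef*>(raw.data()), raw_bytes, 1);

        // Serialized write of the pre-compressed chunk. Filter mask 0 means
        // "all pipeline filters (here: deflate) were applied to this chunk".
        hsize_t offset[2] = {row, 0};
        H5Dwrite_chunk(dset.getId(), H5P_DEFAULT, 0,
                       offset, comp_bytes, compressed.data());
    }
    return 0;
}
```

If you keep shuffle in the pipeline, you would also have to apply the shuffle yourself before deflating (or mark it as skipped via the filter mask), and whatever filter parameters you register at dataset creation need to describe what was actually done to the bytes.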

Idea 2 wouldn’t work currently. If you build with thread safety enabled, which is required for multithreaded access, only one thread is allowed into the library at a time, which prevents any performance benefit from multithreading. If you build with thread safety disabled, multithreaded access will almost certainly cause errors.

Idea 3 would work, but you would need to convert your program to use multiple processes. You could get a benefit from parallelizing the compression, but the inter-process communication required and the restrictions placed on HDF5 access may outweigh that benefit.


Thanks a lot! That’s pretty much what I expected, and it’s good to know I am not completely on the wrong track. I will dive into direct chunk access and see how that turns out.
