Thread-parallel compression filters? - Feature request

Hi all,

I have a large production application that uses MPI + threads. The i/o pattern is either pure parallel-hdf5 or one file/rank (in a similar way to the VFD interface but done by hand). I make heavy use of compression filters.

As my application is MPI + threads, only one thread per MPI rank actually does all the i/o work. In an uncompressed scenario, that is fine as I can then saturate the Infiniband cards with my 2 (or 4) MPI ranks per node. However, when compressing, this gets a lot slower. Only one thread does the compression and this is now the clear bottleneck of the whole operation.
It seems a bit silly as I have many idling threads which could all be drafted in for parallel compression.

Would it hence be possible to see the use of pigz over plain gzip in future releases?

Note that I’d like to avoid creating my own custom filter as I’d then have to distribute it to collaborators using simulation results. That is too high a barrier to access for many, unfortunately.

There are a few applications now that have implemented thread parallel compression and decompression:

One trick is to use H5Dread_chunk or H5Dwrite_chunk. This will allow you to read or write the chunk directly in its compressed form. You can then set up the thread-parallel compression or decompression yourself.
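As a rough sketch of that trick from Python (assuming h5py and NumPy are available; `compress_chunks` and `write_compressed` are made-up helper names, not part of any library): as far as I know the built-in gzip filter stores each chunk as a plain zlib stream, so chunks deflated with `zlib.compress` in a thread pool and written through h5py's low-level `write_direct_chunk` should look like ordinary gzip-filtered chunks to a stock HDF5 build.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(raw_chunks, level=4, workers=4):
    """Deflate each raw chunk buffer in its own task.

    zlib releases the GIL while compressing, so the worker threads can
    run concurrently. Output order matches input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda buf: zlib.compress(buf, level), raw_chunks))


def write_compressed(path, name, data, chunk_rows, level=4, workers=4):
    """Hypothetical helper: write a 2-D array as pre-compressed chunks.

    Assumes h5py and NumPy are installed, that data.shape[0] is a
    multiple of chunk_rows, and that gzip is the only filter in use.
    """
    import h5py  # imported lazily; only needed for the actual write
    import numpy as np

    data = np.ascontiguousarray(data)
    starts = range(0, data.shape[0], chunk_rows)
    # Each chunk spans chunk_rows full rows, so a row slice is one chunk.
    raw = [data[s:s + chunk_rows].tobytes() for s in starts]
    packed = compress_chunks(raw, level, workers)

    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            name, shape=data.shape, dtype=data.dtype,
            chunks=(chunk_rows, data.shape[1]),
            compression="gzip", compression_opts=level,
        )
        for s, blob in zip(starts, packed):
            # The offset is the chunk's first element in dataset coordinates.
            dset.id.write_direct_chunk((s, 0), blob)
```

This only holds when gzip is the sole filter in the pipeline. With shuffle or scale-offset in front of it, those transforms would have to be applied to each buffer by hand before the deflate step.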

Another approach is using H5Dget_chunk_info to query the location of a chunk within the file. H5Dchunk_iter provides a faster way to do this, particularly if you want to get this information for all the chunks, but this is a relatively new API function.
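A minimal sketch of this second approach, again from Python: `chunk_locations` and `inflate_chunks` are hypothetical helper names, and the sketch assumes h5py 3.x (whose low-level `get_chunk_info` wraps H5Dget_chunk_info), a dataset whose only filter is gzip, every chunk written with a filter mask of 0, and a file without a user block so that the reported address can be used as an absolute file offset.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def chunk_locations(path, name):
    """Return (byte_offset, size) for every allocated chunk of a dataset.

    Each info record also carries chunk_offset (the chunk's position in
    dataset coordinates) and filter_mask, which are dropped here.
    """
    import h5py  # imported lazily; only needed for the lookup

    with h5py.File(path, "r") as f:
        dsid = f[name].id
        infos = [dsid.get_chunk_info(i) for i in range(dsid.get_num_chunks())]
    return [(info.byte_offset, info.size) for info in infos]


def inflate_chunks(path, locations, workers=4):
    """Read each compressed chunk straight from the file and inflate it.

    Every task opens its own file handle, so no seek position is shared
    between threads. Results come back in the order of `locations`.
    """
    def load(location):
        byte_offset, size = location
        with open(path, "rb") as fh:
            fh.seek(byte_offset)
            return zlib.decompress(fh.read(size))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load, locations))
```

Typical use would be `inflate_chunks(path, chunk_locations(path, name))`, after which each buffer still has to be copied into place using the corresponding chunk offset.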

The source for many of the filters is located in the following repository.

For example, the code for the Zstd filter is here:

From the source code there, you can see it simply uses ZSTD_decompress or ZSTD_compress.

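It would be pretty easy to swap that out for a version that uses an explicit ZSTD_CCtx and sets the ZSTD_c_nbWorkers parameter to use multiple threads per chunk. Note that, as far as I know, ZSTD_compressCCtx ignores advanced parameters, so the call would need to be ZSTD_compress2 after ZSTD_CCtx_setParameter, and libzstd has to be built with multithreading support; ZSTD_decompressDCtx still decompresses a frame on a single thread. However, I suspect that having multiple threads deal with individual chunks may be more efficient. This depends on your chunking scheme.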

Thanks! Good to see that the need is not unique indeed.

I’d worry though that this would lead to people using my data needing a custom filter (and hence a custom h5py) to make use of the data we produce. I’d have to somehow fudge things to make each chunk look like it was compressed with the vanilla gzip filter. Seems a bit of a hack.

Also, I am exploiting some DScale filters prior to gzip so all this would need to be done by hand too, which seems like a lot of risky work to do for a library outsider.

The question is how people are obtaining h5py. Most people obtain h5py from either pip or conda.

PyPI (via pip)

Conda-Forge (conda)

The source repository for those packages is maintained by the synchrotron community via the silx project.

Additionally, zstandard is an open source project maintained by Meta (formerly Facebook):

These third party filters are generally open, free, and registered:

The HDF Group maintains a repository that collects the source code of these filters:

There is an ongoing conversation along these lines in another thread: