Thread-parallel compression in MPI + X context

Hi all,

In many modern applications the parallelisation is of the form MPI + X (with X == OpenMP, pthreads, …), typically with one MPI rank per compute node.
When using parallel HDF5 in this setup, only one core per node writes whilst the rest idle. This is normally not a problem, as the writes are I/O- or communication-bound. However, if I try to use gzip compression, then using only one core is very much a waste.
Is there a plan to make use of a thread-parallel implementation of gzip to speed up exactly that step? At the moment, compression in parallel is prohibitively expensive unless one MPI rank per compute core is used, which goes against modern HPC design and future plans to scale things up.

If there are no such plans, would you see an option to use custom filters to do this ourselves? And possibly abuse the system to make it look like regular gzip was applied, given that the decompressed result is identical whether the compression ran in serial or in parallel?
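For what it's worth, the pigz approach shows that a thread-parallel deflate can produce a stream that any standard inflater decodes in one pass: each thread compresses its block independently and ends it with a full flush, so the blocks concatenate into one valid deflate stream. A minimal sketch of that idea, using Python's stdlib `zlib` (which wraps the same library; the block size and function names here are my own, and a real HDF5 custom filter would wrap this logic in a `H5Zregister`-ed callback in C). Note the output is a valid stream but not bit-identical to a single-threaded deflate:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size; pigz uses 128 KiB by default.
BLOCK = 128 * 1024

def _deflate_block(block: bytes) -> bytes:
    # Raw deflate (wbits=-15) ended with a full flush: the output is
    # byte-aligned and carries no final-block marker, so independently
    # compressed blocks can simply be concatenated.
    c = zlib.compressobj(level=6, wbits=-15)
    return c.compress(block) + c.flush(zlib.Z_FULL_FLUSH)

def parallel_deflate(data: bytes, workers: int = 4) -> bytes:
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    # CPython's zlib releases the GIL while compressing, so plain
    # threads give real parallelism here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(_deflate_block, blocks))
    # Terminate the stream with an empty final deflate block.
    tail = zlib.compressobj(wbits=-15).flush(zlib.Z_FINISH)
    return b"".join(parts) + tail

def inflate(stream: bytes) -> bytes:
    # A single, ordinary inflate call decodes the whole stream.
    return zlib.decompressobj(wbits=-15).decompress(stream)
```

The price of the per-block dictionary reset is a slightly worse ratio than serial gzip, but the decompression side needs no changes at all.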

Thanks! Any ideas and suggestions welcome.

If C++ is an option for you, H5CPP does have stub code for thread-level parallel compressors, based on a BLAS-style blocking mechanism. The framework supports blocking to cache size and chaining of filters, such that the input uses one half of the cache and the output goes to the other half; the halves then flip for the next filter. In fact, the H5CPP I/O layers use this blocking mechanism by default and then delegate to the direct chunk I/O available in HDF5 >= 1.10.4, bypassing the C API subsetting code path when it is not needed.
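To make the flip-flop buffering concrete, here is a toy sketch (not H5CPP's actual API; the filter signatures and the XOR "filter" are invented for illustration): two buffers, each sized to half the cache, where every filter reads from one and writes to the other, and the roles swap between stages.

```python
import zlib

def xor_filter(src: bytearray, n: int, dst: bytearray) -> int:
    # Toy in-cache transform: XOR each byte; same length in and out.
    for i in range(n):
        dst[i] = src[i] ^ 0x5A
    return n

def deflate_filter(src: bytearray, n: int, dst: bytearray) -> int:
    # Final stage: compress the current buffer into the other half.
    out = zlib.compress(bytes(src[:n]))
    dst[:len(out)] = out
    return len(out)

def run_chain(block: bytes, filters, cache_size: int = 1 << 20) -> bytes:
    # Two buffers, each half the "cache": a filter consumes one half
    # and produces into the other, then the halves flip.
    half = cache_size // 2
    buf_a, buf_b = bytearray(half), bytearray(half)
    n = len(block)
    buf_a[:n] = block
    src, dst = buf_a, buf_b
    for f in filters:
        n = f(src, n, dst)
        src, dst = dst, src   # flip: output becomes the next input
    return bytes(src[:n])
```

The point of the scheme is that the whole pipeline stays cache-resident: no stage allocates, and no data leaves the two halves until the chain is done.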

However, I haven’t tested this with parallel HDF5 yet.


Thanks for the suggestion. Unfortunately, C++ isn’t really an option for us. (Although I guess we could wrap it somehow into our C code…)

Do you know whether what you describe for C++ will be ported more widely into the mainline C HDF5 library?