Support for more complex compression filters


#1

Hi HDF folks!

I’m the primary developer for libpressio, a generic compression abstraction for dense multi-dimensional arrays that supports a number of error-bounded lossy and lossless compressors. I’ve been working on a filter plugin for HDF5, and I have a few questions about the filter API:

  1. Some of the compressors that libpressio supports require non-serializable parameters. Good examples are cudaStream_t or MPI_Comm, which aren’t guaranteed to be serializable; they are used by compressors that support CUDA or MPI to control which resources are allocated to a particular task. How should this be done using HDF5 filter plugins?

  2. I couldn’t find any documentation about how the multi-threaded/multi-MPI-rank parts of HDF5 interact with filters. Do I need to write my filters to be thread safe? Can I spawn multiple threads in the filter? The specific case I am concerned about is SZ (a compressor which already has a limited HDF5 filter): it uses global state for the compressor settings, which, if modified to different settings by different threads, could create a time-of-check vs time-of-use bug. If using HDF5 in a multi-threaded manner, do I need to insert mutexes to protect this state from where it is set in _set_local to where it is used in the _filter function? Likewise, is there a supported method for having an HDF5 filter that uses multiple MPI ranks?

  3. Is there a limit on the number of cd_values? If so, what is it officially? I saw this post in the forum, but I didn’t see an answer. I don’t have the problem the author has with including pointers in the cd_values, since I discard the opaque pointer options for now.


#2

An example of how I implement a gzip filter with compress2 in C++:

	// requires <zlib.h>; 'n' is the capacity of 'dst', params[0] is the compression level
	inline size_t gzip( void* dst, const void* src, size_t size, unsigned flags, size_t n, const unsigned params[]){
		uLongf nbytes = n;
		int ret = compress2( (Bytef*)dst, &nbytes, (const Bytef*)src, size, params[0]);
		if( ret != Z_OK ) return 0; // zero signals failure, as HDF5 filters do
		return nbytes;
	}

In order to make it multi-threaded, I added a custom filter pipeline based on a BLAS level 3 blocking algorithm to take control of the data flow. You then need synchronisation primitives to make sure you are writing from a single thread – reading can be more relaxed. The final step is to call H5Dwrite_chunk(dset_id, dxpl_id, filters, offset, data_size, buf), which only cares about the coordinates, what filter you used, etc…

IMHO: the parallelism of the compressor/decompressor is an implementation detail decided at runtime; after all, you don’t know where your software will run.
The Intel MKL BLAS controls thread-level parallelism with environment variables:

export MKL_NUM_THREADS=48
export OMP_NUM_THREADS=48
  • if the variable is set: use that number of threads
  • if not set: use all cores, or just a single one

As for MPI: are you planning to distribute the chunk, compress it, then save it on each participating rank? Compression effectiveness may be inversely proportional to chunk size.


#3

Thanks @steven for your prompt response.

In order to make it multi-threaded, I added a custom filter pipeline based on a BLAS level 3 blocking algorithm to take control of the data flow. You then need synchronisation primitives to make sure you are writing from a single thread – reading can be more relaxed. The final step is to call H5Dwrite_chunk(dset_id, dxpl_id, filters, offset, data_size, buf), which only cares about the coordinates, what filter you used, etc…

I think I understand what you are saying here, but you’ll have to forgive me, I’m less familiar with HDF5. I looked at the H5Zdeflate.c example, and I don’t see this in the code. I assume you did this in your application code, but I am trying to express parallelism in the filter itself. Additionally, when I look at H5Dread_chunk I see the following remark: “Also note that H5D_READ_CHUNK and H5D_WRITE_CHUNK are not supported under parallel” (emphasis mine). For the compressors that I support with libpressio, MPI and parallel HDF5 are more important than threads, but I want to support both if I can.

What I was asking is: “what guarantees does my filter plugin need to provide to HDF5?” Do I need to promise thread safety? Do I need to protect against multiple threads interleaving calls to _set_local and _filter, etc.?

As for MPI: are you planning to distribute the chunk, compress it, then save it on each participating rank?

That is definitely one option, but not the one I was currently considering. One of the compressors for libpressio is called libpressio-opt: an optimizing meta-compressor which does a parallel search, using MPI, for the configuration of an underlying lossy or lossless compressor that maximizes some objective (typically compression ratio, sometimes compression speed) while maintaining some user-defined quality standard (e.g. that the Kolmogorov–Smirnov test p-value between the compressed and decompressed data for each chunk is not significant, or that the Peak Signal-to-Noise Ratio is at least some level in dB). For performance reasons, this uses MPI parallelization.

Compression effectiveness may be inversely proportional to chunk size.

Of course; it almost always is.

IMHO: the parallelism of the compressor/decompressor is an implementation detail decided at runtime; after all, you don’t know where your software will run.

I completely agree. LibPressio models compressors as a tree-based hierarchy where nodes are compressors which share parallel resources (i.e. threads, MPI ranks, CUDA devices, etc…), so that users can control the contention over parallel resources between the various compressors. Users choose which level of the hierarchy to allocate the provided threads or MPI ranks to, using directives like pressio:nthreads, distributed:mpi_comm, and distributed:n_worker_groups stored in a pressio_options structure. I’m not trying to make these decisions on my own; I’m trying to figure out how to honor the requests that users make via this facility when they use LibPressio through HDF5 filters.

I agree these are probably not things to persist to cd_values, because cd_values persist to when we read the data back, and by then the machine and resources might have changed.

The Intel MKL BLAS controls thread-level parallelism with environment variables

I know this is a popular method to control threading, but it doesn’t generalize well to all of the kinds of parallelism used by some applications. For example, in an MPI application, the user might programmatically allocate an MPI_Comm subcommunicator and provide that to libpressio-opt to do compression with.


#4

Hi Robert,

H5CPP is an MPI-friendly, header-only HDF5 library for modern C++. If you are interested in plugging your libPressio in, I can walk you through how you can do it, step by step.

H5CPP comes with the mechanism: custom pipeline, custom DAPL, etc… – all you have to provide is the algorithm. Let’s see what others have to say about the C API.

best: steve


#5

@steven your library looks really easy to use, and I will probably use it for personal projects, but I can’t adopt it into libpressio. I need to support GCC 4.8.5 for one of the main users of my library, which largely keeps me at C++11. I’ve backported several C++17 library features from libcxx in a library that I call libStdCompat, but I can’t adopt C++17 without breaking their build.


#6

Certainly. I did the compat thingy and it proved to be infeasible.

You can develop your mechanism from scratch – probably you will have to subsidise the development from your own time, as it is non-trivial. On the upside, you have full control.

Or… compile the code block with C++17, then export it as extern "C":

extern "C" {
 // your public api comes here...
}

best: steve