Thanks @steven for your prompt response.
> In order to make it multi-threaded, I added a custom filter pipeline based on a BLAS level 3 blocking algorithm to control the data flow. You then need synchronisation primitives to make sure you are writing from a single thread – reading can be more relaxed. The final step is to call `H5Dwrite_chunk(dset_id, dxpl_id, filters, offset, data_size, buf)`, which only cares about the coordinates, what filter you used, etc…
I think I understand what you are saying here, but you’ll have to forgive me; I’m less familiar with HDF5. I looked at the `H5Zdeflate.c` example, and I don’t see this in the code. I assume you did this in your application code, but I am trying to express parallelism in the filter itself. Additionally, when I look at `H5Dread_chunk` I see the following remark: “Also note that `H5D_READ_CHUNK` and `H5D_WRITE_CHUNK` are not supported under parallel” (emphasis mine). For the compressors that I support with LibPressio, MPI and parallel HDF5 are more important than threads, but I want to support both if I can.
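If I follow, the application-side pattern you describe is roughly the sketch below: each application thread compresses its own chunk, and only the raw chunk write is serialized. The `compress_chunk` helper and the mutex are my guesses at the details, not your code, so please correct me if I misread.

```c
#include <hdf5.h>
#include <pthread.h>
#include <stdlib.h>

/* hypothetical helper: compress one chunk and return a malloc'd buffer */
extern void *compress_chunk(const void *buf, size_t nbytes, size_t *out_bytes);

/* my assumption of the application-side pattern: chunks are compressed by
 * the application's own threads, and only the raw chunk write is serialized */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

static void write_one_chunk(hid_t dset_id, const hsize_t *offset,
                            const void *chunk, size_t chunk_bytes)
{
    size_t compressed_bytes = 0;
    void *compressed = compress_chunk(chunk, chunk_bytes, &compressed_bytes);

    /* only one thread at a time issues the raw chunk write */
    pthread_mutex_lock(&write_lock);
    H5Dwrite_chunk(dset_id, H5P_DEFAULT, 0 /* filter mask: all filters applied */,
                   offset, compressed_bytes, compressed);
    pthread_mutex_unlock(&write_lock);

    free(compressed);
}
```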
What I was asking is: “what guarantees does my filter plugin need to provide to HDF5?” Do I need to promise thread safety? Do I need to protect against multiple threads interleaving calls to `set_local` and `filter`, etc.?
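To make the question concrete, here is a minimal sketch of the kind of defensive locking I am asking about. The callback names, the filter id, and the shared state are my own placeholders; the question is whether HDF5 requires this of a plugin, or whether it already serializes these calls itself.

```c
#include <hdf5.h>
#include <pthread.h>

#define H5Z_FILTER_LIBPRESSIO 32768 /* hypothetical filter id for this sketch */

/* hypothetical shared state that both callbacks might touch */
static pthread_mutex_t plugin_lock = PTHREAD_MUTEX_INITIALIZER;
static int global_config;

static herr_t set_local_libpressio(hid_t dcpl_id, hid_t type_id, hid_t space_id)
{
    (void)dcpl_id; (void)type_id; (void)space_id;
    /* is this lock required, or does HDF5 already serialize set_local calls? */
    pthread_mutex_lock(&plugin_lock);
    global_config = 1; /* placeholder for inspecting the dataset's properties */
    pthread_mutex_unlock(&plugin_lock);
    return 0;
}

static size_t filter_libpressio(unsigned flags, size_t cd_nelmts,
                                const unsigned cd_values[], size_t nbytes,
                                size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values; (void)buf_size; (void)buf;
    /* likewise: can two threads be inside filter() for different chunks? */
    pthread_mutex_lock(&plugin_lock);
    size_t result = nbytes; /* placeholder: no actual (de)compression here */
    pthread_mutex_unlock(&plugin_lock);
    return result;
}

const H5Z_class2_t H5Z_LIBPRESSIO[1] = {{
    H5Z_CLASS_T_VERS, (H5Z_filter_t)H5Z_FILTER_LIBPRESSIO,
    1 /* encoder */, 1 /* decoder */, "libpressio (sketch)",
    NULL /* can_apply */, set_local_libpressio, filter_libpressio,
}};
```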
> As for MPI: are you planning to distribute the chunk to compress it, then save it on each participating rank?
That is definitely one option, but not the one I was currently considering. One of the compressors for LibPressio is called libpressio-opt: an optimizing meta-compressor which uses MPI to perform a parallel search over the configuration of an underlying lossy or lossless compressor, maximizing some objective (typically compression ratio, sometimes compression speed) while maintaining some user-defined quality standard (e.g. that for each chunk the Kolmogorov–Smirnov test p-value between the compressed and decompressed chunk is not significant, or that the Peak Signal to Noise Ratio is at least some level in dB). For performance reasons, this search uses MPI parallelization.
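As a rough illustration of the shape of that search (not libpressio-opt’s actual implementation), each rank might evaluate a different candidate configuration, discard candidates that violate the quality bound, and then agree on the best survivor with a reduction. The `try_candidate` helper and the PSNR constraint are placeholders.

```c
#include <mpi.h>
#include <math.h>

/* hypothetical helper: evaluate one candidate configuration of the underlying
 * compressor, returning its compression ratio and reporting its PSNR */
extern double try_candidate(int candidate, double *psnr_db);

/* sketch of one round of a constrained parallel search over configurations */
int pick_best_candidate(MPI_Comm comm, double min_psnr_db)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* each rank evaluates a different candidate configuration */
    double psnr = 0.0;
    double ratio = try_candidate(rank, &psnr);
    if (psnr < min_psnr_db)
        ratio = -INFINITY; /* candidate violates the quality constraint */

    /* agree on the candidate with the best objective value */
    struct { double value; int index; } local = { ratio, rank }, best;
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC, comm);
    return best.index; /* rank (and candidate id, in this sketch) of the winner */
}
```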
> The goodness of compression may be inversely proportional to chunk size.
Of course; it almost always is.
> IMHO: the parallelism of the compressor/decompressor is an implementation detail decided at runtime; after all, you don’t know where your software will run.
I completely agree. LibPressio models compressors as a tree-based hierarchy whose nodes are compressors that share parallel resources (i.e. threads, MPI ranks, CUDA devices, etc.), so that users can control the contention for those resources between the various compressors. Users choose which level of the hierarchy to allocate the provided threads or MPI ranks to using directives like `pressio:nthreads`, `distributed:mpi_comm`, and `distributed:n_worker_groups` stored in a `pressio_options` structure. I’m not trying to make these decisions on my own; I’m trying to figure out how to honor the requests that users make through this facility when they use LibPressio via HDF5 filters.
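For example, through the C API that request looks roughly like the sketch below; the compressor id and the option values are placeholders, and the question is where such a request could live when the compressor is reached through an HDF5 filter instead.

```c
#include <libpressio.h>
#include <mpi.h>

/* sketch: a user asking LibPressio for a particular parallel layout.
 * "pressio:nthreads" and "distributed:n_worker_groups" are the option names
 * mentioned above; the compressor id and the values are placeholders. */
void configure_parallelism(MPI_Comm comm)
{
    struct pressio* library = pressio_instance();
    struct pressio_compressor* comp =
        pressio_get_compressor(library, "opt" /* placeholder id */);

    struct pressio_options* opts = pressio_options_new();
    pressio_options_set_uinteger(opts, "pressio:nthreads", 4);
    pressio_options_set_uinteger(opts, "distributed:n_worker_groups", 8);
    /* the communicator would also be handed over here via distributed:mpi_comm;
     * the exact setter for an MPI_Comm is elided in this sketch */
    (void)comm;

    pressio_compressor_set_options(comp, opts);

    pressio_options_free(opts);
    pressio_compressor_release(comp);
    pressio_release(library);
}
```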
I agree these are probably not things to persist to `cd_values`, because those values persist until we read the data back in, and by then the machine and available resources might have changed.
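By way of contrast, what I would expect to persist from `set_local` into `cd_values` is the kind of data-describing parameter sketched below, never runtime resources like thread counts or communicators; the particular parameters and filter id are hypothetical.

```c
#include <hdf5.h>

#define H5Z_FILTER_LIBPRESSIO 32768 /* hypothetical id, as in the sketch above */

/* sketch: persist only parameters that describe the data, so that the filter
 * sees the same values when it runs again at decompression time */
static herr_t set_local_persist_example(hid_t dcpl_id, hid_t type_id, hid_t space_id)
{
    unsigned cd_values[2];
    cd_values[0] = (unsigned)H5Tget_size(type_id);                 /* element size */
    cd_values[1] = (unsigned)H5Sget_simple_extent_ndims(space_id); /* dataset rank */

    return H5Pmodify_filter(dcpl_id, H5Z_FILTER_LIBPRESSIO, H5Z_FLAG_MANDATORY,
                            2, cd_values);
}
```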
> The Intel MKL BLAS controls thread-level parallelism with ENV variables.
I know this is a popular method of controlling threading, but it doesn’t generalize well to all of the kinds of parallelism used by some applications. For example, in an MPI application the user might programmatically allocate an `MPI_Comm` subcommunicator and provide that to libpressio-opt to do compression with.
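That is, something along these lines; the split criterion is arbitrary here, and the point is only that the communicator exists purely at runtime and is handed over programmatically, so it cannot be expressed as an environment variable or persisted in `cd_values`.

```c
#include <mpi.h>

/* sketch: carve out a subcommunicator for compression work at runtime.
 * The choice of the first 8 ranks, and the idea of then handing this
 * communicator to libpressio-opt via its options, are assumptions. */
MPI_Comm make_compression_comm(MPI_Comm parent)
{
    int rank;
    MPI_Comm_rank(parent, &rank);

    int color = (rank < 8) ? 0 : MPI_UNDEFINED;
    MPI_Comm compression_comm = MPI_COMM_NULL;
    MPI_Comm_split(parent, color, rank, &compression_comm);

    /* ranks outside the group get MPI_COMM_NULL; the rest would pass this
     * handle to the compressor through distributed:mpi_comm */
    return compression_comm;
}
```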