I would suggest steering away from making the library internally threaded and leaving that flexibility up to the user (one exception noted at the end). If that means giving up the expectation of multiple writers to one file simultaneously, that is OK - I would rather use VDS to glue together a parallel file than have the complications and performance flaws of distributed filesystems creep into what HDF5 can deliver. That would let me, for instance, write files in local memory and move them into distributed spaces later, while still ending up with something good for archiving that needs no further post-processing. I don’t think you’ll scratch the performance itch, or cover the variety of configurations, with any other method either - as I’ve brought up previously: what if you are doing I/O to multiple drives or RAID partitions, and how do you want to think about tmpfs (RAM) filesystems? The library should not impose unexpected performance penalties or locking/starvation based on how any one thread or device is performing when the others are independent.
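To make the “independent writers, glue later” idea concrete, here is a minimal stdlib-only sketch (no HDF5 calls; plain binary files stand in for per-thread HDF5 files, and a lazy “stitch” function plays the role a Virtual Dataset would):

```python
# Conceptual sketch, NOT the HDF5 API: each thread writes its own private
# file with zero shared locks; a cheap stitch pass afterwards presents the
# pieces as one logical dataset, the way an HDF5 VDS would, without copying
# or post-processing the raw data.
import struct
import tempfile
import threading
from pathlib import Path

def writer(path: Path, values: list) -> None:
    # Purely local I/O: no coordination with any other thread.
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<q", v))

tmp = Path(tempfile.mkdtemp())
parts = {tmp / f"part{i}.bin": list(range(i * 4, i * 4 + 4)) for i in range(3)}
threads = [threading.Thread(target=writer, args=(p, vals))
           for p, vals in parts.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

# "VDS-style" view: map each source file into one logical sequence lazily.
def stitched(paths) -> list:
    out = []
    for p in paths:
        data = p.read_bytes()
        out.extend(struct.unpack(f"<{len(data) // 8}q", data))
    return out

print(stitched(sorted(parts)))  # → [0, 1, 2, ..., 11]
```

In real HDF5 the stitch step would be `H5Pset_virtual` mappings rather than a read loop, but the structure of the work - unlocked private writes, a cheap mapping afterwards - is the same.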
I think the client/server approach that internally uses some form of IPC and worker processes is interesting, but unfortunately I think it’s a lot riskier (a time sink) than it might seem. You inherit all of the flaws and problems of trying to do efficient, performant inter-process communication, which is very platform-dependent programming with far too much variation to get right. The subtle behavioral differences will also be hard to fully implement and forward. And of course it’s inefficient: even with shared memory, you have to do quite a lot of extra work and you still end up with copies and synchronization problems.
I am aware of selection-based I/O. From my own experience, you will need multiple “reactors” to process it - one alone will be a bottleneck. Again, fully separate files/intermediate data structures completely remove the need for this by letting the user do their own thing per thread; it’s “simpler”. It also makes me think HDF5 is becoming a bit of a networking stack, but I guess if MPI is involved I can see that existing.
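A toy sketch of the reactor point, using only stdlib threads and queues (the squaring step is just a stand-in for servicing an I/O request): with one inbox per reactor, independent request streams never funnel through a single dispatcher.

```python
# Conceptual sketch: N reactors, each draining only its own queue, so fully
# independent producers share no choke point. Compare with one global queue,
# where every request serializes through a single reactor.
import queue
import threading

NUM_REACTORS = 4

def reactor(inbox: queue.Queue, results: list) -> None:
    while True:
        item = inbox.get()
        if item is None:      # shutdown sentinel
            return
        results.append(item * item)  # stand-in for performing an I/O request

inboxes = [queue.Queue() for _ in range(NUM_REACTORS)]
results = [[] for _ in range(NUM_REACTORS)]
threads = [threading.Thread(target=reactor, args=(q, r))
           for q, r in zip(inboxes, results)]
for t in threads:
    t.start()

# Each producer talks only to "its" reactor.
for i, q in enumerate(inboxes):
    for n in range(3):
        q.put(i * 10 + n)
    q.put(None)
for t in threads:
    t.join()

print(results)  # → [[0, 1, 4], [100, 121, 144], [400, 441, 484], [900, 961, 1024]]
```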
Food for thought: perhaps third-party data structure libraries could help you substitute out the parts of HDF5 that are old, clunky, and not thread-minded. C++ exposed as a C API gives you modern atomics and thread primitives, for instance, and it may make implementing some of the other data structures easier as well. Just saying: you already went to CMake - this could be a way to cut costs and shorten the timeline. From the threading perspective alone, the locking practices there are much more structured, safe, lightweight, and cross-platform. Or not.
Well anyway, I’d work directly towards the “Outline of work for full multi-threaded HDF5 library”, and do whatever could be done to reduce risk and fully deliver on that goal in one year’s time.
In the meantime I would hedge and provide a path for the POSIX VFD to be fully thread-safe and concurrent with packet-table-style writing without filters. If it were possible without tons of work, I’d see if shuffle and gzip could be made to work there in chunked mode, and that’s the end of the road for this hedge/crutch mode while the long-term solution is being worked on.
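The concurrency pattern I have in mind for packet-table-style appends can be sketched with stdlib calls (POSIX `os.pwrite`/`os.pread`; this is an illustration of the locking discipline, not the VFD itself): serialize only the offset claim, then let every thread do its own positioned write.

```python
# Sketch: concurrent fixed-size "packet" appends to one POSIX file. A tiny
# lock claims an offset; the actual write is an uncoordinated os.pwrite at
# that offset - roughly the behavior one would want from a fully
# thread-safe POSIX VFD in append mode.
import os
import struct
import tempfile
import threading

RECORD = 8  # fixed-size packets
fd, _path = tempfile.mkstemp()
next_offset = 0
claim_lock = threading.Lock()

def append(value: int) -> None:
    global next_offset
    with claim_lock:                       # only the offset claim serializes
        off, next_offset = next_offset, next_offset + RECORD
    os.pwrite(fd, struct.pack("<q", value), off)  # concurrent positioned write

threads = [threading.Thread(target=append, args=(v,)) for v in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

data = os.pread(fd, 100 * RECORD, 0)
print(sorted(struct.unpack("<100q", data)))  # all 100 packets landed intact
```

Order of packets is whatever the scheduler produced, but nothing is lost or torn, which is the property that matters for a packet table.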
I will have to backtrack a little and admit that if it came to compressing a huge chunk of data, parallel compression could make sense, but shuffle plus gzip (low level) and other fast algorithms really don’t seem to need it, so it’s a mixed bag. I’d stick to the simpler route and parallelize compression later as an extended effort, since it’s easy for users to address it themselves in other ways (say, lz4). Perhaps if it were easier to hand HDF5 the data in already-compressed form, the issue could be sidestepped entirely by letting the user parallelize the filtering + compression however they see fit? Interesting thought for removing compute from the core I/O path.
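A hedged sketch of that user-side filtering idea, in stdlib Python: the application runs a shuffle filter and deflate itself, across a thread pool (zlib releases the GIL on sizable buffers, so this genuinely parallelizes), and would then hand the library only finished compressed chunks - in HDF5 terms, the direct chunk-write path (`H5Dwrite_chunk`). The helper names here are illustrative, not any real API.

```python
# Sketch: user-side shuffle + deflate of chunks in parallel, so the library's
# write path only moves bytes and never runs a filter.
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

ELEM_SIZE = 8  # bytes per element; the shuffle filter needs to know this

def shuffle(data: bytes, elem_size: int = ELEM_SIZE) -> bytes:
    # Same idea as HDF5's shuffle filter: group byte 0 of every element,
    # then byte 1, ... which makes slowly-varying integers far more
    # compressible for deflate.
    n = len(data) // elem_size
    return bytes(data[e * elem_size + b]
                 for b in range(elem_size) for e in range(n))

def compress_chunk(chunk: bytes) -> bytes:
    return zlib.compress(shuffle(chunk), level=1)  # gzip-low-style setting

chunks = [struct.pack("<256q", *range(i * 256, (i + 1) * 256)) for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(compress_chunk, chunks))  # CPU off the I/O path

ratio = sum(map(len, compressed)) / sum(map(len, chunks))
print(f"compressed to {ratio:.0%} of original size")
```

The "write" step that would follow is pure byte movement, which is exactly the separation of compute from core I/O suggested above.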
Glad this issue is finally being evaluated and worked on at the level it is! Wishing funds your way :-). Hopefully the DOE etc. will take note of how important this should be to them.