Multi-threading in the HDF5 library

Dear members of the HDF5 community,

We would like to share with you a document on some approaches we are considering for enabling multi-threading in HDF5.

I am attaching a document that outlines the considered use cases and possible solutions. This is a high-level document and doesn’t go into implementation details. We hope to get your input on the use cases, stir further discussion on implementation approaches, and maybe get some contributions :smile:

Multi-threaded-HDF5-2019-12-09.pdf (129.1 KB)

Thank you!

I would like to add a use case concerning Oxford Nanopore Technologies (ONT) raw signal data processing. Note that I explain the problem only briefly here; if anything is unclear or more information is required, I can provide it.

ONT, a third-generation sequencing company, uses the HDF5 file format to store its raw data. The use of raw signal data for downstream analysis has become popular among research scientists, and open-source research tools such as Nanopolish have been developed for it. Such tools need to read the raw Nanopore data efficiently, as opposed to writing it. With Nanopore HDF5 data files being terabytes in size, the inefficient multi-threaded access in the HDF5 library poses a significant bottleneck for such tools.

We profiled the time spent on HDF5 I/O when performing a Nanopore raw-data-based analysis (methylation calling) on a Nanopore PromethION dataset. The profiling was done on a server with 72 Intel CPU cores, 384 GB of RAM and an HDD RAID array composed of 12 spinning disks (RAID level 6). The data was processed using all available CPU cores, while the HDF5 reads were performed by a single thread. Unfortunately, data processing took only 2.92 hours while HDF5 I/O took 69.27 hours.

To verify that the above-mentioned high I/O time is due to a limitation in the HDF5 library, rather than something else, we conducted an experiment in which we read a Nanopore dataset (a smaller one than the complete dataset above, to keep the runtime manageable) first using multiple threads and then using multiple processes. For the multi-threaded case, we launched a thread pool and the main thread assigned the HDF5 reads to this pool. As expected, there was no improvement in I/O time (it even got worse on SSD); see the figure below. For the multi-process case, we launched a process pool and the parent process assigned its HDF5 reads to the child processes, with inter-process communication performed over unnamed pipes. As expected, the I/O performance improved dramatically with the number of I/O processes; see the figure below.
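The two patterns compared above can be sketched in Python. The `read_record` hash loop below is only a stand-in for the actual HDF5 read, since the point here is the concurrency structure rather than the library calls:

```python
import hashlib
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

def read_record(record_id):
    # Placeholder for "open the HDF5 file, read one raw-signal record".
    # With a thread-safe HDF5 build, the library's global lock serializes
    # these calls, which is why the thread-pool variant does not scale.
    data = str(record_id).encode()
    for _ in range(1000):
        data = hashlib.sha256(data).digest()
    return record_id, data.hex()[:8]

record_ids = list(range(8))

# Pattern 1: the main thread assigns reads to a thread pool.
with ThreadPoolExecutor(max_workers=4) as tp:
    threaded = dict(tp.map(read_record, record_ids))

# Pattern 2: a pool of worker processes; results come back over pipes,
# much like the unnamed pipes described above. Each process has its own
# copy of the library state, so no global lock serializes the reads.
with mp.get_context("fork").Pool(4) as pp:  # "fork" keeps the sketch short
    multiproc = dict(pp.map(read_record, record_ids))
```

Both variants produce the same results; only the process pool sidesteps the library-wide lock.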

In fact, we implemented this process-based I/O approach in one of our Nanopore data processing tools. However, I believe that such a method is just a hack that works around the problem, rather than a proper solution. In addition, the user-space code becomes complicated and difficult to maintain when such an approach is implemented there. For instance, see the amount of additional complex code (links below) that had to be written to implement this approach in user space.

Delighted to see the new improvements and the work put into this document, and thank you for seeking public opinion. I find all the proposed solutions thought-provoking. Speaking from the H5CPP perspective, I would prefer a solution where users are able to select the threading model at runtime, as opposed to compile time, with the constraint that selecting only a single thread imposes no observable penalty, assuming ~1 MB chunk size.

Here are my thoughts, in the order listed in section 2 of the PDF document (order of appearance, not preference):

  1. “…utilizes the a-priori knowledge…”: Is it possible to turn synchronisation primitives off via a property list at runtime? H5?set_thread( N ) where N>1 => multi-threaded? This proposed approach appears to provide good control over threads, and a sufficient level of abstraction to use the planned HDF5 C API threading or an alternative solution depending on context.
  2. client-server approach: “…pool of processes to perform the desired reads as directed by user threads…”: Isn’t this something MPI-IO already provides? On H5CLUSTER something similar is done by controlling N OrangeFS I/O servers, and it leads to good throughput.
  3. “…is based on pushing multi-threading to the HDF5 VFD layer”: Does this mean having a clear separation between C API code with synchronisation primitives and C API code without them? Is it possible to switch between VFD layers at runtime?
  4. “…using multi-threaded chunk cache”: H5CPP utilises the newly released direct read/write calls and has a thread-safe chunk cache outside the C API. This design allows casting compression filters as an embarrassingly parallel (independent) problem. It leads to a clean design and near-linear scaling up to the available physical cores.

In the hope that others have more and better insight, wishing you the best: steve

I would steer away from the library being internally threaded and would leave that flexibility up to the user (with the exception noted at the end). If this means one cannot expect multiple writers to one file simultaneously, that is OK; I would rather use VDS to glue together a parallel file than have the complications and performance flaws of distributed filesystems creep into the performance HDF5 can deliver. That might let me write files in local memory and move them into distributed storage later, for instance, while still ending up with something good for archiving that needs no further post-processing. I also don’t think you’ll scratch the performance itch, or cover the variety of configurations, with this method. As I’ve brought up previously: what if you have multiple drives or RAID partitions you are doing I/O to, and how do you want to think about tmpfs (RAM) filesystems? The library should not impose unexpected performance penalties or locking/starvation based on how any one thread or device is performing when the others are independent.

I think the client/server approach that internally uses some form of IPC and processes is interesting, but unfortunately I think it’s a lot more risky (a time sink) than it might seem. You get all the flaws and problems of trying to do efficient, performant, true inter-process communication, which is very platform-dependent programming with far too much variation to get right. The subtle behavioral differences will also be hard to get right or to fully implement and forward. And of course it’s inefficient: even using shared memory, you have to do quite a bit of extra work, and you still end up with copies and synchronization problems.

I am aware of selection-based I/O; from my own experience, you will need multiple “reactors” to process it, since one will be a bottleneck. Again, fully separate files/intermediate data structures completely remove the need for this by letting the user do their own thing per thread; it’s “simpler”. It also makes me think HDF5 is becoming a bit of a networking stack, but I guess if MPI is involved I can see that happening.

Food for thought: perhaps other third-party data-structure libraries could help you substitute out the parts of HDF5 that are old, clunky and not thread-minded. C++ exposed as a C API gives you modern atomics and thread primitives, for instance, and it may also make implementing some of the other data structures easier. Just saying: you went with CMake, and this could be a way to cut costs and shorten the timeline. Just from the threading perspective, the locking practices there are much more structured, safe, lightweight and cross-platform. Or not.

Well anyway, I’d work directly towards the “Outline of work for full multi-threaded HDF5 library” and see what could be done to reduce risk and fully deliver on that goal in one year’s time.
In the meantime I would hedge and provide a path for the POSIX VFD to be fully thread safe and concurrent with packet-table-style writing without filters. If it were possible without tons of work, I’d see if I could get shuffle and gzip to work there in chunked mode, and that would be the end of the road for this hedge/crutch mode while the long-term solution is being worked on.

I will have to backtrack a little and admit that if it came to compressing a huge chunk of data, parallel compression could make sense; but shuffle and gzip (at low levels) and other, faster algorithms really don’t seem to need it, so it’s a mixed bag. I think I’d stick to the simpler route and parallelize compression later as an extended effort, since it’s easy for users to address it in other ways (say, lz4). Perhaps, if it were easier to provide data to HDF5 already in compressed form, this issue could be side-stepped by allowing the user to parallelize the filtering + compression however they see fit? An interesting thought for removing compute from core I/O.
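The “let the user parallelize filtering + compression” idea can be sketched with the standard library. The byte-shuffle and deflate steps below mirror HDF5’s built-in filters; the final hand-off of the compressed bytes to HDF5 (via a direct chunk write such as H5Dwrite_chunk) is left out so the sketch stays self-contained:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def shuffle(chunk: bytes, itemsize: int) -> bytes:
    # Byte-shuffle filter: gather byte 0 of every element, then byte 1, ...
    # so that similar bytes sit together and deflate compresses better.
    return bytes(chunk[j] for i in range(itemsize)
                 for j in range(i, len(chunk), itemsize))

def compress_chunk(chunk: bytes, itemsize: int = 4, level: int = 1) -> bytes:
    # zlib.compress releases the GIL, so a thread pool gives real parallelism.
    return zlib.compress(shuffle(chunk, itemsize), level)

chunks = [bytes(range(256)) * 16 for _ in range(8)]  # stand-in 4 KiB chunks
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(compress_chunk, chunks))
# Each compressed buffer would now be handed to HDF5 with one direct
# chunk-write call, keeping all filter compute out of the core I/O path.
```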

Glad this issue is finally being evaluated and worked on at the level it is! Wishing funds your way :-). Hopefully the DOE etc. will take note of how important this should be to them.

I’d like to add another restrictive use case that we have at ESRF: “one thread per file, writing data with direct chunk write”. Hopefully this is already possible?
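A minimal sketch of this one-writer-thread-per-file pattern, with plain binary files standing in for the HDF5 direct-chunk writes (with today’s thread-safe HDF5 build the real calls would still be serialized by the library’s global lock):

```python
import os
import queue
import tempfile
import threading

def writer(path, q):
    # Dedicated writer: the only thread that ever touches this file.
    with open(path, "wb") as f:
        while True:
            chunk = q.get()
            if chunk is None:   # sentinel: producers are done
                return
            f.write(chunk)      # real code: H5Dwrite_chunk(...) on this file

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"file{i}.bin") for i in range(3)]
queues = [queue.Queue() for _ in paths]
threads = [threading.Thread(target=writer, args=(p, q))
           for p, q in zip(paths, queues)]
for t in threads:
    t.start()
for i, q in enumerate(queues):  # producers enqueue chunks per file
    for _ in range(4):
        q.put(bytes([i]) * 1024)
    q.put(None)
for t in threads:
    t.join()
```

Because each file handle is owned by exactly one thread, the only shared state is the queues, which are safe by construction.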

I agree with nevion that threading should be the responsibility of the client application (e.g. if every library starts to use its own thread pool, there is a high risk of oversubscription).

Elena, thank you for sharing the document with us. As you already know, our HDF5 users are strongly convinced that a multi-threaded version of the library is urgently needed to reach high throughput in HPC environments.

Most HPC applications use multi-threading (via POSIX threads or directive-based approaches such as OpenMP) to process data in parallel, so each thread should have full control over when and where to access and store its own data. In these scenarios, the internal multi-threaded solution proposed in the document would be of no real benefit. We strongly suggest going for a solution that lets the client have explicit control over its threads and gives the client the responsibility of calling the HDF5 API under whatever requirements, restrictions or limitations need to be introduced to get multi-threading support.

For simplicity, and to address the HPC scenario I described earlier, let’s restrict the discussion to use cases where the required operations are read-only accesses (see use cases 1.A, 1.B, 2 and 3 in Appendix 1). Allowing multi-threaded read-only access to datasets would be a great improvement for the library in the HPC domain. In this scenario, each thread reads from a different dataset, in the same or a different HDF5 file. Data is read just once from each dataset, because the intent is to process the stored data and, using shared memory, combine it with data read by other threads. In these situations there is no real need for the caching/hashing mechanisms for metadata or chunks, which, if I understood correctly, the HDF Group identified as one of the critical components that make HDF5 hard to make thread safe. I really don’t see from your document what kind of thread-safety problem should arise in the library when different datasets are accessed read-only.
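The access pattern just described can be sketched with worker processes instead of threads, plain binary files standing in for HDF5 datasets, and a shared buffer into which each worker deposits its dataset exactly once:

```python
import multiprocessing as mp
import os
import tempfile

CHUNK = 1024
NDSETS = 4
# Lock-free shared buffer: workers write to disjoint regions, so no
# synchronization is needed beyond joining the processes.
buf = mp.Array("B", NDSETS * CHUNK, lock=False)

def read_dataset(idx, path):
    with open(path, "rb") as f:  # real code: H5Dopen + H5Dread, read-only
        buf[idx * CHUNK:(idx + 1) * CHUNK] = list(f.read())

tmpdir = tempfile.mkdtemp()
procs = []
for i in range(NDSETS):
    path = os.path.join(tmpdir, f"dset{i}.bin")
    with open(path, "wb") as f:  # create a stand-in "dataset"
        f.write(bytes([i]) * CHUNK)
    p = mp.get_context("fork").Process(target=read_dataset, args=(i, path))
    procs.append(p)
    p.start()
for p in procs:
    p.join()
combined = bytes(buf)            # all datasets combined in one buffer
```

If the chunk cache and related structures were made per-dataset (or switchable off), the same pattern could run with threads instead of processes, which is exactly the improvement proposed here.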

We could use H5F_ACC_RDONLY in H5Fopen and add to the file access property list a flag declaring the intention to access datasets from multiple threads. H5Dopen could use this information to turn off the chunk-caching mechanism and all related non-thread-safe machinery, or simply to create an independent instance of those structures attached to each opened dataset. This solution, which should be fairly simple, could address all the use cases where external threads independently access different datasets from the same, or different, files.

I wish I could be of some help and discuss it further.

Thank you for your precious work.

I am here to ask a simple question, trying to find out whether this limitation applies to my case.
I created a class that writes to HDF5 files, and I created three different objects of this class for three different files. I am trying to write to these files from three different buffers, using three different threads. It is not working for me: I can see only one file being written to at a time.
I understand that the multi-threading limitation affects access to the same file, but does it affect the case described above?