Multi-threading in the HDF5 library


#1

Dear members of the HDF5 community,

We would like to share with you a document on some approaches we are considering for enabling multi-threading in HDF5.

I am attaching a document that outlines the considered use cases and possible solutions. This is a high-level document and doesn’t go into implementation details. We hope to get your input on the use cases, spur further discussion on implementation approaches, and maybe get some contributions :smile:

Multi-threaded-HDF5-2019-12-09.pdf (129.1 KB)

Thank you!

Elena


#2

I would like to add a use case concerning Oxford Nanopore Technologies (ONT) raw signal data processing. Note that I explain the problem only briefly here; if something is unclear or additional information is required, I can provide it.

ONT, a third-generation sequencing company, uses the HDF5 file format to store its raw data. Using raw signal data for downstream analysis has become a popular approach among research scientists, and open-source research tools such as Nanopolish have been developed for it. Such tools need to read the raw Nanopore data efficiently, as opposed to writing it. Since Nanopore HDF5 data files are terabytes in size, the lack of efficient multi-threaded access in the HDF5 library poses a significant bottleneck for these tools.

We profiled the time spent on HDF5 I/O while performing a Nanopore raw-data-based analysis (methylation calling) on a Nanopore PromethION dataset. The profiling was done on a server with 72 Intel CPU cores, 384 GB of RAM, and an HDD RAID array composed of 12 spinning disks (RAID level 6). The data processing used all available CPU cores, while the HDF5 reading was performed by a single thread. Unfortunately, data processing took only 2.92 hours while HDF5 I/O took 69.27 hours.

To verify that the above-mentioned high I/O time is due to a limitation in the HDF5 library, rather than something else, we conducted an experiment in which we read a Nanopore dataset (not the complete dataset above; we used a smaller one to keep the runtime manageable) first using multiple threads and then using multiple processes. For the multi-threaded case, we launched a thread pool and the main thread assigned the HDF5 reads to it. As expected, there was no improvement in the I/O time (it even got worse on SSD) – see the figure below. For the multi-process case, we launched a process pool and the parent process assigned its HDF5 reads to the child processes, with inter-process communication performed over unnamed pipes. As expected, the I/O performance improved dramatically with the number of I/O processes – see the figure below.

In fact, we implemented this I/O-process-based approach in one of the Nanopore data-processing tools. However, I believe such a method is just a hack to work around the problem rather than a proper solution. In addition, the user-space code becomes complicated and difficult to manage when such an approach is implemented there. For instance, see the amount of additional, complex code (links below) that had to be written to implement this approach in user space.
https://github.com/hasindu2008/f5c/blob/cea05f7e9bb5c7cf5404787ef2b3369f290357b6/src/f5c.c#L29-L308
https://github.com/hasindu2008/f5c/blob/cea05f7e9bb5c7cf5404787ef2b3369f290357b6/src/f5c.c#L782-L934
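For readers who don’t want to wade through the f5c sources, the general shape of the pattern is roughly the following. This is a stripped-down sketch, not the actual f5c code; the file layout, dataset names, and work split are invented. Each worker process opens its own, fully independent HDF5 handle, so the reads proceed in parallel across processes, and results come back over unnamed pipes:

```c
#include <hdf5.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC  4      /* size of the I/O process pool */
#define NREADS 1000   /* number of per-read groups in the file (invented) */

int main(void)
{
    int pipes[NPROC][2];

    for (int p = 0; p < NPROC; p++) {
        pipe(pipes[p]);
        if (fork() == 0) {                    /* worker process */
            close(pipes[p][0]);
            /* Independent HDF5 handle per process: no shared library state */
            hid_t file = H5Fopen("reads.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
            double sum = 0.0;
            for (int r = p; r < NREADS; r += NPROC) {   /* disjoint share */
                char name[64];
                snprintf(name, sizeof name, "/read_%d/signal", r);
                hid_t dset  = H5Dopen2(file, name, H5P_DEFAULT);
                hid_t space = H5Dget_space(dset);
                hsize_t n;
                H5Sget_simple_extent_dims(space, &n, NULL);
                float *sig = malloc(n * sizeof *sig);
                H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL,
                        H5P_DEFAULT, sig);
                for (hsize_t i = 0; i < n; i++) sum += sig[i]; /* stand-in work */
                free(sig);
                H5Sclose(space);
                H5Dclose(dset);
            }
            H5Fclose(file);
            write(pipes[p][1], &sum, sizeof sum);  /* result back via the pipe */
            _exit(0);
        }
        close(pipes[p][1]);
    }

    double total = 0.0, part;
    for (int p = 0; p < NPROC; p++) {   /* parent collects partial results */
        read(pipes[p][0], &part, sizeof part);
        total += part;
        wait(NULL);
    }
    printf("sum over all signals: %f\n", total);
    return 0;
}
```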


#3

Delighted to see the new improvements and the work put into this document, and thank you for being interested in public opinion. I find all the proposed solutions thought-provoking. Speaking from the H5CPP perspective, I would prefer a solution where users are able to select the threading model at runtime, as opposed to compile time, with the constraint that selecting only a single thread imposes no observable penalty, assuming ~1 MB chunk size.

Here are my thoughts, in the order listed in section 2 of the PDF document (order of listing, not of preference):

  1. “…utilizes the a-priori knowledge…”: Is it possible to turn synchronisation primitives off via a property list, at runtime? Something like a hypothetical H5?set_thread( N ), where N>1 enables multi-threading? This proposed approach appears to provide good control over threads, and enough abstraction to use either the planned HDF5 C API threading or an alternative solution depending on context.
  2. client-server approach: “…pool of processes to perform the desired reads as directed by user threads…”: Isn’t this something MPI-IO provides? On H5CLUSTER something similar is done by controlling N OrangeFS I/O servers, and it leads to good throughput.
  3. “…is based on pushing multi-threading to the HDF5 VFD layer”: Does this mean having a clear separation between C API code with synchronisation primitives and C API code without them? Is it possible to switch between VFD layers at runtime?
  4. “…using multi-threaded chunk cache”: H5CPP utilises the newly released direct chunk read/write calls and has a thread-safe chunk cache outside the C API. This design casts compression filtering into an embarrassingly parallel (independent) problem. It leads to a clean design and near-linear scaling up to the available physical cores (a rough sketch of this pattern follows this list).
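
To make point 4 concrete, here is my rough reconstruction of that pattern, not actual H5CPP code: raw compressed chunks are fetched through the C API under a lock, while the gzip decompression, the expensive part, runs embarrassingly parallel across user threads. The file/dataset names and the gzip-only 1-D layout are assumptions, and a recent HDF5 (1.10.2+ for H5Dread_chunk) is required; compile with -fopenmp:

```c
#include <hdf5.h>
#include <zlib.h>
#include <stdint.h>
#include <stdlib.h>

#define NCHUNKS     8
#define CHUNK_ELEMS 262144   /* must match the dataset's chunk size (assumed) */

int main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
    int  *out  = malloc(sizeof(int) * NCHUNKS * CHUNK_ELEMS);

    #pragma omp parallel for
    for (int c = 0; c < NCHUNKS; c++) {
        hsize_t  offset[1] = {(hsize_t)c * CHUNK_ELEMS};
        hsize_t  zsize     = 0;
        uint32_t filters   = 0;
        void    *zbuf;

        /* The HDF5 calls stay serialized - the library is not re-entrant
           here - but they only move raw, still-compressed bytes. */
        #pragma omp critical
        {
            H5Dget_chunk_storage_size(dset, offset, &zsize);
            zbuf = malloc(zsize);
            H5Dread_chunk(dset, H5P_DEFAULT, offset, &filters, zbuf);
        }

        /* zlib inflation runs concurrently, one chunk per thread */
        uLongf rawlen = sizeof(int) * CHUNK_ELEMS;
        uncompress((Bytef *)(out + (size_t)c * CHUNK_ELEMS), &rawlen,
                   (const Bytef *)zbuf, zsize);
        free(zbuf);
    }

    free(out);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```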

In the hope others have more and better insight, wishing best: steve


#4

I would suggest steering away from the library being internally threaded and leaving that flexibility up to the user (exception noted at the end). If this means one cannot expect multiple writers to one file simultaneously, that is OK – I would rather use VDS to glue together a parallel file than have the complications and performance flaws of distributed filesystems creep into the performance HDF5 can deliver. That might let me write files in local memory and move them into distributed spaces later, for instance, yet still end up with something good for archiving that needs no further post-processing. I don’t think you’ll scratch the performance itch, or cover the variety of configurations, with internal threading either – as I’ve brought up previously: what if you have multiple drive or RAID partitions you are doing I/O to, and how do you want to think about tmpfs (RAM) filesystems? The library should not impose unexpected performance penalties or locking/starvation based on how any one thread or device is performing when the others are independent.
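
To illustrate the VDS route just mentioned: per-thread (or per-process) part files can later be glued into one logical dataset without rewriting any data. A minimal sketch (HDF5 >= 1.10), assuming two part files each holding a 1-D dataset "data" of N elements; all names here are invented:

```c
#include <hdf5.h>
#include <stdio.h>

#define N      1024   /* elements per part file (assumed) */
#define NPARTS 2

int main(void)
{
    hsize_t vdims[1] = {NPARTS * N};
    hsize_t sdims[1] = {N};
    hid_t vspace = H5Screate_simple(1, vdims, NULL);
    hid_t sspace = H5Screate_simple(1, sdims, NULL);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);

    /* Map each part file onto its slice of the virtual dataset. */
    for (int i = 0; i < NPARTS; i++) {
        char fname[32];
        hsize_t start[1] = {(hsize_t)i * N};
        snprintf(fname, sizeof fname, "part-%d.h5", i);
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, NULL, sdims, NULL);
        H5Pset_virtual(dcpl, vspace, fname, "data", sspace);
    }

    hid_t file = H5Fcreate("glued.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, vspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset); H5Fclose(file);
    H5Pclose(dcpl); H5Sclose(sspace); H5Sclose(vspace);
    return 0;
}
```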

I think the client/server approach that internally uses some form of IPC and processes is interesting, but unfortunately I think it’s a lot riskier (a time sink) than it might seem. You get all the flaws and problems of trying to do efficient, performant, true inter-process communication, which is very platform-dependent programming with far too much variation to get right. The subtle behavioural differences will also be hard to get right or to fully implement and forward. And of course it’s inefficient: even if you use shared memory, you have to do quite a bit of extra work and you still end up with copies and synchronization problems.

I am aware of selection-based I/O; from my own experience, you will need multiple “reactors” to process this – one will be a bottleneck. Again, fully separate files/intermediate data structures completely remove the need for this by letting the user do their own thing per thread; it’s “simpler”. It also makes me think HDF5 is becoming a bit of a networking stack, but I guess if MPI is involved I can see that existing.

Food for thought: perhaps other third-party data-structure libraries could help you substitute out the parts of HDF5 that are old, clunky, and not thread-minded. C++ exposed as a C API gives you modern atomics and thread primitives, for instance, and it may make implementing or adopting some of the other data structures easier as well. Just saying: you went CMake – this could be a way to cut costs and shorten the timeline. From the threading perspective alone, the locking practices there are much more structured, safe, lightweight, and cross-platform. Or not.

Well anyway, I’d work directly towards the “Outline of work for full multi-threaded HDF5 library”, and see whatever could be done to reduce risk and fully deliver on that goal in one year’s time.
In the meantime I would hedge and provide a path for the POSIX VFD to be fully thread-safe and concurrent with packet-table-style writing without filters. If it were possible without tons of work, I’d see if I could get shuffle and gzip to work there in chunked mode, and that would be the end of the road for this hedge/crutch mode while the long-term solution is being worked on.

I will have to backtrack a little and admit that if it came to compressing a huge chunk of data, parallel compression could make sense, but shuffle and gzip (at low levels) and other faster algorithms really don’t seem to need it, so it’s a mixed bag. I think I’d stick to the simpler route and parallelize compression later as an extended effort, since it’s easy for users to address it themselves (say, with lz4). Perhaps, if it were easier to provide the data to HDF5 in already-compressed form, this issue could be sidestepped by allowing the user to parallelize the filtering + compression however they see fit? An interesting thought for removing compute from the core I/O.
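
For what it’s worth, recent HDF5 versions (1.10.2+, if I recall correctly) already allow exactly that via the direct chunk write call: the user compresses chunks in parallel and hands the pre-compressed bytes to the library, so no filter runs on the core I/O path. A minimal sketch, with invented names and sizes; only the HDF5 call itself is serialized (compile with -fopenmp):

```c
#include <hdf5.h>
#include <zlib.h>
#include <stdlib.h>

#define NCHUNKS     8
#define CHUNK_ELEMS 262144                 /* 1 MiB chunks of int */

int main(void)
{
    hsize_t dims[1]  = {NCHUNKS * CHUNK_ELEMS};
    hsize_t cdims[1] = {CHUNK_ELEMS};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, cdims);
    H5Pset_deflate(dcpl, 6);               /* dataset still advertises gzip */

    hid_t file = H5Fcreate("precompressed.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Compression runs in parallel; only the HDF5 write is serialized. */
    #pragma omp parallel for
    for (int c = 0; c < NCHUNKS; c++) {
        int *raw = malloc(sizeof(int) * CHUNK_ELEMS);
        for (int i = 0; i < CHUNK_ELEMS; i++) raw[i] = c + i;  /* fake data */

        uLongf zlen = compressBound(sizeof(int) * CHUNK_ELEMS);
        Bytef *zbuf = malloc(zlen);
        /* compress2 produces the same zlib stream the deflate filter writes */
        compress2(zbuf, &zlen, (const Bytef *)raw,
                  sizeof(int) * CHUNK_ELEMS, 6);

        hsize_t offset[1] = {(hsize_t)c * CHUNK_ELEMS};
        #pragma omp critical
        H5Dwrite_chunk(dset, H5P_DEFAULT, 0 /* filter mask: all applied */,
                       offset, zlen, zbuf);
        free(zbuf);
        free(raw);
    }

    H5Dclose(dset); H5Fclose(file); H5Pclose(dcpl); H5Sclose(space);
    return 0;
}
```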

Glad this issue is finally being evaluated and worked on at the level it is! Wishing funds your way :-). Hopefully DOE etc. will take note of how important this should be to them.