Asynchronous writes with HDF5?

samuel.debionne · February 18, 2020, 6:09pm

This question has been raised several times in the past. I thought I’d better open a new topic rather than revive one of the older threads; I am sorry if that is not the HDF policy.

Well, everything is in the title: is there any plan/progress on a SEC2_AIO VFD?

I read “Asynchronous I/O and Multi-threaded Concurrency in HDF5: Paths Forward” from Quincey Koziol and the future looks bright! But the document is from 2014.

Asynchronous I/O would be very useful in the context of a task management system using a thread pool where I/O and processing are mixed. Blocking tasks in a thread pool are a waste of resources; Intel TBB, for instance, has introduced a new kind of node, async_node, in its Flow Graph API to tackle this issue.

For our specific needs, I think that implementing async operations on direct chunk writes would be enough; there is no need for an async variant of the whole HDF5 API.

I found an ASYNC VOL on bitbucket but I am not sure if that would fit our needs. Is it compatible with the native VFDs?

Thanks,
Sam

koziol · February 18, 2020, 7:17pm

Hi Sam,
We’re actively working on the async VOL connector and would be interested in hearing more about your use case. Are you working on an HPC application, or something else?

Regards,

	Quincey Koziol

	koziol@lbl.gov

samuel.debionne · February 19, 2020, 5:02pm

Hi Quincey,

Glad to hear you are actively working on it! My use case is basically high-performance data acquisition, where saving is, most of the time, the bottleneck. We do online analysis as well, but DAQ is designed to be real-time. Our library uses direct chunk writes to bypass some HDF5 core layers. Task management is based on the Intel TBB flow graph with basically two nodes (for saving): one for compression and one for chunk writing. The concurrency of the write node is limited to one per output file since HDF5 has a global mutex. Since TBB uses a thread pool to implement task management, any blocking operation on a worker thread is an issue, hence the interest in an async API for writing chunks. That would allow the worker thread to interleave some computation between two write operations.

Is this the kind of info you were looking for? Let me know if you need something more specific…
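
One way to picture the hand-off described above is a dedicated I/O thread per output file fed by a queue, so that pool workers only enqueue and return. The following is a minimal sketch using only the C++ standard library; `Chunk`, `ChunkWriter` and `write_fn` are hypothetical names for illustration, and `write_fn` merely stands in for whatever blocking write call the application uses. It is not the TBB or HDF5 API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// A compressed chunk ready to be written (hypothetical type for this sketch).
struct Chunk {
    std::size_t index;        // position of the chunk in the dataset
    std::vector<char> bytes;  // already-compressed payload
};

// Owns one I/O thread. Worker threads call submit(), which only enqueues the
// chunk and returns, so they never block on the write itself.
class ChunkWriter {
public:
    // write_fn stands in for the real blocking call (assumption of this sketch).
    explicit ChunkWriter(std::function<void(const Chunk&)> write_fn)
        : write_fn_(std::move(write_fn)), io_thread_([this] { run(); }) {}

    ~ChunkWriter() { close(); }

    void submit(Chunk c) {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(!closed_ && "submit() after close()");
            queue_.push(std::move(c));
        }
        cv_.notify_one();
    }

    // Drain the queue and join the I/O thread. Safe to call more than once.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_one();
        if (io_thread_.joinable()) io_thread_.join();
    }

private:
    void run() {
        for (;;) {
            Chunk c;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return closed_ || !queue_.empty(); });
                if (queue_.empty()) return;  // closed and fully drained
                c = std::move(queue_.front());
                queue_.pop();
            }
            write_fn_(c);  // the blocking write happens off the worker threads
        }
    }

    std::function<void(const Chunk&)> write_fn_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Chunk> queue_;
    bool closed_ = false;
    std::thread io_thread_;  // declared last so it starts after the other members
};
```

Because there is a single consumer, chunks reach `write_fn` in submission order. The queue here is unbounded; a real acquisition pipeline would probably want a bound to apply back-pressure when storage cannot keep up.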

I also read about the development in the context of the “Fast Forward” projects with Intel? Did any of these developments make it to the official HDF5 releases?

koziol · February 19, 2020, 7:40pm

Hi Sam,
Yes, that’s the sort of information I was interested in, thanks.

I’m not certain that asynchronous I/O (or even relaxing the global lock in HDF5 and making the library fully thread concurrent) would help your situation much.  Async I/O (or multiple user-level threads performing I/O) is only going to be a big win when there’s enough “compute” to completely overlap with the I/O operations.  If all you have is streamed I/O, it’s possible that async I/O would be good, to a point, by guaranteeing that the I/O pipeline was consistently full, but after you have enough I/O queued up, you are going to be limited by hardware (either the storage bandwidth or memory capacity).   The compression tasks will help, but do you know the total time for compressing vs. the total time for I/O?  If the compression time is << I/O time, there’s not a lot of gain in async I/O.
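
As a rough back-of-the-envelope model of the point above (an idealisation that assumes perfect overlap and no contention for hardware, not something measured in this thread), let T_c be the total compute time and T_io the total I/O time:

```latex
\begin{align*}
T_{\text{sync}}  &= T_c + T_{io} \\
T_{\text{async}} &\ge \max(T_c,\, T_{io}) \\
S = \frac{T_{\text{sync}}}{T_{\text{async}}}
  &\le \frac{T_c + T_{io}}{\max(T_c,\, T_{io})}
   = 1 + \frac{\min(T_c,\, T_{io})}{\max(T_c,\, T_{io})} \le 2
\end{align*}
```

For example, with T_c = 1 s and T_io = 10 s the best possible speedup is 11/10 = 1.1, while the upper bound of 2 is only approached when T_c and T_io are about equal.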

Quincey

samuel.debionne · February 20, 2020, 2:22pm

Hi Quincey,
Thank you for your input, I need to benchmark further to make the right decisions. My previous answer was probably misleading: we have other processing tasks besides compression in our pipeline (e.g. limited image processing).

after you have enough I/O queued up, you are going to be limited by hardware

You are right, I am counting on the fact that the I/O burst buffers are well-dimensioned by our IT geniuses! Again more benchmarks are needed…

Just to be sure, the “global” lock in HDF5 is at the file level, right?

Regarding the ASYNC VOL, would you say that the project is ready enough to be tested?

steven · February 20, 2020, 2:53pm

Hello, great thread!
Apologies for dropping into the conversation. If C++ is an option for you, I provide an alternative data processing pipeline in H5CPP for h5::append and regular I/O, based on direct chunk write. AFAIK it takes regular SSD-based hardware to its limit. Multithreaded compressors/filters can also be added.

h5::append offers a user-space buffer for tiny fragments, and breaks up blocks larger than the chunk size for direct chunk processing. According to my measurements, without compression it already places I/O throughput in the ballpark of the underlying filesystem.
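
The buffering idea can be sketched roughly as follows. This is a simplified one-dimensional illustration with hypothetical names (`ChunkBuffer`, `Sink`), not H5CPP's actual implementation; the sink stands in for whatever consumes full chunks, such as a compressor followed by a direct chunk write.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Accumulates appended elements and emits fixed-size chunks (hypothetical
// helper for this sketch).
class ChunkBuffer {
public:
    using Sink = std::function<void(const std::vector<int>&)>;

    ChunkBuffer(std::size_t chunk_size, Sink sink)
        : chunk_size_(chunk_size), sink_(std::move(sink)) {
        assert(chunk_size_ > 0);
        pending_.reserve(chunk_size_);
    }

    // Accepts a fragment of any length: small ones are buffered, large ones
    // are split, and each time a full chunk is available it goes to the sink.
    void append(const int* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            pending_.push_back(data[i]);
            if (pending_.size() == chunk_size_) {
                sink_(pending_);
                pending_.clear();
            }
        }
    }

    // Number of elements still waiting for a chunk to fill up.
    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t chunk_size_;
    Sink sink_;
    std::vector<int> pending_;
};
```

With a chunk size of 4, appending a 3-element fragment emits nothing, and appending 7 more elements emits two full chunks and leaves 2 elements pending. A real implementation would also need a final flush for the partial chunk and would copy in blocks rather than element by element.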

This mechanism can be further improved by adding threads for compression, which makes sense on platforms with extra (available) processing power. Since the chunks are compressed independently, the parallelism is good.
This approach delegates thread control to user level, as you are responsible for protecting the internal buffers. The actual I/O is done on a single thread, and it scales well.

The mechanism can be enabled with the h5::high_throughput custom data access property, and requires linking against version 1.10.4 or higher.

For the best experience, it is a good idea to have the HDF5 library compiled without internal locking, and to do thread-level control at a higher level.

If there is any interest, get in touch and I will help you tune it for your hardware.
steve

samuel.debionne · February 21, 2020, 4:55pm

Hi Steve,
Thank you for the heads-up on H5CPP. Actually your project was mentioned at the last HDF5 European Workshop in Grenoble.

For the best experience, it is a good idea to have the HDF5 library compiled without internal locking, and to do thread-level control at a higher level.

That is interesting! Did you measure any significant performance loss with --enable-threadsafe? Most binary packages are compiled with this option (we use Conda, for instance). But since we have complete control over the threads at the app level, we could definitely recompile without it.

h5::append offers a user-space buffer for tiny fragments, and breaks up blocks larger than the chunk size for direct chunk processing.

With compression, do any of the h5::append tweaks that you mention apply? We are already compressing in parallel, but limiting the concurrency of saving to one thread.

steven · February 24, 2020, 4:21am

I did not measure it. Pulling in additional code when it is not in use is generally not recommended. I am excitedly waiting for the new thread model for the CAPI so that I can measure its performance. OTOH the C++ thread model has gone through changes, and implementing multithreaded behaviour in a portable fashion has become less complex. As for cost, AFAIK it is ~25 ns per mutex call.

This is the model I vouch for on single drive/serial file systems; as this approach saturates the underlying filesystem – according to my study the HDF5 CLIB direct chunk calls do rather well.

Yes. The current single threaded implementation is simple, and fast; correct up to 7 dimensions.

To convert blocks into smaller chunks with correct edge handling: H5Zpipeline.hpp (9.7 KB)
Single-threaded 'basic_pipeline' implementation: H5Zpipeline_basic.hpp (2.5 KB)
Filter/compressor dispatch to C libraries: H5Zall.hpp (3.6 KB)

steve

samuel.debionne · February 25, 2020, 3:32pm

@koziol Hi Quincey, could you confirm that it is safe to build with --disable-threadsafe when writing to two different files from two different threads?

koziol · February 25, 2020, 3:45pm

Hi Sam,
The global lock is around the entire library, not individual files. Therefore, only one thread can be in the HDF5 library, no matter which API routine is called (or what file the operation might be on).
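
To make "around the entire library" concrete, here is a toy model, assuming nothing about HDF5's real internals: every entry point of a hypothetical `toy` library takes the same mutex, so two threads working on two different "files" still never execute library code at the same time.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

namespace toy {  // hypothetical stand-in for a library with one global lock

std::mutex global_lock;  // one lock for the whole library, not one per file
int inside = 0;          // threads currently inside the library (guarded by the lock)
int max_inside = 0;      // highest value of `inside` ever observed

// Every API entry point takes the same lock, whatever file it operates on.
void write(std::vector<int>& file, int value) {
    std::lock_guard<std::mutex> guard(global_lock);
    ++inside;
    max_inside = std::max(max_inside, inside);
    file.push_back(value);
    --inside;
}

}  // namespace toy

// Two threads, two different "files": the calls still run one at a time.
int max_threads_inside_library() {
    std::vector<int> file_a, file_b;
    std::thread t1([&] { for (int i = 0; i < 1000; ++i) toy::write(file_a, i); });
    std::thread t2([&] { for (int i = 0; i < 1000; ++i) toy::write(file_b, i); });
    t1.join();
    t2.join();
    assert(file_a.size() == 1000 && file_b.size() == 1000);
    return toy::max_inside;  // safe to read: both threads have been joined
}
```

`max_threads_inside_library()` returns 1 regardless of how the two threads are scheduled, which is the behaviour described above: correctness is preserved, but there is no concurrency inside the library.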

Yes, the async VOL is ready to test - it’s in sync w/the 1.12.0 release and we are running benchmarks with it currently, while polishing it to full production quality.  (I can’t remember if I mentioned this before, but you’ll need the ‘async’ branch in the HDF5 git repo to use the async VOL connector)

Quincey

koziol · February 25, 2020, 3:47pm

Hi Sam,
Apologies for the delay in replying to your earlier email. To reiterate that response: no, it’s not safe to disable the threadsafe option when writing to different files. It must be enabled for any multi-threaded access to HDF5 API calls.

Quincey

nevion · February 26, 2020, 7:49am

@steven actually been meaning to ask, is anything about h5cpp known to be thread-unsafe on top of thread-safe hdf5? Like I wouldn’t expect 2 threads using the same object user side to be threadsafe, but I mean is there any shared/global state that will get mucked up if h5cpp is used from multiple threads on different files?

steven · February 26, 2020, 4:33pm

@nevion No. H5CPP is thread safe when used with a thread safe HDF5 Library.

No, H5CPP doesn’t introduce any new global state, but it does rely on the CAPI internal state for object reference counting, etc…

From the code base, the HDF5 CAPI relies on skip lists to associate hid_t handles with internal objects, and also features data structures for metadata, indexing chunks, etc… All of these may be modelled as a single global variable per property list, object reference, […] that H5CPP methods access when invoked. Since H5CPP hid_t types are binary compatible with the CAPI, any changes to the CAPI carry over to H5CPP – no rewrite is necessary.

If the CAPI calls are replaced with thread-safe and reentrant code, H5CPP guarantees both. Think of H5CPP as a set of context-sensitive templates that do the right thing with respect to the object being serialized and persisted (or the inverse). They basically generate an expert-level sequence of CAPI calls for the given context.

TLDR:

Reentrant vs Threadsafe

Reentry is what happens when a code block is interrupted at some point and then invoked again before the first invocation has finished. This reentry doesn’t necessarily have to be parallel in time, although the possibility is not excluded. If the code block modifies global mutable state, a reentry can leave that state undefined. The code block is said to be reentrant if the state stays well defined regardless of interruptions and subsequent reentries. Think of software/hardware interrupts and recursive calls as forms of reentry, and of the state as a global variable (see: the singleton pattern in GoF, static variables, memory references/pointers). However, automatic storage of standard-layout types within a method/function is reentrant, since each function call has its own copy on the current stack.

Thread safety is concerned with whether the state of a code block remains well defined when it is executed from different OS threads. This is distinct from reentrancy: storing data structures in thread-local storage makes code safe when run concurrently, but still unsafe for reentry, since reentry happens from the same thread.

Lifted from Wikipedia, here is the cross product of reentry and thread safety explained with examples:

not thread-safe, not reentrant

int tmp; 
// shared/global datastructure:  think of HDF5 skiplists, 
// B-link-tree (balanced sibling linked N-ary Tree) [...] 

void swap(int* x, int* y) { // mutating operation such as awrite, write, create, acreate,
// increment/decrement reference counting, basically any HDF5 CAPI call that changes the HDF5 internal state

    tmp = *x;
    *x = *y;
    /* Hardware interrupt might invoke isr() here. */
    *y = tmp;    
}
// interrupt service routine (ISR) used throughout the examples
void isr() {
    int x = 1, y = 2;
    swap(&x, &y);
}

thread-safe, not reentrant

thread_local int tmp; // dedicate storage per thread
void swap(int* x, int* y) {
    tmp = *x;
    *x = *y;
    *y = tmp; //Hardware interrupt might invoke isr() here.
}

not thread-safe, reentrant

int tmp; // plain global again: shared by all threads, hence not thread-safe
void swap(int* x, int* y) {
    int s = tmp; // Save global variable, must be atomic op 
    tmp = *x;
    *x = *y;
    *y = tmp;    // Hardware interrupt might invoke isr() here.
    tmp = s;     // Restore global variable, must be atomic op 
}

thread-safe, reentrant

void swap(int* x, int* y){
    int tmp;     // allocated on stack, different for each function call
    tmp = *x;
    *x = *y;
    *y = tmp;    // Hardware interrupt might invoke isr() here.
}

When H5CPP is used with a reentrant and thread-safe HDF5 library, both properties carry over, making H5CPP thread-safe and reentrant.

hope it helps:
steve

koziol · February 26, 2020, 5:54pm

Just to be clear - the HDF5 library can be configured to be threadsafe, but it is not concurrently accessible from multiple threads. It is reentrant in the sense that callbacks made from the HDF5 library into application code that then call back into HDF5 API routines (i.e. from the same thread as the original API routine) will be allowed back into the library.
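
The same-thread re-entry described above can be illustrated with a recursive mutex. The thread does not say how HDF5 implements it, so this is only a toy model with hypothetical names (`reentrant_toy`, `iterate`, `query`): an API routine holds the library-wide lock while invoking a user callback, and the callback calls back into the library from the same thread.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

namespace reentrant_toy {  // hypothetical library, for illustration only

std::recursive_mutex global_lock;  // can be re-acquired by the thread that holds it
int depth = 0;                     // current nesting level (guarded by the lock)
int max_depth = 0;                 // deepest nesting observed

// API routine that invokes a user callback while still holding the lock.
void iterate(int n, const std::function<void(int)>& callback) {
    std::lock_guard<std::recursive_mutex> guard(global_lock);
    ++depth;
    if (depth > max_depth) max_depth = depth;
    for (int i = 0; i < n; ++i) callback(i);
    --depth;
}

// Another API routine, which may be called from inside a callback.
int query(int i) {
    std::lock_guard<std::recursive_mutex> guard(global_lock);
    ++depth;
    if (depth > max_depth) max_depth = depth;
    --depth;
    return i * 2;
}

}  // namespace reentrant_toy

// The callback re-enters the library on the same thread without deadlocking.
int sum_via_callbacks(int n) {
    int sum = 0;
    reentrant_toy::iterate(n, [&](int i) { sum += reentrant_toy::query(i); });
    return sum;
}
```

With a plain `std::mutex` the nested `query` call would try to lock a mutex its own thread already holds, which is undefined behaviour and typically a deadlock. A second thread calling `query` at the same time would still wait until `iterate` returns, so this kind of reentrancy gives no concurrency.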

Quincey
