A minor step towards thread concurrency


#1

I took a look at https://www.hdfgroup.org/2020/11/webinar-enabling-multithreading-concurrency-in-hdf5-community-discussion/

It got me looking at the lock implementation more closely.

Specifically, it made me think about releasing the lock during sec2 file I/O, the way the Python GIL is released by C extensions during file I/O. There’s a recursive lock in the library, right (using a counter where only transitions to 1 or 0 change the thread lock state)? So what about this algorithm:

suspend - another locking routine, with a corresponding restore; takes a suspend_context parameter
grabs the atomic_lock
asserts this thread is the owner
saves the lock_count to the suspend_context struct
sets lock_count to 0 and performs pthread_cond_signal on cond_var
unlocks the atomic_lock

restore - takes the suspend_context
grabs the atomic_lock, then waits on cond_var while lock_count is nonzero
sets lock_count back to the value saved in the suspend_context
unlocks the atomic_lock
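
To make the steps above concrete, here is a rough sketch that models them in Python with a `threading.Condition` standing in for the mutex plus cond_var. It is only a toy model of the idea, not the library's actual lock code. The class name `SuspendableRecursiveLock` is made up, and clearing and re-setting the owner in suspend/restore is my own assumption, since the steps above only mention lock_count.

```python
import threading


class SuspendableRecursiveLock:
    """Toy model of a counter-based recursive lock with suspend/restore."""

    def __init__(self):
        # One Condition plays the role of both the mutex and cond_var.
        self._cond = threading.Condition(threading.Lock())
        self._owner = None   # ident of the owning thread, or None
        self._count = 0      # recursion depth (lock_count)

    def acquire(self):
        me = threading.get_ident()
        with self._cond:
            if self._owner == me:
                self._count += 1          # re-entry: only bump the counter
                return
            while self._count != 0:       # wait until nobody holds the lock
                self._cond.wait()
            self._owner = me
            self._count = 1

    def release(self):
        with self._cond:
            assert self._owner == threading.get_ident()
            self._count -= 1
            if self._count == 0:          # only the 1 -> 0 transition unlocks
                self._owner = None
                self._cond.notify()

    def suspend(self):
        """Drop the lock completely and return the saved depth (the context)."""
        with self._cond:
            assert self._owner == threading.get_ident()
            saved = self._count
            self._count = 0
            self._owner = None            # assumption: owner is cleared too
            self._cond.notify()
            return saved

    def restore(self, saved):
        """Reacquire the lock and put the recursion depth back."""
        with self._cond:
            while self._count != 0:
                self._cond.wait()
            self._owner = threading.get_ident()
            self._count = saved
```

Intended use would be something like `ctx = lock.suspend()`, then the blocking I/O, then `lock.restore(ctx)`, with the caller still responsible for the matching `release()` calls afterwards.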

Anything horrifically broken?

I figure this could result in caches getting updated twice if a user touches the same chunks in parallel, but I also figure that if a user is using different files and different datasets, such risks are minimal or non-existent.
Does Quincey Koziol hang out here?


#2

Hi!

I guess I’m a little unclear on what you are proposing, being unfamiliar with the Python GIL. Are you suggesting the lock get (temporarily) dropped whenever we are performing file I/O?

I don’t think that would work for HDF5. The library is full of unprotected library state that could get corrupted by multiple threads concurrently updating it. Something like that may work for user-level I/O in Python (do they maintain a state-free exit path after I/O?), where the state of the interpreter or I/O library doesn’t depend on the bytes a user wants to read or write, but the HDF5 library has to maintain a coherent “file state” that is kept both in memory and on storage. Ergo, metadata operations are critical operations, and allowing interruptions while we are updating or inspecting file metadata is almost certainly a bad idea.

As examples: a thread suspended between lock release and I/O could later scribble stale metadata over changes that other threads had made in the meantime. Reading metadata could break if one thread read metadata from disk and was suspended before it could reacquire the lock, and another thread then independently read the same metadata and changed it. I’d have to look at the code to see if that situation would result in an error from trying to add an already existing metadata object to the metadata cache, or if stale metadata would be decoded and inserted. Neither would be great. Even if we were not modifying the metadata, as in a file opened read-only, the cache code would probably have to be careful about the “double insert” issue (which I believe you are alluding to in your suggestion).
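
As a toy illustration of that interleaving (not the real metadata cache code; `ToyMetadataCache`, the address, and the `nchunks` field are all made up, and the "reject duplicates" behavior is just one of the two possible outcomes mentioned above), the sequence can be replayed step by step without any actual threads:

```python
class DuplicateEntryError(Exception):
    pass


class ToyMetadataCache:
    """Hypothetical cache that refuses a second insert at the same address."""

    def __init__(self):
        self.entries = {}

    def insert(self, addr, value):
        if addr in self.entries:
            raise DuplicateEntryError(hex(addr))
        self.entries[addr] = value


disk = {0x100: {"nchunks": 4}}   # pretend on-disk metadata block
cache = ToyMetadataCache()

# Thread A: lock dropped for I/O, reads the block, then gets descheduled
# before it can take the lock back.
a_copy = dict(disk[0x100])

# Thread B: takes the lock, misses in the cache, reads the same block,
# inserts it, and then modifies the cached copy.
b_copy = dict(disk[0x100])
cache.insert(0x100, b_copy)
cache.entries[0x100]["nchunks"] = 5

# Thread A: finally reacquires the lock and tries to insert what it read.
try:
    cache.insert(0x100, a_copy)
    outcome = "stale copy inserted"
except DuplicateEntryError:
    outcome = "duplicate insert rejected"

print(outcome, "| A saw", a_copy["nchunks"], "| cache has", cache.entries[0x100]["nchunks"])
```

Either way, thread A is left holding a copy of the metadata that no longer matches what the cache has, which is the core of the concern.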

Perhaps we could do something like this for dataset data in files that have been opened read-only, but this would be complicated by variable-length data, which is stored in metadata structures but treated as raw data at the VFD level. Even if we ignore variable-length data, though, I’d be worried that any copied global state in the H5Dread/write path could have been invalidated by operations that took place in other threads, causing subsequent corruption or failures.

It may be that something like this is lower-hanging fruit that could be addressed early as part of a concerted effort to add true concurrency to the HDF5 library, though, even if it is not entirely workable, or is too risky, to bolt on ad hoc.


#3

Yes - on specific I/O calls for dread/dwrite we’d release the lock during the OS syscalls and reacquire it when they finish. Perhaps during filtering as well, but that’s one more step.
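
Roughly the pattern I have in mind, sketched in Python rather than in the library's C: a plain `threading.Lock` stands in for the global library lock, and `lock_released` and `read_raw` are made-up names, not anything that exists in HDF5. It uses `os.pread`, so it assumes a Unix-like platform.

```python
import os
import tempfile
import threading
from contextlib import contextmanager

library_lock = threading.Lock()   # stand-in for the global library lock


@contextmanager
def lock_released(lock):
    """Drop a held lock for the duration of the block, then take it back."""
    lock.release()
    try:
        yield
    finally:
        lock.acquire()


def read_raw(fd, nbytes, offset):
    """Raw data read; must be called with library_lock already held."""
    with lock_released(library_lock):
        # Only the syscall itself runs without the lock.
        return os.pread(fd, nbytes, offset)


with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    with library_lock:            # everything else still runs under the lock
        data = read_raw(f.fileno(), 4, 3)
```

Anything that touches shared state before or after the syscall still runs with the lock held; the open question is what that state looks like once the lock comes back.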

The biggest gains, and the path I’m concerned with, are mainly dread/dwrite. Assuming a dataset has already been created, its types committed, and all other I/O operations wait on the lock normally, what metadata is at risk when either every thread uses a different file or every thread uses a different dataset (both cases are of interest), and each thread is writing or reading the dataset’s data?

Vlens can also be excluded if they cause trouble in some way; they’re generally orthogonal to performance anyway.

I believe the usage is exclusively either all-at-once writes or reads of a dataset, or chunked writes, with the same dataset never being read and written at the same time. So metadata will always be written and read with the lock held; it could just be stale under certain conditions. An invalid configuration would be a dataset being appended to twice concurrently. I don’t know how this would work with multiple writers in a given file, though I don’t need that for this to be a good gain.

Is it easy to differentiate reads and writes for metadata? I know the dread and dwrite paths aren’t terribly hard to alter so that they provide differentiating information to the file driver / sec2; it may not even be necessary to cross over into the driver at all.

I may just try it one of these days and see how hard it crashes and burns, but I’m definitely keen on knowing what I don’t know about all the layers on top and the various caches, and about how sensitive they are to stale updates, if those even happen given the usage guarantees above.