My HDF5 1.8.11 application runs on Linux on a multi-core machine (dual Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz). It is a single multi-threaded process in which, at times, two different threads read the same HDF5 file for different reasons. Both threads read partial datasets. Each thread does its own H5Fopen on the file, read-only and with DIRECT_IO, and holds its own returned file identifier. Both threads call H5Dread( dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, ptrForData ) with a typical data request size of ~32MB.
Problem: Occasionally the two threads get in sync and perform H5Dread() on exactly the same partial dataset. When this happens, the application errors out with the following HDF5 error stack:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: H5Dio.c line 182 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 539 in H5D__read(): can't initialize I/O info
    major: Dataset
    minor: Unable to initialize object
  #002: H5Dchunk.c line 827 in H5D__chunk_io_init(): unable to create file chunk selections
    major: Dataset
    minor: Unable to initialize object
  #003: H5Dchunk.c line 1301 in H5D__create_chunk_file_map_hyper(): can't insert chunk into skip list
    major: Dataspace
    minor: Unable to insert object
  #004: H5SL.c line 989 in H5SL_insert(): can't create new skip list node
    major: Skip Lists
    minor: Unable to insert object
  #005: H5SL.c line 669 in H5SL_insert_common(): can't insert duplicate key
    major: Skip Lists
    minor: Unable to insert object
Note that we have put a mutex around all HDF5 library calls to guarantee that no two threads are inside the library at the same time. Furthermore, our HDF5 wrapper class performs each set of related HDF5 calls as an atomic sequence. For example, to read a partial dataset, the whole sequence H5Dopen2, H5Dget_type, H5Dget_space, H5Sget_simple_extent_ndims, H5Sget_simple_extent_dims, H5Sselect_hyperslab, H5Screate_simple, H5Dread, ... is performed under a single acquisition of the mutex.
Our investigation suggests the error occurs only when the two threads happen to issue the exact same H5Dread request; because of our library mutex, the two requests are still serialized. The second identical H5Dread errors out with the error stack above. The failure is very difficult to reproduce, which suggests a small timing window.
Can anyone explain the error and/or suggest ways to prevent it?
What is an HDF5 'skip list'?
The H5SL.c H5SL_insert() header comment suggests:
"COMMENTS, BUGS, ASSUMPTIONS
Inserting an item with the same key as an existing object fails."
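To illustrate how I read that comment, here is a toy ordered container whose insert rejects a key that is already present. This is not the H5SL code: it is a plain sorted linked list rather than a skip list, and toy_list_t, toy_insert and toy_clear are names invented for this sketch. It only shows the behavior the comment describes. Whether something similar (a key set inserted a second time without the container being emptied in between) is what actually happens inside the library is a guess on my part.

```c
#include <stdlib.h>

/* Toy sorted singly linked list keyed by an integer (stand-in only). */
typedef struct toy_node {
    unsigned long key;
    struct toy_node *next;
} toy_node_t;

typedef struct {
    toy_node_t *head;
    size_t count;
} toy_list_t;

/* Insert `key` in sorted order.
 * Returns 0 on success, -1 if the key already exists or malloc fails. */
int toy_insert(toy_list_t *list, unsigned long key)
{
    toy_node_t **link = &list->head;
    toy_node_t *node;

    /* Walk to the first node whose key is not smaller than `key`. */
    while (*link != NULL && (*link)->key < key)
        link = &(*link)->next;

    /* Same key already present: refuse the insert. */
    if (*link != NULL && (*link)->key == key)
        return -1;

    node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->key = key;
    node->next = *link;
    *link = node;
    list->count++;
    return 0;
}

/* Free all nodes and leave the list empty. */
void toy_clear(toy_list_t *list)
{
    while (list->head != NULL) {
        toy_node_t *next = list->head->next;
        free(list->head);
        list->head = next;
    }
    list->count = 0;
}
```

With this container, inserting the same key twice fails on the second call, and the insert succeeds again only after toy_clear() has emptied the list.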
Is this duplicate-key failure a known bug?
Regards,
Mike C.