We’ve run into a strange, intermittent problem with some HDF5 files.
A large number of files are created while processing data. When interacting with those files much later, a small fraction of those files (< 10%) cannot be operated on by the current version of h5 tools (h5ls, h5stat, h5dump, etc.). Strangely, this only happens with the original files; a copy of the file works fine. Also, only the current version of the h5 tools, such as h5ls 1.10.6, has trouble; h5ls 1.8.12 (and related) works fine. The original file and its copy are identical in every way that I can find (md5sum, stat, ls -ls, ls -laZ), except for the inode, modification timestamp, and filename - but even the filename can be changed, and the v1.10 h5 tools still don’t work. Also, lslocks and fuser turn up no locks on the problematic file.
Some digging indicated that it’s probably an issue with an HDF5 lock on the file, implemented I think in HDF5 1.10.x. Disabling file locking with an environment variable, such as HDF5_USE_FILE_LOCKING=FALSE h5ls [badfile].h5, works for that command, but it doesn’t fix the files. I haven’t been able to figure out where the HDF5 lock is stored (in the superblock? I don’t know where that is stored or how to easily read it), but I did find that h5clear should be able to clear the lock (e.g., page 10 here: https://support.hdfgroup.org/HDF5/docNewFeatures/SWMR/Design-HDF5-FileLocking.pdf ). However, running h5clear -s [badfile].h5 returns only the terse message h5clear error: h5tools_fopen, and googling doesn’t turn up anything useful there. Running the problematic commands with the error stack enabled, such as h5stat --enable-error-stack [badfile].h5, returns lines like these:
"#000: H5F.c line 509 in H5Fopen(): unable to open file"
"#001: H5Fint.c line 1567 in H5F_open(): unable to lock the file"
"#002: H5FD.c line 1640 in H5FD_lock(): driver lock request failed"
"#003: H5FDsec2.c line 959 in H5FD_sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'"
which seem like they might be helpful to the right person, but that person is not me.
Does anyone have any insight as to what is going on here?
An ideal solution would be a 1-line command to identify locked files (beyond just trying to read a file and checking the exit code) and (more importantly) a 1-line command to remove a lock that should have already been removed. (But we’ll take any hints that someone might have.)
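For anyone who wants to probe a file directly rather than through the h5 tools, here is a minimal sketch in Python. It assumes the v1.10 library takes its lock with a non-blocking flock(2) call, which is consistent with the errno = 11 ("Resource temporarily unavailable", EAGAIN on Linux) in the stack above but is not confirmed by anything else in this post. The function name is_flocked is hypothetical. The sketch only reports whether a lock request would be refused; it does not clear anything, and it may behave differently on network filesystems.

```python
import errno
import fcntl
import os


def is_flocked(path, exclusive=False):
    """Return True if a non-blocking flock() on `path` would be refused.

    Assumption: the lock that blocks the h5 tools is an flock(2)-style
    advisory lock. A shared request (the default) mimics a read-only open;
    exclusive=True mimics a read-write open.
    """
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, mode | fcntl.LOCK_NB)
        except OSError as exc:
            # EAGAIN (errno 11 on Linux) means another holder has the lock.
            if exc.errno in (errno.EAGAIN, errno.EACCES):
                return True
            raise
        # We got the lock, so nobody else held a conflicting one; release it.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Calling is_flocked on each file in a directory would give a list of candidate files without invoking any h5 tool or setting HDF5_USE_FILE_LOCKING.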