Dear experts…
I submitted a job to a PBS Professional-managed queue on our supercomputer. The job is a Monte-Carlo calculation parallelized with MPI and OpenMP, and the resulting data are stored using the HDF5 2.0.0 library in SWMR mode. There is no inter-process communication between the MPI PEs, so the sequential (non-parallel) HDF5 library is used and each PE writes its data to its own file.
Unfortunately, the job was killed unfinished because it did not complete within the queue's time limit.
Now none of the resulting files can be read with e.g. h5ls. Running h5dump --enable-error-stack=2 on one of the files, 20251217.h5.2, yields the following message:
```
HDF5-DIAG: Error detected in HDF5 (2.0.0) thread 1:
  #000: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5F.c line 821 in H5Fopen(): unable to synchronously open file
    major: File accessibility
    minor: Unable to open file
  #001: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5F.c line 782 in H5F__open_api_common(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #002: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5VLcallback.c line 3869 in H5VL_file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #003: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5VLcallback.c line 3718 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #004: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5VLnative_file.c line 128 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #005: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Fint.c line 2238 in H5F_open(): file is already open for write (may use <h5clear file> to clear file consistency flags)
    major: File accessibility
    minor: Unable to open file
h5dump error: unable to open file "20251217.h5.2"
```
The files are not empty, e.g.:

```
-rw-r--r--. 1 furutaka furutaka 12026722 Dec 20 15:29 20251217.h5.2
```
I ran h5clear -s on one of the files; after that, running h5ls on the file yields **NOT FOUND**.
Is there any way to recover these files?
Thanks in advance.
Kazuyoshi
What happens if you use h5dump --enable-error-stack=2 on the file after h5clear -s?
Thanks for the response. The output is as follows:
```
HDF5-DIAG: Error detected in HDF5 (2.0.0):
  #000: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5O.c line 1154 in H5Oget_info_by_name3(): can't synchronously retrieve object info
    major: Object header
    minor: Can't get value
  #001: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5O.c line 1129 in H5O__get_info_by_name_api_common(): can't get data model info for object
    major: Object header
    minor: Can't get value
  #002: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5VLcallback.c line 6088 in H5VL_object_get(): get failed
    major: Virtual Object Layer
    minor: Can't get value
  #003: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5VLcallback.c line 6056 in H5VL__object_get(): get failed
    major: Virtual Object Layer
    minor: Can't get value
  #004: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5VLnative_object.c line 269 in H5VL__native_object_get(): object not found
    major: Object header
    minor: Object not found
  #005: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Gloc.c line 755 in H5G_loc_info(): can't find object
    major: Symbol table
    minor: Object not found
  #006: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Gtraverse.c line 846 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #007: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Gtraverse.c line 766 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #008: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Gloc.c line 716 in H5G__loc_info_cb(): can't get object info
    major: Symbol table
    minor: Can't get value
  #009: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Oint.c line 2119 in H5O_get_info(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #010: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Oint.c line 1016 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #011: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5AC.c line 1303 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #012: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Centry.c line 3154 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #013: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Centry.c line 1227 in H5C__load_entry(): incorrect metadata checksum after all read attempts
    major: Object cache
    minor: Read failed
  #014: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Ocache.c line 185 in H5O__cache_get_final_load_size(): can't deserialize object header prefix
    major: Object header
    minor: Unable to decode value
  #015: /home/furutaka/work/HDF5/hdf5-2.0.0/src/H5Ocache.c line 1101 in H5O__prefix_deserialize(): bad object header version number
    major: Object header
    minor: Wrong version number
h5dump error: internal error (file /home/furutaka/work/HDF5/hdf5-2.0.0/tools/src/h5dump/h5dump.c:line 1548)
```
By the way, the beginning of the file left by the killed job looked like this before h5clear -s:
and then after h5clear -s:
For reference, here is one in the middle of a Monte-Carlo calculation (a different file, sorry):
and here is one from a job that finished properly (i.e. was not killed partway through):
As seen in the last image, I created two datasets at the root level: one named regions (a simple 1-D array) and one named T-eDepEv (a 2-D, extendable array). The former is simple and fixed, and is written before the Monte-Carlo calculation starts. The latter stores event-by-event data and is extended as the calculation proceeds.
It seems to me that there are no superblocks or object headers in the failed files (nor in the files captured mid-calculation), and therefore no way to salvage them.
But… when is this information actually written?
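To check quickly whether a superblock was ever flushed, one can look for the 8-byte HDF5 file signature at the start of the file. Below is a small, hypothetical checker written against plain stdio (it is not one of the HDF5 tools, and it only checks offset 0; files created with a user block may instead carry the signature at offset 512, 1024, 2048, …):

```c
#include <stdio.h>
#include <string.h>

/* The 8-byte signature that begins every HDF5 superblock:
 * 0x89 'H' 'D' 'F' \r \n 0x1a \n */
static const unsigned char HDF5_SIGNATURE[8] =
    {0x89, 'H', 'D', 'F', 0x0d, 0x0a, 0x1a, 0x0a};

/* Returns 1 if the file starts with the HDF5 signature, 0 if not,
 * and -1 if the file cannot be opened. */
int has_hdf5_signature(const char *path)
{
    unsigned char buf[8];
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    return n == sizeof buf && memcmp(buf, HDF5_SIGNATURE, sizeof buf) == 0;
}
```

If this returns 1 for the killed-job files, the superblock at least made it to disk and the damage is further inside the file.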
p.s. The information on this page seems a bit outdated (and the examples contain stray ">" characters).
That doesn't look like an issue with the superblock, but with an object header (possibly for a dataset). Normally, object headers shouldn't be modified once SWMR write mode is enabled. The way I would approach this is to first find the address of the problematic object header, then look at it in a hex editor to figure out what's wrong with it. From the stack, it looks like either the checksum or the object header version is incorrect, or both (it's possible the whole thing is zeroes or garbage). If it cannot be recovered, and it is the object header of the dataset you're looking for, you may be able to find the chunk index by searching for its signature in the file. By default (without adjusting version bounds), this will be a v1 b-tree with signature "TREE", and you can then use this tree (if it is the right one) to find the chunks holding your data.
To find the object header, you could use the low-level h5debug tool, a custom program that prints link info before trying to follow the link, or debugging/printf statements in the library. h5debug may also be useful if you are unable to fix the object header but you are able to find the b-tree.
You may then be able to repair the object header by creating a similar file separately and copying the bytes for the dataset object header to the original file, adjusting the index address and ohdr checksum as necessary. If you can exactly recreate the file (without raw data) the index address may be identical. In this case all you’ll need to do is get the address from the recreated file and copy the object header to the old file.
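Locating candidate signatures can be done with a naive byte scanner; here is a hypothetical sketch (again plain stdio, not an HDF5 tool). Note that a "TREE" match is only a candidate: the bytes following the signature (node type, level, sibling addresses, keys) must still be sanity-checked, e.g. with h5debug at that offset, since the four letters can also occur by chance in raw data:

```c
#include <stdio.h>

/* Scan a file for the 4-byte v1 b-tree signature "TREE".  Offsets of
 * matches are stored into `out` (up to `max` entries); the return value
 * is the total number of matches, or -1 if the file cannot be opened. */
long scan_for_tree(const char *path, long *out, long max)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    long nfound = 0, pos = 0;
    int win[4] = {-1, -1, -1, -1};   /* sliding 4-byte window */
    int c;
    while ((c = fgetc(f)) != EOF) {
        win[0] = win[1]; win[1] = win[2]; win[2] = win[3]; win[3] = c;
        pos++;
        if (win[0] == 'T' && win[1] == 'R' && win[2] == 'E' && win[3] == 'E') {
            if (nfound < max)
                out[nfound] = pos - 4;   /* offset of the leading 'T' */
            nfound++;
        }
    }
    fclose(f);
    return nfound;
}
```

Each reported offset can then be handed to h5debug (h5debug file.h5 OFFSET) to see whether it decodes as a plausible chunk b-tree node.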
Looking at the hex dumps, it looks like nothing but the superblock was actually flushed to the file. The only file operations that are currently allowed in SWMR mode are raw data writes and H5Dset_extent(). H5Fstart_swmr_write() should flush the file. Are you creating all the file metadata prior to calling H5Fstart_swmr_write()?
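For reference, a minimal sketch of the call ordering the SWMR workflow expects: all groups, datasets, and attributes are created first, H5Fstart_swmr_write() then flushes that metadata to the file, and only afterwards does the writer extend and fill the unlimited dataset. The dataset names, shapes, and chunk size below are guesses based on this thread, and error checking is omitted for brevity:

```c
#include "hdf5.h"

int main(void)
{
    /* SWMR requires the latest file-format version bounds. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("events.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* 1. Create ALL metadata first: the fixed "regions" dataset ... */
    hsize_t rdims[1] = {16};
    hid_t rspace  = H5Screate_simple(1, rdims, NULL);
    hid_t regions = H5Dcreate2(file, "regions", H5T_NATIVE_INT, rspace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* ... and the chunked, unlimited event-by-event dataset. */
    hsize_t edims[2] = {0, 4}, emax[2] = {H5S_UNLIMITED, 4}, chunk[2] = {1024, 4};
    hid_t espace = H5Screate_simple(2, edims, emax);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    hid_t events = H5Dcreate2(file, "T-eDepEv", H5T_NATIVE_DOUBLE, espace,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* 2. Enter SWMR mode: this flushes the superblock and object headers.
     *    After this call, no new objects may be created. */
    H5Fstart_swmr_write(file);

    /* 3. Event loop: only raw-data writes and H5Dset_extent() from here. */
    for (hsize_t ev = 0; ev < 100; ev++) {
        double row[4] = {0.0, 1.0, 2.0, 3.0};        /* dummy event record */
        hsize_t newdims[2] = {ev + 1, 4};
        H5Dset_extent(events, newdims);
        hid_t fspace = H5Dget_space(events);
        hsize_t start[2] = {ev, 0}, count[2] = {1, 4};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(2, count, NULL);
        H5Dwrite(events, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, row);
        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dflush(events);   /* periodically make the data durable */
    }

    H5Dclose(regions); H5Dclose(events);
    H5Sclose(rspace);  H5Sclose(espace);
    H5Pclose(dcpl);    H5Pclose(fapl);
    H5Fclose(file);
    return 0;
}
```

If metadata is instead created after H5Fstart_swmr_write() is skipped or misplaced, object headers may sit only in the metadata cache and never reach the file before the job is killed, which would match the hex dumps.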