Recently, one of our customers created an HDF5 file that is about 283 GB, and while attempting to read it we get the following errors:
===
Filename: …/itinglu/wdb/input.emirtap.emir0.wdb
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
  #000: H5O.c line 778 in H5Oget_native_info_by_name(): can't get native file format info for object: '/'
    major: Object header
    minor: Can't get value
  #001: H5VLcallback.c line 5870 in H5VL_object_optional(): unable to execute object optional callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #002: H5VLcallback.c line 5833 in H5VL__object_optional(): unable to execute object optional callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #003: H5VLnative_object.c line 546 in H5VL__native_object_optional(): object not found
    major: Object header
    minor: Object not found
  #004: H5Gloc.c line 921 in H5G_loc_native_info(): can't find object
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 855 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #006: H5Gtraverse.c line 769 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #007: H5Gloc.c line 877 in H5G__loc_native_info_cb(): can't get object info
    major: Symbol table
    minor: Can't get value
  #008: H5Oint.c line 2378 in H5O_get_native_info(): can't retrieve object's btree & heap info
    major: Object header
    minor: Can't get value
  #009: H5Goh.c line 402 in H5O__group_bh_info(): can't retrieve symbol table size info
    major: Symbol table
    minor: Can't get value
  #010: H5Gstab.c line 671 in H5G__stab_bh_size(): iteration operator failed
    major: B-Tree node
    minor: Unable to initialize object
  #011: H5B.c line 1993 in H5B_get_info(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #012: H5B.c line 1943 in H5B__get_info_helper(): unable to list B-tree node
    major: B-Tree node
    minor: Unable to list node
  #013: H5B.c line 1900 in H5B__get_info_helper(): unable to load B-tree node
    major: B-Tree node
    minor: Unable to protect metadata
  #014: H5AC.c line 1312 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #015: H5C.c line 2346 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #016: H5C.c line 6598 in H5C_load_entry(): Can't read image
    major: Object cache
    minor: Read failed
  #017: H5Fio.c line 161 in H5F_block_read(): read through page buffer failed
    major: Low-level I/O
    minor: Read failed
  #018: H5PB.c line 736 in H5PB_read(): read through metadata accumulator failed
    major: Page Buffering
    minor: Read failed
  #019: H5Faccum.c line 212 in H5F__accum_read(): driver read request failed
    major: Low-level I/O
    minor: Read failed
  #020: H5FDint.c line 193 in H5FD_read(): addr overflow, addr = 109808716816, size = 544, eoa = 2048
    major: Invalid arguments to routine
    minor: Address overflowed
h5stat error: unable to traverse objects/links in file "…/itinglu/wdb/input.emirtap.emir0.wdb"
H5tools-DIAG: Error detected in HDF5:tools (1.12.0) thread 0:
  #000: h5trav.c line 1080 in h5trav_visit(): traverse failed
    major: Failure in tools library
    minor: error in function
  #001: h5trav.c line 296 in traverse(): H5Lvisit_by_name failed
    major: Failure in tools library
    minor: error in function
  #002: h5stat.c line 749 in obj_stats(): H5Oget_native_info_by_name failed
    major: Failure in tools library
    minor: error in function
===
I’ve tried the h5ls and h5dump utilities to debug further but haven’t been able to root-cause the issue. Any pointers on how to debug this problem? And how can this type of issue be avoided in the future?
The tool/program can’t locate the root group. This can happen if the producing application crashes and fails to close the file properly. To get an idea of what’s in the file, can you run:
strings -n 4 -t d your_file_name | grep -E 'BTHD|BTIN|BTLF|EADB|EAHD|EAIB|EASB|FADB|FAHD|FHDB|FHIB|FRHP|FSHD|FSSE|GCOL|HEAP|OCHK|OHDR|SMLI|SMTB|SNOD|TREE'
I’m running the command now, and it has already produced more than 400K lines of output. Please let me know if I can send just a few lines – the top 100–200, maybe?
This doesn’t look unusual. Maybe let’s take a step back. How did you obtain that error stack? Can you run tools such as h5dump or h5stat on the file? What do they report?
Here is the output from running h5dump:
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 0:
  #000: H5L.c line 1516 in H5Lvisit_by_name2(): link visitation failed
    major: Links
    minor: Iteration failed
  #001: H5VLcallback.c line 5173 in H5VL_link_specific(): unable to execute link specific callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #002: H5VLcallback.c line 5136 in H5VL__link_specific(): unable to execute link specific callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #003: H5VLnative_link.c line 364 in H5VL__native_link_specific(): link visitation failed
    major: Links
    minor: Iteration failed
  #004: H5Gint.c line 1118 in H5G_visit(): can't visit links
    major: Symbol table
    minor: Iteration failed
  #005: H5Gobj.c line 673 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #006: H5Gstab.c line 521 in H5G__stab_iterate(): unable to protect symbol table heap
    major: Symbol table
    minor: Protected metadata error
  #007: H5HL.c line 351 in H5HL_protect(): unable to load heap data block
    major: Heap
    minor: Unable to protect metadata
  #008: H5AC.c line 1426 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #009: H5C.c line 2370 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #010: H5C.c line 7209 in H5C__load_entry(): Can't read image
    major: Object cache
    minor: Read failed
  #011: H5Fio.c line 148 in H5F_block_read(): read through page buffer failed
    major: Low-level I/O
    minor: Read failed
  #012: H5PB.c line 721 in H5PB_read(): read through metadata accumulator failed
    major: Page Buffering
    minor: Read failed
  #013: H5Faccum.c line 208 in H5F__accum_read(): driver read request failed
    major: Low-level I/O
    minor: Read failed
  #014: H5FDint.c line 184 in H5FD_read(): addr overflow, addr = 57873339304, size = 5767168, eoa = 2048
    major: Invalid arguments to routine
    minor: Address overflowed
h5dump error: internal error (file h5dump.c:line 1471)
H5tools-DIAG: Error detected in HDF5:tools (1.12.2) thread 0:
  #000: h5tools_utils.c line 795 in init_objs(): finding shared objects failed
    major: Failure in tools library
    minor: error in function
  #001: h5trav.c line 1058 in h5trav_visit(): traverse failed
    major: Failure in tools library
    minor: error in function
  #002: h5trav.c line 290 in traverse(): H5Lvisit_by_name failed
    major: Failure in tools library
    minor: error in function
And here is the output from running h5stat:
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 0:
  #000: H5O.c line 769 in H5Oget_native_info_by_name(): can't get native file format info for object: '/'
    major: Object header
    minor: Can't get value
  #001: H5VLcallback.c line 5824 in H5VL_object_optional(): unable to execute object optional callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #002: H5VLcallback.c line 5788 in H5VL__object_optional(): unable to execute object optional callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #003: H5VLnative_object.c line 535 in H5VL__native_object_optional(): object not found
    major: Object header
    minor: Object not found
  #004: H5Gloc.c line 891 in H5G_loc_native_info(): can't find object
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 837 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #006: H5Gtraverse.c line 754 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #007: H5Gloc.c line 849 in H5G__loc_native_info_cb(): can't get object info
    major: Symbol table
    minor: Can't get value
  #008: H5Oint.c line 2323 in H5O_get_native_info(): can't retrieve object's btree & heap info
    major: Object header
    minor: Can't get value
  #009: H5Goh.c line 389 in H5O__group_bh_info(): can't retrieve symbol table size info
    major: Symbol table
    minor: Can't get value
  #010: H5Gstab.c line 649 in H5G__stab_bh_size(): iteration operator failed
    major: B-Tree node
    minor: Unable to initialize object
  #011: H5B.c line 1970 in H5B_get_info(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #012: H5B.c line 1921 in H5B__get_info_helper(): unable to list B-tree node
    major: B-Tree node
    minor: Unable to list node
  #013: H5B.c line 1878 in H5B__get_info_helper(): unable to load B-tree node
    major: B-Tree node
    minor: Unable to protect metadata
  #014: H5AC.c line 1426 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #015: H5C.c line 2370 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #016: H5C.c line 7209 in H5C__load_entry(): Can't read image
    major: Object cache
    minor: Read failed
  #017: H5Fio.c line 148 in H5F_block_read(): read through page buffer failed
    major: Low-level I/O
    minor: Read failed
  #018: H5PB.c line 721 in H5PB_read(): read through metadata accumulator failed
    major: Page Buffering
    minor: Read failed
  #019: H5Faccum.c line 202 in H5F__accum_read(): driver read request failed
    major: Low-level I/O
    minor: Read failed
  #020: H5FDint.c line 184 in H5FD_read(): addr overflow, addr = 109808716816, size = 544, eoa = 2048
    major: Invalid arguments to routine
    minor: Address overflowed
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 0:
  #000: H5L.c line 1516 in H5Lvisit_by_name2(): link visitation failed
    major: Links
    minor: Iteration failed
  #001: H5VLcallback.c line 5173 in H5VL_link_specific(): unable to execute link specific callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #002: H5VLcallback.c line 5136 in H5VL__link_specific(): unable to execute link specific callback
    major: Virtual Object Layer
    minor: Can't operate on object
  #003: H5VLnative_link.c line 364 in H5VL__native_link_specific(): link visitation failed
    major: Links
    minor: Iteration failed
  #004: H5Gint.c line 1118 in H5G_visit(): can't visit links
    major: Symbol table
    minor: Iteration failed
  #005: H5Gobj.c line 673 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #006: H5Gstab.c line 521 in H5G__stab_iterate(): unable to protect symbol table heap
    major: Symbol table
    minor: Protected metadata error
  #007: H5HL.c line 351 in H5HL_protect(): unable to load heap data block
    major: Heap
    minor: Unable to protect metadata
  #008: H5AC.c line 1426 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #009: H5C.c line 2370 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #010: H5C.c line 7209 in H5C__load_entry(): Can't read image
    major: Object cache
    minor: Read failed
  #011: H5Fio.c line 148 in H5F_block_read(): read through page buffer failed
    major: Low-level I/O
    minor: Read failed
  #012: H5PB.c line 721 in H5PB_read(): read through metadata accumulator failed
    major: Page Buffering
    minor: Read failed
  #013: H5Faccum.c line 208 in H5F__accum_read(): driver read request failed
    major: Low-level I/O
    minor: Read failed
  #014: H5FDint.c line 184 in H5FD_read(): addr overflow, addr = 57873339304, size = 5767168, eoa = 2048
    major: Invalid arguments to routine
    minor: Address overflowed
h5stat error: unable to traverse objects/links in file "…/itinglu/wdb/input.emirtap.emir0.wdb"
H5tools-DIAG: Error detected in HDF5:tools (1.12.2) thread 0:
  #000: h5trav.c line 1058 in h5trav_visit(): traverse failed
    major: Failure in tools library
    minor: error in function
  #001: h5trav.c line 290 in traverse(): H5Lvisit_by_name failed
    major: Failure in tools library
    minor: error in function
  #002: h5stat.c line 659 in obj_stats(): H5Oget_native_info_by_name failed
    major: Failure in tools library
    minor: error in function
In both cases, the library attempts to read from addresses beyond the end-of-allocation (EOA), which at 2048 is far too small to make sense for the file size you’ve quoted. Assuming that the file wasn’t closed properly, it’s likely that certain elements of the superblock were never updated. You can obtain a dump of the first 128 bytes with a command along these lines:
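A standard od invocation is one way to do it (xxd -l 128 your_file_name would work equally well):

od -A x -t x1z -N 128 your_file_name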
The file size is 6,208 bytes, or 0x1840, in this example. Looking at the file format specification, you can spot the End of File Address following the Address of File Free space Info, which is ffff ffff ffff ffff (i.e., undefined) in this example.
We were able to confirm that the writing application was indeed terminated by a SIGTERM signal.
I would like to know if you have any guidance on what a reasonable approach would be when the writing application is terminated in this manner.
Should the HDF5 file be removed, and some sort of warning logged so the end user is aware as to what happened?
I looked up the HDF5 documentation and it seems I could call H5Fflush(H5File::getId(), H5F_SCOPE_GLOBAL) to flush the in-memory buffers to disk. Is this recommended?
Both are sensible steps to take. How effective they are depends a lot on the specifics of the disruption. If it’s not I/O-related and the HDF5 library’s structures (in user space!) weren’t compromised, there’s a good chance that flushing (and closing!) will leave things in a sane state. If it is I/O-related, e.g., a full disk, a failed device, or a (temporarily) lost connection, the chances of exiting gracefully might be slim. The assumption should be that the HDF5 library has no logic for “taking evasive action.” If a call fails, it fails, and the error stack will have a record of that, but any retry logic or sanity assessment of the state is up to the application.
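As a rough sketch of that pattern (the names and structure below are illustrative, not taken from your application): have the SIGTERM handler do nothing but set a flag, and perform the flush/close from regular code, since HDF5 calls are not async-signal-safe.

#include <csignal>
#include "H5Cpp.h"

// Flag set by the signal handler; checked by the writer loop.
static volatile std::sig_atomic_t g_stop_requested = 0;

extern "C" void handle_sigterm(int) { g_stop_requested = 1; }

int main()
{
    std::signal(SIGTERM, handle_sigterm);

    H5::H5File file("output.h5", H5F_ACC_TRUNC);

    while (!g_stop_requested) {
        // ... create groups/datasets and write data ...

        // Optionally flush after each unit of work so that metadata
        // reaches the file between checkpoints.
        H5Fflush(file.getId(), H5F_SCOPE_GLOBAL);
    }

    // Normal shutdown path: closing the file updates the superblock
    // (EOA/EOF addresses) so that readers can traverse it later.
    file.close();
    return 0;
}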
Also, we were able to reproduce the abnormal termination of the writing application (which resulted in the corrupt HDF5 file). It is due to an assertion failure in the HDF5 library:
The dataset uses a compound datatype: POD structs with numeric (size_t, int, float, double), string, and boolean members. All our datasets have rank 1. No chunking (yet).
In this case, the writing application creates ~4M groups, each containing 2 sub-groups, and each of those contains about 20 sub-groups. The datasets described above reside at this level.
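For reference, a dataset of the kind being described might be created roughly like this; the struct members, names, and extent below are made up for illustration and are not the customer's actual code:

#include <cstdint>
#include "H5Cpp.h"

// Illustrative POD record; string members would typically be mapped with
// H5::StrType (e.g., variable-length strings) and booleans with an enum
// or small integer type.
struct Record {
    std::uint64_t index;   // stands in for the size_t member
    int           id;
    float         ratio;
    double        value;
};

int main()
{
    H5::H5File file("sketch.h5", H5F_ACC_TRUNC);

    // Compound type mirroring the in-memory struct layout.
    H5::CompType rec_type(sizeof(Record));
    rec_type.insertMember("index", HOFFSET(Record, index), H5::PredType::NATIVE_UINT64);
    rec_type.insertMember("id",    HOFFSET(Record, id),    H5::PredType::NATIVE_INT);
    rec_type.insertMember("ratio", HOFFSET(Record, ratio), H5::PredType::NATIVE_FLOAT);
    rec_type.insertMember("value", HOFFSET(Record, value), H5::PredType::NATIVE_DOUBLE);

    // Rank-1 dataspace, contiguous layout (no chunking), as described.
    hsize_t dims[1] = {1000};
    H5::DataSpace space(1, dims);

    H5::Group group = file.createGroup("/g0");
    H5::DataSet dset = group.createDataSet("records", rec_type, space);

    return 0;
}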
Can you reproduce the error at will? Can you provide us with a reproducer? Nothing you are describing sounds unusual. The comment on the definition is, admittedly, a little equivocal:
/* This sanity-checking constant was picked out of the air. Increase
* or decrease it if appropriate. Its purpose is to detect corrupt
* object sizes, so it probably doesn't matter if it is a bit big.
*/
#define H5C_MAX_ENTRY_SIZE ((size_t)(32 * 1024 * 1024))
It suggests that cache entries aren’t expected to be big (32 MiB is the ceiling), and nothing in your description comes anywhere near that. My hunch is that it has nothing to do with the H5File::createDataSet call, but that some corruption (“detect corrupt object sizes”) is occurring in your application or somewhere in the library.
For the given case, I’m able to reproduce the assertion failure in the HDF5 library consistently. I’m not sure I’ll have the bandwidth to create a standalone reproducer, but I will try to do so in the next week or so.
I know that the writer application does not have Valgrind issues like Invalid Read/Write errors. Out of curiosity, I re-ran Valgrind and noticed this:
===
==400120== Syscall param pwrite64(buf) points to uninitialised byte(s)
==400120== at 0x12799FC3: ??? (in /usr/lib64/libpthread-2.17.so)
==400120== by 0x432FAB7: H5FD_sec2_write (H5FDsec2.c:816)
==400120== by 0x43273C8: H5FD_write (H5FDint.c:248)
==400120== by 0x460D996: H5F__accum_write (H5Faccum.c:826)
==400120== by 0x4465781: H5PB_write (H5PB.c:1031)
==400120== by 0x4304040: H5F_block_write (H5Fio.c:251)
==400120== by 0x426A9BA: H5C__flush_single_entry (H5C.c:6109)
==400120== by 0x4272611: H5C__make_space_in_cache (H5C.c:6961)
==400120== by 0x42735A7: H5C_insert_entry (H5C.c:1458)
==400120== by 0x423B279: H5AC_insert_entry (H5AC.c:810)
==400120== by 0x43ED434: H5O__apply_ohdr (H5Oint.c:548)
==400120== by 0x43F40DA: H5O_create (H5Oint.c:316)
==400120== by 0x42A6D53: H5D__update_oh_info (H5Dint.c:1030)
==400120== by 0x42A9C64: H5D__create (H5Dint.c:1373)
==400120== by 0x46071A5: H5O__dset_create (H5Doh.c:300)
==400120== by 0x43F1FB9: H5O_obj_create (H5Oint.c:2521)
==400120== by 0x43AB717: H5L__link_cb (H5L.c:1850)
==400120== by 0x43651E9: H5G__traverse_real (H5Gtraverse.c:629)
==400120== by 0x4365F80: H5G_traverse (H5Gtraverse.c:854)
==400120== by 0x43A37ED: H5L__create_real (H5L.c:2044)
==400120== by 0x43AD96E: H5L_link_object (H5L.c:1803)
==400120== by 0x42A8E28: H5D__create_named (H5Dint.c:410)
==400120== by 0x45A9051: H5VL__native_dataset_create (H5VLnative_dataset.c:74)
==400120== by 0x458409F: H5VL__dataset_create (H5VLcallback.c:1834)
==400120== by 0x458E19C: H5VL_dataset_create (H5VLcallback.c:1868)
==400120== by 0x42991AC: H5Dcreate2 (H5D.c:150)
==400120== by 0x41FBB9C: H5::H5Location::createDataSet(char const*, H5::DataType const&, H5::DataSpace const&, H5::DSetCreatPropList const&, H5::DSetAccPropList const&, H5::LinkCreatPropList const&) const (H5Location.cpp:932)
==400120== by 0x41FBD78: H5::H5Location::createDataSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, H5::DataType const&, H5::DataSpace const&, H5::DSetCreatPropList const&, H5::DSetAccPropList const&, H5::LinkCreatPropList const&) const (H5Location.cpp:958)
…
// Rest of writer application stack
===
Is this something that should be addressed? If so, could you suggest how? Valgrind reports only one occurrence of this issue.
(The experts will correct me…) I think this is nothing to lose sleep over. When a new object (e.g., a dataset) is created, it’s linked into the group structure, and an object header is created. Furthermore, the metadata cache is updated to have things on hand when needed. If you dig into the code, you’ll find various structures whose array fields may be only partially initialized; I think that’s what Valgrind is calling out here.
(OK, we don’t have the state of the metadata cache in your application…)
We could try to reproduce the error by creating just the dataset you’re dealing with. What’s the type and shape of that dataset, and what are the creation properties?