Hi. I’m using HDF5-cpp to store results of a computation on a cluster (large matrices among other things).
Occasionally, the code crashes with the error included at the end of this message.
The key (probably) is that the crash only occurs when I submit the binary that performs the calculations (and eventually writes .h5 files with the results) through slurm, so that it runs on a compute node within the same cluster. The crash doesn’t always happen, which is why it has been difficult to debug. The same binary runs without crashing if I simply run it directly rather than through slurm.
Any help or hints would be greatly appreciated. Thanks!
Here is the detailed error message:
HDF5-DIAG: Error detected in HDF5 (1.14.1-2) thread 0:
#000: H5D.c line 480 in H5Dclose(): can't decrement count on dataset ID
major: Dataset
minor: Unable to decrement reference count
#001: H5Iint.c line 1263 in H5I_dec_app_ref_always_close(): can't decrement ID ref count
major: Object ID
minor: Unable to decrement reference count
#002: H5Iint.c line 1233 in H5I__dec_app_ref_always_close(): can't decrement ID ref count
major: Object ID
minor: Unable to decrement reference count
#003: H5Iint.c line 1107 in H5I__dec_app_ref(): can't decrement ID ref count
major: Object ID
minor: Unable to decrement reference count
#004: H5Dint.c line 297 in H5D__close_cb(): unable to close dataset
major: Dataset
minor: Close failed
#005: H5VLcallback.c line 2808 in H5VL_dataset_close(): dataset close failed
major: Virtual Object Layer
minor: Can't close object
#006: H5VLcallback.c line 2771 in H5VL__dataset_close(): dataset close failed
major: Virtual Object Layer
minor: Can't close object
#007: H5VLnative_dataset.c line 813 in H5VL__native_dataset_close(): can't close dataset
major: Dataset
minor: Unable to decrement reference count
#008: H5Dint.c line 1886 in H5D_close(): unable to flush cached dataset info
major: Dataset
minor: Write failed
#009: H5Dint.c line 3223 in H5D__flush_real(): unable to flush raw data
major: Dataset
minor: Unable to flush data from cache
#010: H5Dcontig.c line 1602 in H5D__contig_flush(): unable to flush sieve buffer
major: Dataset
minor: Unable to flush data from cache
#011: H5Dint.c line 3189 in H5D__flush_sieve_buf(): block write failed
major: Low-level I/O
minor: Write failed
#012: H5Fio.c line 190 in H5F_shared_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#013: H5PB.c line 1017 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#014: H5Faccum.c line 832 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#015: H5FDint.c line 306 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#016: H5FDsec2.c line 856 in H5FD__sec2_write(): file write failed: time = Mon Jul 3 22:35:54 2023
, filename = './dat__U2_V0p5_TP-0p2_Mu-0p8_KDIM16_FineMoms6400_RefdMoms256_FFShellCount1p5_BETA5_C4_T_START5_Square Hubbard_OMFL_VANHOVE_ALLSYMM/1.h5', file descriptor = 3, errno = 5, error message = 'Input/output error', buf = 0x92119438, total write size = 65536, bytes this sub-write = 65536, bytes actually written = 18446744073709551615, offset = 0
major: Low-level I/O
minor: Write failed
terminate called after throwing an instance of 'H5::DataSetIException'
/var/spool/slurmd/job49109/slurm_script: line 30: 1018 Aborted $PROG
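
One note on reading the last frame (#016): the odd-looking "bytes actually written = 18446744073709551615" appears to be just -1 printed as an unsigned 64-bit integer, which would mean the underlying POSIX write() call returned -1 and set errno to 5 (EIO, 'Input/output error'). Below is a minimal sketch that checks this reading without involving HDF5 at all. The helper names (as_logged, errno_text, check_log_reading) are made up for illustration and are not part of any library, and the assumption that the log prints the value as an unsigned 64-bit number is my interpretation of the message.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <string>
#include <sys/types.h>

// Hypothetical helper: reinterpret a write() return value as an unsigned
// 64-bit integer, which is how the number in the log seems to be printed.
std::uint64_t as_logged(ssize_t write_result) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(write_result));
}

// Hypothetical helper: human-readable text for an errno value.
// The exact wording depends on the C library in use.
std::string errno_text(int err) {
    return std::strerror(err);
}

// Checks the reading of frame #016: a failed write() returns -1, and -1
// reinterpreted as unsigned 64-bit equals the value shown in the log.
void check_log_reading() {
    assert(as_logged(-1) == 18446744073709551615ULL);
    // errno = 5 in the log; on Linux this is EIO.
    assert(EIO == 5);
}
```

Calling check_log_reading() should pass silently on Linux. If this reading is right, the error code comes from the write() system call on the compute node (errno is set by the operating system), and HDF5 is only reporting it up the stack.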