HDF5 data corruption issues after a crash?


#1

Hi All,

Just being referred to this forum from the h5py google group https://groups.google.com/u/1/g/h5py/c/s_luehojYik/m/tMjCFfOtCQAJ.

I recently experienced HDF5 file corruption issues with some of my HDF5/h5py programs. Since other users have reported similar problems, I wrote some simple sequential and parallel benchmarks to test them, and it turned out that there are indeed file corruptions caused by HDF5 operations.

I wonder whether these are considered bugs and whether the HDF5 community expects to fix them. I did some studies on them and hope this could help developers deal with them.

The benchmarks in my study involve simple Python operations, such as
dataset creation: f.create_dataset('foo')
dataset removal: del f['foo']
dataset rename and dataset resize
on an HDF5 file with several existing groups and datasets.
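For reference, the operations above can be sketched in h5py roughly as follows (file and dataset names here are hypothetical, not the actual benchmark code):

```python
import h5py
import numpy as np

# Hypothetical sketch of the benchmark operations (not the actual benchmark).
with h5py.File("bench_example.h5", "w") as f:
    f.create_dataset("foo", data=np.arange(10))                  # dataset creation
    f.create_dataset("bar", data=np.zeros(5), maxshape=(None,))  # resizable dataset

with h5py.File("bench_example.h5", "a") as f:
    del f["foo"]            # dataset removal
    f.move("bar", "baz")    # dataset rename
    f["baz"].resize((8,))   # dataset resize (requires maxshape at creation)
```

A crash is emulated at arbitrary points between these operations, after which the file is inspected for consistency.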

I tried to emulate crashes in the middle of execution and then recover the file with h5clear; many of these runs show crash-consistency problems. For instance, some existing datasets (which were not modified by the benchmark) become inaccessible after a crash during the creation or deletion of another dataset. A renamed or resized dataset can also be left inaccessible (e.g., data cannot be read from the resized dataset: wrong B-tree signature).

Many of these problems remain even when SWMR mode is turned on.
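For context, turning on SWMR mode in h5py looks roughly like this (file and dataset names are hypothetical): the writer must open the file with libver="latest", enable swmr_mode after all objects are created, and flush after each write:

```python
import h5py

# Hypothetical SWMR writer sketch: SWMR requires libver="latest",
# and all objects must be created before swmr_mode is enabled.
with h5py.File("swmr_example.h5", "w", libver="latest") as f:
    dset = f.create_dataset("data", shape=(0,), maxshape=(None,), dtype="i8")
    f.swmr_mode = True        # from here on, concurrent readers are allowed
    for i in range(3):
        dset.resize((i + 1,))
        dset[i] = i
        dset.flush()          # make the new data visible to SWMR readers
```

Even with this pattern, a crash between resize() and flush() can leave the file in the corrupted states described above.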

I also tried to identify the root causes at the level of low-level file objects, e.g., a parent B-tree node not being updated together with its child node. I will not go into the details here, but feel free to contact me if you find this interesting and useful.

Sincerely,
Jinghan


#2

Hello,

Could you please try

h5clear --increment=1 filename

and see if the file is still corrupted. You may also try a different increment size, for example, 512.

The command above sets the EOA to the maximum of (EOA, EOF) + 1M, where EOA is the end-of-allocation address and EOF is the end of file.

Thank you!

Elena


#3

Hi Elena,

In my testing, I tried to recover the file with

      h5clear --increment example.h5
      h5clear -s -m example.h5 

after each file corruption, but the problems still exist.


#4

Hi Jinghan,

In general, it is very hard to recover data from a corrupted HDF5 file. This is due to the inconsistent state of the HDF5 metadata in the file when a program terminates abnormally.

The h5clear tool will only help with files created in SWMR mode. Are you saying that when you interrupt a SWMR writer, the file cannot be restored with h5clear?

Please send an email to the HDF Helpdesk (help@hdfgroup.org) with instructions on how to reproduce the problem, and we will take a look.

Thank you!

Elena