Corrupted file due to shutdown


#1

Hello!

I have a problem with an HDF5 file created with h5py. I suspect that the process writing to the file was killed.

Problem
I cannot read the data in the file. Accessing objects by their path fails with errors such as:

  • Unable to open object (address of object past end of allocation)
  • Unable to get group info (addr overflow, addr = 44103764, size = 544, eoa = 2048)

Additional Info
I suspect the process writing to the file was killed (thus the file not properly closed).
System: Ubuntu 20.04
hdf5-utils info:
$ h5dump file.hd5
h5dump error: internal error (file …/…/…/…/…/tools/src/h5dump/h5dump.c:line 1493)
$ h5debug file.hd5
Reading signature at address 0 (rel)
File Super Block…
File name (as opened): file.hd5
File name (after resolving symlinks): file.hd5
File access flags 0x00000000
File open reference count: 1
Address of super block: 0 (abs)
Size of userblock: 0 bytes
Superblock version number: 0
Free list version number: 0
Root group symbol table entry version number: 0
Shared header version number: 0
Size of file offsets (haddr_t type): 8 bytes
Size of file lengths (hsize_t type): 8 bytes
Symbol table leaf node 1/2 rank: 4
Symbol table internal node 1/2 rank: 16
Indexed storage internal node 1/2 rank: 32
File status flags: 0x01
Superblock extension address: UNDEF (rel)
Shared object header message table address: UNDEF (rel)
Shared object header message version number: 0
Number of shared object header message indexes: 0
Address of driver information block: UNDEF (rel)
Root group symbol table entry:
Name offset into private heap: 0
Object header address: 96
Cache info type: Symbol Table
Cached entry information:
B-tree address: 136
Heap address: 680
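
The end-of-allocation address that the error message reports (eoa = 2048) is stored in the superblock itself, so it can be compared against the real file size without going through the HDF5 library. The following is a minimal sketch, assuming a version 0 superblock at offset 0 with 8-byte offsets and lengths, as in the h5debug output above. The function names are hypothetical, and the byte layout follows the published file format specification rather than any library API.

```python
import struct

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def parse_superblock_v0(buf):
    """Decode a version 0 superblock located at the start of buf."""
    if buf[:8] != HDF5_SIGNATURE:
        raise ValueError("HDF5 signature not found at offset 0")
    if buf[8] != 0:
        raise ValueError("this sketch only handles superblock version 0")
    if buf[13] != 8 or buf[14] != 8:
        raise ValueError("this sketch assumes 8-byte offsets and lengths")
    # Bytes 16..23: group leaf node K, group internal node K, status flags.
    leaf_k, internal_k, status_flags = struct.unpack_from("<HHI", buf, 16)
    # Bytes 24..55: base address, free-space address, EOA, driver info address.
    base_addr, _free_addr, eoa, _driver_addr = struct.unpack_from("<4Q", buf, 24)
    # Bytes 56..95: root group symbol table entry (40 bytes).
    name_offset, object_header_addr, cache_type = struct.unpack_from("<QQI", buf, 56)
    btree_addr, heap_addr = struct.unpack_from("<QQ", buf, 80)
    return {
        "leaf_k": leaf_k,
        "internal_k": internal_k,
        "status_flags": status_flags,
        "base_addr": base_addr,
        "eoa": eoa,
        "root_name_offset": name_offset,
        "root_object_header": object_header_addr,
        "root_cache_type": cache_type,
        "root_btree": btree_addr,
        "root_heap": heap_addr,
    }


def eoa_is_stale(superblock, file_size):
    """True if the recorded EOA ends before the actual end of the file."""
    return superblock["base_addr"] + superblock["eoa"] < file_size
```

Reading the first 96 bytes of the file and passing them to parse_superblock_v0 should reproduce the addresses listed above (96, 136, 680) if the layout assumption holds. A stale EOA together with status flags 0x01 would be consistent with a writer that never closed the file.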

Attempt
So I extended eoa with: h5clear --increment file.hd5

Now I get the following error (only this one).
Link iteration failed (bad local heap signature)

Reconstructing
Is it possible to reconstruct the data in the file?
I know exactly what the single datasets are. (size and type).
Is it possible to address the datasets by something other than their name, for example by their offset relative to their parent?

I would greatly appreciate any help. Thank you!


#2

What does

strings -t d file.hd5

show?

G.


#3

Thank you for the answer and sorry for the late reply (I was ill).

The command you suggested seems to be able to read the file. It does not report any errors and prints what appear to be file offsets and the strings found at them. It also seems to show the structure. I had to zip the output to be able to upload it (strings.zip (8.0 MB)).
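
The same kind of scan can be done from Python when the strings output is too large to browse. This is a minimal sketch that counts a few byte markers in the raw file: TREE, HEAP and SNOD are the signatures of version 1 B-tree nodes, local heaps and symbol table nodes, and deflate is the filter name stored with gzip-compressed datasets. The function names are hypothetical, and hits inside compressed data are possible, so the counts are only estimates.

```python
import re
from collections import Counter

MARKERS = (b"TREE", b"HEAP", b"SNOD", b"deflate")


def count_markers(data, markers=MARKERS):
    """Count non-overlapping occurrences of each marker in a bytes object."""
    pattern = re.compile(b"|".join(re.escape(m) for m in markers))
    return Counter(match.group(0) for match in pattern.finditer(data))


def marker_offsets(data, marker):
    """Return the decimal offsets of every occurrence of one marker."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets
```

For a file of this size, reading it fully with open(path, "rb").read() and passing the result to count_markers is feasible; the offsets from marker_offsets correspond to the decimal offsets that strings -t d prints.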

The structure of the file should be as follows:

thermal_evo_2/
├─ 1/
│ ├─ Float Array 32x32
├─ 2/
│ ├─ Float Array 32x32
├─ ...
├─ n/
│ ├─ Float Array 32x32
thermal_evo_1/
├─ 1/
│ ├─ Float Array 32x32
├─ 2/
│ ├─ Float Array 32x32
├─ ...
├─ n/
│ ├─ Float Array 32x32
thermal_cts_1/
├─ 1/
│ ├─ Float Array 15x36
├─ 2/
│ ├─ Float Array 15x36
├─ ...
├─ m/
│ ├─ Float Array 15x36

Christian


#4

Christian, glad to have you back. We hope you are doing better.

I took a quick peek, and there is maybe good news and not-so-good news. The underlying HDF5 file appears to be around 200 MB in size (correct? Otherwise there might be a lot of wasted space…). The two keywords that stand out and appear with some frequency are deflate (41,049) and TREE (41,619). Assuming your groups are not compressed, that suggests you are looking at at least 41,049 chunked, Gzip-compressed datasets and 570 groups (41,619 - 41,049). (Since the dataset dimensions are so small, 32x32 etc., that's a bit of overkill, but we are where we are.)

Apart from the dataset dimensions, do we know the size of those chunks? Maybe there is only one chunk per dataset? We can work without that information, because the size of a decompressed chunk must be divisible by the size of your floating-point numbers (4 bytes, 8 bytes, or something else?), and we presumably know the endianness.

To get at the actual data, we must:

  1. Traverse the B-trees corresponding to those chunked datasets (to understand where in the dataset a chunk belongs)
  2. Retrieve the compressed chunks located in the leaves of those trees
  3. Decompress them
  4. Recast them as floating-point arrays
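
For a single leaf node, the four steps above can be sketched as follows. This is a minimal sketch under several assumptions that are not confirmed in the thread: version 1 B-trees with 8-byte addresses, little-endian 8-byte floats, gzip as the only filter, and one chunk covering the whole dataset. The function names are hypothetical, and the node layout follows the file format specification.

```python
import struct
import zlib


def parse_chunk_btree_leaf(buf, node_addr, rank):
    """List (chunk_offsets, chunk_addr, stored_size, filter_mask) for one leaf."""
    sig, node_type, level, entries = struct.unpack_from("<4sBBH", buf, node_addr)
    if sig != b"TREE" or node_type != 1 or level != 0:
        raise ValueError("not a leaf node of a raw-data chunk B-tree")
    # Each key holds: chunk size (4), filter mask (4), rank + 1 offsets (8 each).
    key_size = 8 + 8 * (rank + 1)
    # Skip the 8-byte header and the two 8-byte sibling addresses.
    pos = node_addr + 24
    chunks = []
    for _ in range(entries):
        stored_size, filter_mask = struct.unpack_from("<II", buf, pos)
        offsets = struct.unpack_from("<%dQ" % (rank + 1), buf, pos + 8)
        (chunk_addr,) = struct.unpack_from("<Q", buf, pos + key_size)
        chunks.append((offsets[:rank], chunk_addr, stored_size, filter_mask))
        pos += key_size + 8  # advance past this key and its child pointer
    return chunks


def decode_chunk(buf, chunk_addr, stored_size, rows, cols, filter_mask=0):
    """Decompress one chunk and recast it as rows x cols float64 values."""
    raw = bytes(buf[chunk_addr:chunk_addr + stored_size])
    if not filter_mask & 1:  # bit 0 set means the first filter was skipped
        raw = zlib.decompress(raw)
    if len(raw) != rows * cols * 8:
        raise ValueError("decompressed size does not match rows x cols float64")
    values = struct.unpack("<%dd" % (rows * cols), raw)
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]
```

The size check in decode_chunk is the divisibility argument made above in concrete form: a chunk that does not decompress to exactly rows x cols x 8 bytes was stored with a different shape, element size, or filter pipeline. With 4-byte floats, the format code "d" would become "f" and the factor 8 would become 4.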

There are a few things we don’t know, and that we may not be able to recover.

  1. We may not be able to establish with certainty which dataset is which, i.e., where in the corrupted (?) group structure it was located originally.
  2. If there are other datasets such as compact or contiguous datasets, we’ll have to hunt them down separately.

I made it sound easy, but this is mostly blood, sweat, and tears (aka forensics). If the data is that important, you should either try it yourself or find someone who can do it for you.

Best, G.


#5

Thank you! That does sound like a lot of work. Unfortunately the data is quite important. Since you mentioned finding someone to do it: would it be possible to have someone from HDF Group do the forensics? I would guess that, since the expertise lies with you, this would be much faster. If yes, what would be the hourly rate?

For now, let me try to answer your questions and remarks.

  • Yes, the corrupted HDF5 file is about 220 MB.

  • I do not think that the groups are compressed. I only use compression on data set level. I use h5py to create the files.

    • To create the file:
      fd_hdf5 = h5py.File(file_path, 'w')
    • To append data to the file (where data is a 2d numpy float64 array with 32x32 (for evo groups) and 15x36 (for cts group)) :
      fd_hdf5.create_dataset(
      name=idx, data=data,
      shape=data.shape, dtype=data.dtype, maxshape=data.shape, chunks=True,
      compression="gzip", compression_opts=9
      )
    • Every data set also contains a float64 user attribute ts.
  • I think there should only be 3 groups. /thermal_evo_1, /thermal_evo_2, /thermal_cts_1 with a lot of data sets in each one. The structure is that way because every data set represents an image.

  • Do we know the chunk size… Well according to the h5py docs my setting performs auto-chunking. So I guess no. BUT I do have perfectly fine data files with the same structure but, obviously, different data. Might that help?

  • If it is not possible to establish which dataset is which, that does not matter, as long as the ts attribute of each dataset can be recovered.

  • How would I go about traversing the B-trees? Is there an example or documentation? Or is the HDF5 source code open source? I guess looking at h5dump would make things a lot easier.

Thanks again,
Christian


#6

Yes, the link names for those can be seen on the root group’s local heap at (decimal) offsets 720, 736, 752.
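
Link names in a local heap are NUL-terminated strings, so they can be read directly at those file offsets. The following is a minimal sketch, assuming a version 0 local heap with 8-byte lengths and addresses; the function names are hypothetical. The signature check is the same one that fails in the "bad local heap signature" error reported earlier.

```python
import struct


def parse_local_heap_header(buf, heap_addr):
    """Return (data_segment_size, data_segment_addr) of a local heap."""
    sig, _version = struct.unpack_from("<4sB", buf, heap_addr)
    if sig != b"HEAP":
        raise ValueError("bad local heap signature")
    # After 4 signature bytes, 1 version byte and 3 reserved bytes:
    # data segment size, offset of free-list head, data segment address.
    data_size, _free_head, data_addr = struct.unpack_from("<QQQ", buf, heap_addr + 8)
    return data_size, data_addr


def read_name(buf, file_offset):
    """Read a NUL-terminated ASCII name starting at an absolute file offset."""
    end = buf.index(b"\x00", file_offset)
    return bytes(buf[file_offset:end]).decode("ascii", errors="replace")
```

With the root heap at address 680 and a 32-byte header, the data segment would start at 712, which is consistent with names at 720, 736 and 752 (heap offsets 8, 24 and 40).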

No problem. If the data layout messages are intact, we can get the chunk size from there.

Good to know. What’s the datatype of these attributes? Is the dataspace scalar?

The file format specification can be found here. Yes, HDF5 is open source, always was and always will be, & the code can be found here. There are a few tools such as h5debug, h5check, and there is even a small tutorial on the file format.

Yes, and we’ll contact you.

Best, G.


#7

There is one attribute per data set named ts. The data type is a scalar float64.

Thank you for the pointers. So far I have been able to traverse the B-tree far enough to read the symbol table entries of the 3 main groups and their names.
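
For other readers, reading those entries can be sketched roughly as below. This is not the code used for the recovery; it is a minimal sketch assuming 8-byte offsets, with hypothetical function names, and with the 40-byte entry layout taken from the file format specification.

```python
import struct


def parse_symbol_table_node(buf, node_addr):
    """Return (name_offset, object_header_addr, cache_type) for each entry."""
    sig, _version, _reserved, count = struct.unpack_from("<4sBBH", buf, node_addr)
    if sig != b"SNOD":
        raise ValueError("symbol table node signature not found")
    entries = []
    pos = node_addr + 8
    for _ in range(count):
        # Link name offset (8), object header address (8), cache type (4),
        # then 4 reserved bytes and 16 bytes of scratch-pad space.
        name_offset, header_addr, cache_type = struct.unpack_from("<QQI", buf, pos)
        entries.append((name_offset, header_addr, cache_type))
        pos += 40
    return entries


def entry_names(buf, entries, heap_data_addr):
    """Resolve each entry's name offset against the local heap data segment."""
    names = []
    for name_offset, _header_addr, _cache_type in entries:
        start = heap_data_addr + name_offset
        end = buf.index(b"\x00", start)
        names.append(bytes(buf[start:end]).decode("ascii", errors="replace"))
    return names
```

The object header address of each entry is the starting point for locating that group's own B-tree and heap, or a dataset's layout, filter and attribute messages.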

Thank you.


#8

Thank you for all your help and all of the very useful pointers. I managed to recover the data in the file. I did not get all the names of the datasets, but with the ts attribute I was able to reconstruct the data.

Thank you and all the best,
Christian