Broken HDF5 file cannot be opened


#1

Hi, I am currently working on a fuzzing project targeting HDF5 files used by deep learning models. While testing the fuzzer during development, I noticed that one of the generated files could not be opened.

I have tried various tools to figure out which group/attribute/dataset caused this issue, but unfortunately HDFView, the other h5tools, and h5py could not show it.

For HDFView, I received this error in the command prompt while opening the file:

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:///D:/HDFView/app/mods/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/D:/HDFView/app/extra/slf4j-simple-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/D:/HDFView/app/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.helpers.NOPLoggerFactory]
    Exception in thread "main" java.lang.InternalError: H5Gget_obj_info_full: retrieval of object info failed
      at jarhdf5@1.10.7/hdf.hdf5lib.H5.H5Gget_obj_info_full(Native Method)
      at jarhdf5@1.10.7/hdf.hdf5lib.H5.H5Gget_obj_info_full(H5.java:3605)
      at hdf.object.h5.H5File.depth_first(H5File.java:2433)
      at hdf.object.h5.H5File.depth_first(H5File.java:2495)
      at hdf.object.h5.H5File.loadIntoMemory(H5File.java:2374)
      at hdf.object.h5.H5File.open(H5File.java:2349)
      at hdf.object.h5.H5File.open(H5File.java:2220)
      at hdf.object.h5.H5File.open(H5File.java:1028)
      at hdf.view.TreeView.DefaultTreeView.initFile(DefaultTreeView.java:2463)
      at hdf.view.TreeView.DefaultTreeView.openFile(DefaultTreeView.java:2438)
      at hdf.view.HDFView.openLocalFile(HDFView.java:1820)
      at hdf.view.HDFView$26.widgetSelected(HDFView.java:983)
      at swt/org.eclipse.swt.widgets.TypedListener.handleEvent(TypedListener.java:252)
      at swt/org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:89)
      at swt/org.eclipse.swt.widgets.Display.sendEvent(Display.java:4213)
      at swt/org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1037)
      at swt/org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:4030)
      at swt/org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3630)
      at hdf.view.HDFView.runMainWindow(HDFView.java:385)
      at hdf.view.HDFView$39.run(HDFView.java:2604)
      at swt/org.eclipse.swt.widgets.Synchronizer.syncExec(Synchronizer.java:236)
      at swt/org.eclipse.swt.widgets.Display.syncExec(Display.java:4735)
      at hdf.view.HDFView.main(HDFView.java:2595)
    Failed to launch JVM

As for the h5tools, I tried h5dump and received this error stack:

  #000: ..\src\H5L.c line 1333 in H5Lvisit_by_name(): link visitation failed
    major: Links
    minor: Iteration failed
  #001: ..\src\H5Gint.c line 1177 in H5G_visit(): can't visit links
    major: Symbol table
    minor: Iteration failed
  #002: ..\src\H5Gobj.c line 698 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #003: ..\src\H5Gstab.c line 557 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #004: ..\src\H5B.c line 1211 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #005: ..\src\H5B.c line 1167 in H5B__iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #006: ..\src\H5Gnode.c line 1015 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #007: ..\src\H5Gobj.c line 698 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #008: ..\src\H5Gstab.c line 557 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #009: ..\src\H5B.c line 1211 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #010: ..\src\H5B.c line 1167 in H5B__iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #011: ..\src\H5Gnode.c line 995 in H5G__node_iterate(): unable to get symbol table node name
    major: Symbol table
    minor: Can't get value
  #012: ..\src\H5HL.c line 410 in H5HL_offset_into(): unable to offset into local heap data block
    major: Heap
    minor: Can't get value
h5dump error: internal error (file ..\tools\src\h5dump\h5dump.c:line 1579)
H5tools-DIAG: Error detected in HDF5:tools (1.10.6) thread 0:
  #000: ..\tools\lib\h5tools_utils.c line 805 in init_objs(): finding shared objects failed
    major: Failure in tools library
    minor: error in function
  #001: ..\tools\lib\h5trav.c line 1064 in h5trav_visit(): traverse failed
    major: Failure in tools library
    minor: error in function
  #002: ..\tools\lib\h5trav.c line 292 in traverse(): H5Lvisit_by_name failed
    major: Failure in tools library
    minor: error in function

I also tried using h5copy to copy the parent group of the broken group/attribute/dataset, but received an error as well:

Copying file <broken-model.h5> and object </model_weights> to file <broken-model-copy.h5> and object </model_weights_copy>

h5stat:

Filename: broken-model.h5
h5stat error: unable to traverse objects/links in file "broken-model.h5"

I have also tried to reproduce this error through further fuzzing runs, but so far I have been unable to generate another file with the same problem.

Does anyone know, or have any clue, why the file is broken and cannot be opened?

Thanks!


#2

Which version of HDF5 are you using and on which platform?

I’m a little fuzzy on what you are saying :wink: Is the HDF5 file created by the program that you are “fuzzing”? In other words, you are asking for trouble? How are you fuzzing the program? Are you overwriting (HDF5) library-internal data structures?

G.


#3

As for the versions:
h5py: 3.6.0
h5tools: 1.10.6
platform: Windows 10 Professional

Is the HDF5 file created by the program that you are “fuzzing”? How are you fuzzing the program? Are you overwriting (HDF5) library-internal data structures?

In terms of the file creation, the file itself was originally a saved deep learning model. I then created a fuzzer that appends random attributes, datasets, and groups (with random values inside them).

It should look something like this (screenshot of the fuzzed file's group tree not reproduced here).
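For concreteness, here is a minimal sketch of what such a fuzzer loop might look like with h5py. All names, value ranges, and the `fuzz_file`/`random_name` helpers are made up for illustration; this is not the actual fuzzer.

```python
import random
import string

def random_name(rng, max_len=16):
    """Random ASCII identifier of the kind the fuzzer might append."""
    alphabet = string.ascii_letters + string.digits + "_"
    return "".join(rng.choice(alphabet)
                   for _ in range(rng.randint(1, max_len)))

def fuzz_file(path, n_items=5, seed=0):
    """Append random groups, datasets, and attributes to an existing HDF5 file.

    Requires h5py (imported locally so random_name() works without it).
    """
    import h5py
    rng = random.Random(seed)
    with h5py.File(path, "a") as f:  # "a" = read/write, file must exist or is created
        for _ in range(n_items):
            grp = f.create_group(random_name(rng))
            grp.create_dataset(random_name(rng),
                               data=[rng.random() for _ in range(4)])
            grp.attrs[random_name(rng)] = rng.randint(0, 2**31 - 1)
```

Seeding the generator, as above, makes a failing run reproducible, which helps when trying to recreate a file that triggers the corruption.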

In other words, you are asking for trouble?

Yes, I am. That is the main goal of this project: to insert random values and see whether the file/program behaves differently or produces an error. In that sense, I guess you could say I did reach my goal; it's just that because the file is now inaccessible, I cannot see why or how I reached it (i.e., find which values broke the file, etc.).

Thanks!
Wellson


#4

Out of curiosity, are you just fuzzing attribute and link names, and attribute/dataset values, or are you overwriting random locations in memory?

G.


#5

I am not sure what you mean by overwriting random locations in memory, but as far as I know my fuzzer only creates groups, datasets, and attributes within the file (using h5py's create_group(), create_dataset(), and .attrs).

Wellson


#6

Are you setting the link and attribute name encoding to UTF-8? (I don’t remember if that’s the default in h5py.) I think the HDF5 library doesn’t check either way whether the byte string provided is a valid ASCII- or UTF-8-encoded string. With UTF-8 there are still plenty of byte sequences that don’t represent Unicode strings, but the odds are a little higher than with ASCII.
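The point about byte sequences can be illustrated in plain Python, with no HDF5 involved (the helper names and the sample byte string below are made up for illustration):

```python
# A byte string a fuzzer might emit as a link/attribute name.
# 0x80 is a UTF-8 continuation byte with no leading byte, so this is
# valid as raw bytes but well-formed neither as ASCII nor as UTF-8.
fuzzed_name = b"weights\x80\xfe"

def is_ascii(b: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(c < 0x80 for c in b)

def is_utf8(b: bytes) -> bool:
    """True if the bytes decode as well-formed UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_ascii(b"conv2d_1"))   # True  -- ordinary ASCII name
print(is_utf8(fuzzed_name))    # False -- not well-formed UTF-8
```

Any ASCII string is also valid UTF-8, but the reverse does not hold, which is why random byte strings are somewhat more likely to pass a UTF-8 check than an ASCII one.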

G.


#7

The fuzzer itself is supposed to generate ASCII, but when I supplied the characters to h5py, they were automatically encoded as UTF-8. Given that, I would expect the characters written to the HDF5 file to be UTF-8.

Wellson


#8

Interesting topic: I had to look up what fuzzing meant; are you using it in the same sense, or is your project more like generating random HDF5 files?


#9

Hi Steven,

Yes, that is the kind of fuzzing I am talking about, although there is a small difference: I am not generating HDF5 files from scratch, but rather taking valid HDF5 files used for deep learning models and inserting fuzzed inputs into them.

At first, the main goal of my project was to see if TensorFlow/Keras would behave incorrectly due to these fuzzed values being inserted, but as you can see from the discussion, the HDF5 file itself became corrupt instead.

Wellson


#10

That’s fine, but without explicitly specifying (H5Pset_char_encoding) that the supplied byte strings contain UTF-8 encoded material, the metadata in the file will be wrong and subsequent tools, based on the metadata in the file, will think they are dealing with ASCII rather than UTF-8 encoded link and attribute names (or values). I’m sure you’ve accomplished your mission, but it’d be even more convincing and helpful if you could show that you played by the rules. (In most languages or frameworks there is a concept of undefined behavior, and that’s a canonical source of mistakes, exploits, pranks, etc.)

Best, G.


#11

I see, that makes sense.

On this note, does this mean that you believe that the char encoding could be one of the main reasons as to why the file got corrupted? If yes, do you think there could be other possible reasons as well?

Wellson


#12

I don’t know. It’s a possibility and it’d be interesting to look at your generator. I think we also have to define what we mean by a ‘corrupted HDF5 file.’ Is a presumed HDF5 file corrupt because a tool can’t read it? What if the tool is buggy?

A ‘corrupted HDF5 file’ is a file whose binary layout does not conform to one of the versions of the HDF5 file format specification. Again, other definitions are possible, but more involved, and not necessarily equivalent. Chances are that the (de-)serializer portion of the HDF5 library is not bug free, and even if it were, there would be no guarantee that the HDF5 library would produce valid HDF5 files, even on average. The main reason is that the (de-)serializer is only a small fraction (< 10%) of the code, the rest implementing all kinds of use cases, and that’s where undefined behavior enters the picture.

Before the introduction of checksummed metadata (pre 2.0), the space for subtle forms of corruption was vast, but even with checksummed metadata there’s still plenty of opportunity. It’d be interesting to look into different categories of corruption for user- and file-level data and metadata. This would be a great topic for one or more profound MS & Ph.D. theses. (Talk to me, if you’d be interested!)

Fascinating stuff! G.


#13

Thanks! I may have missed it; is this project publicly accessible? Also, can you attach the file?
steve


#14

A ‘corrupted HDF5 file’ is a file whose binary layout does not conform to one of the versions of the HDF5 file format specification…

I see. This is precisely what I am currently trying to figure out. I will keep your words in mind as I investigate further.

This would be a great topic for one or more profound MS & Ph.D. theses. (Talk to me, if you’d be interested!)

Indeed, it is! Although I have to admit that I am still rather new to this, so I will have to pass on the offer for now. Once I have enough experience and knowledge to conduct a more structured project, I will contact you.

Thank you very much.

Wellson


#15

Currently, not yet. The project is still private as it is mostly in the development phase, but I will let you know once it becomes public.

Thank you for your interest!

Wellson


#16

I forgot to mention that there’s of course the possibility that the specification itself is inconsistent or ambiguous. I’ve tried a few times to find support for a more formal approach (formal specification + SerDe code auto-generation), but funding agencies weren’t interested. I haven’t given up and I like what you are doing.

Best, G.


#17

Unfortunate… I suggest you implement it in C or C++ on a POSIX OS (as opposed to Python, Java, …), use RAII, and make direct calls to the HDF5 C library. Don't forget to follow the C API contracts, and you may have to implement your own h5dump to verify your results.

Best is to pick up this thread here once your project is public.


#18

A tool for visualizing the binary structure of (corrupted) HDF5 files would be nice. There are plenty of ideas on how to do this, including a customized version of this one. G.
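Short of a full visualizer, a very first sanity check on a suspect file is to locate the superblock signature, which the HDF5 file format specification defines as the eight bytes `\x89HDF\r\n\x1a\n`, appearing at offset 0, 512, 1024, 2048, … (doubling) to allow for a user block. A minimal sketch (the function name is made up; this is not an official tool):

```python
import io

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # defined by the HDF5 file format spec

def find_superblock_offset(stream, max_offset=1 << 20):
    """Return the offset of the superblock signature, or None if absent.

    Per the spec, the signature sits at byte 0, 512, 1024, 2048, ...
    (doubling), so that a user block can precede the superblock.
    """
    offset = 0
    while offset <= max_offset:
        stream.seek(offset)
        if stream.read(8) == HDF5_SIGNATURE:
            return offset
        offset = 512 if offset == 0 else offset * 2
    return None

# Example with an in-memory "file" that has a 512-byte user block:
fake = io.BytesIO(b"\x00" * 512 + HDF5_SIGNATURE + b"\x00" * 64)
print(find_superblock_offset(fake))  # 512
```

A missing or misplaced signature proves corruption immediately; if the signature is intact, the damage lies deeper in the metadata (as in the B-tree/local-heap errors above) and needs structure-aware inspection.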


#19

I’ve tried a few times to find support for a more formal approach (formal specification + SerDe code auto-generation), but funding agencies weren’t interested. I haven’t given up and I like what you are doing.

That does seem very interesting, indeed. Please do let me know about the progress, I am quite curious about this project as well!

A tool for visualizing the binary structure of (corrupted) HDF5 files would be nice. There are plenty of ideas on how to do this, including a customized version of this one.

That seems great! I will try out the tool and see if that could help, thanks!

Wellson


#20

Unfortunate… I suggest you implement it in C or C++ on a POSIX OS (as opposed to Python, Java, …), use RAII, and make direct calls to the HDF5 C library. Don't forget to follow the C API contracts, and you may have to implement your own h5dump to verify your results.

Noted! Thanks for your suggestion!

Wellson