Hi, I am currently working on a fuzzing project targeting HDF5 files for deep learning models. While testing the fuzzer during development, I noticed that one of the generated files could not be opened.
I have tried various tools to figure out which group/attribute/dataset caused the issue, but unfortunately neither HDFView, the other h5tools, nor h5py could show it.
For HDFView, I received this error in the command prompt while opening the file:
```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:///D:/HDFView/app/mods/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/HDFView/app/extra/slf4j-simple-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/HDFView/app/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.helpers.NOPLoggerFactory]
Exception in thread "main" java.lang.InternalError: H5Gget_obj_info_full: retrieval of object info failed
    at jarhdf5@1.10.7/hdf.hdf5lib.H5.H5Gget_obj_info_full(Native Method)
    at jarhdf5@1.10.7/hdf.hdf5lib.H5.H5Gget_obj_info_full(H5.java:3605)
    at hdf.object.h5.H5File.depth_first(H5File.java:2433)
    at hdf.object.h5.H5File.depth_first(H5File.java:2495)
    at hdf.object.h5.H5File.loadIntoMemory(H5File.java:2374)
    at hdf.object.h5.H5File.open(H5File.java:2349)
    at hdf.object.h5.H5File.open(H5File.java:2220)
    at hdf.object.h5.H5File.open(H5File.java:1028)
    at hdf.view.TreeView.DefaultTreeView.initFile(DefaultTreeView.java:2463)
    at hdf.view.TreeView.DefaultTreeView.openFile(DefaultTreeView.java:2438)
    at hdf.view.HDFView.openLocalFile(HDFView.java:1820)
    at hdf.view.HDFView$26.widgetSelected(HDFView.java:983)
    at swt/org.eclipse.swt.widgets.TypedListener.handleEvent(TypedListener.java:252)
    at swt/org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:89)
    at swt/org.eclipse.swt.widgets.Display.sendEvent(Display.java:4213)
    at swt/org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1037)
    at swt/org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:4030)
    at swt/org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3630)
    at hdf.view.HDFView.runMainWindow(HDFView.java:385)
    at hdf.view.HDFView$39.run(HDFView.java:2604)
    at swt/org.eclipse.swt.widgets.Synchronizer.syncExec(Synchronizer.java:236)
    at swt/org.eclipse.swt.widgets.Display.syncExec(Display.java:4735)
    at hdf.view.HDFView.main(HDFView.java:2595)
Failed to launch JVM
```
For the h5tools in particular, I tried h5dump and received this error stack:
```
#000: ..\src\H5L.c line 1333 in H5Lvisit_by_name(): link visitation failed
  major: Links
  minor: Iteration failed
#001: ..\src\H5Gint.c line 1177 in H5G_visit(): can't visit links
  major: Symbol table
  minor: Iteration failed
#002: ..\src\H5Gobj.c line 698 in H5G__obj_iterate(): can't iterate over symbol table
  major: Symbol table
  minor: Iteration failed
#003: ..\src\H5Gstab.c line 557 in H5G__stab_iterate(): iteration operator failed
  major: Symbol table
  minor: Can't move to next iterator location
#004: ..\src\H5B.c line 1211 in H5B_iterate(): B-tree iteration failed
  major: B-Tree node
  minor: Iteration failed
#005: ..\src\H5B.c line 1167 in H5B__iterate_helper(): B-tree iteration failed
  major: B-Tree node
  minor: Iteration failed
#006: ..\src\H5Gnode.c line 1015 in H5G__node_iterate(): iteration operator failed
  major: Symbol table
  minor: Can't move to next iterator location
#007: ..\src\H5Gobj.c line 698 in H5G__obj_iterate(): can't iterate over symbol table
  major: Symbol table
  minor: Iteration failed
#008: ..\src\H5Gstab.c line 557 in H5G__stab_iterate(): iteration operator failed
  major: Symbol table
  minor: Can't move to next iterator location
#009: ..\src\H5B.c line 1211 in H5B_iterate(): B-tree iteration failed
  major: B-Tree node
  minor: Iteration failed
#010: ..\src\H5B.c line 1167 in H5B__iterate_helper(): B-tree iteration failed
  major: B-Tree node
  minor: Iteration failed
#011: ..\src\H5Gnode.c line 995 in H5G__node_iterate(): unable to get symbol table node name
  major: Symbol table
  minor: Can't get value
#012: ..\src\H5HL.c line 410 in H5HL_offset_into(): unable to offset into local heap data block
  major: Heap
  minor: Can't get value
h5dump error: internal error (file ..\tools\src\h5dump\h5dump.c:line 1579)
H5tools-DIAG: Error detected in HDF5:tools (1.10.6) thread 0:
  #000: ..\tools\lib\h5tools_utils.c line 805 in init_objs(): finding shared objects failed
    major: Failure in tools library
    minor: error in function
  #001: ..\tools\lib\h5trav.c line 1064 in h5trav_visit(): traverse failed
    major: Failure in tools library
    minor: error in function
  #002: ..\tools\lib\h5trav.c line 292 in traverse(): H5Lvisit_by_name failed
    major: Failure in tools library
    minor: error in function
```
I also tried using h5copy to copy the parent group of the broken group/attribute/dataset, but received an error as well:

```
Copying file <broken-model.h5> and object </model_weights> to file <broken-model-copy.h5> and object </model_weights_copy>
```

h5stat:

```
Filename: broken-model.h5
h5stat error: unable to traverse objects/links in file "broken-model.h5"
```
I have also tried to reproduce this error through further fuzzing runs, but so far I have been unable to trigger it again.
Does anyone have any clue as to why the file is broken and can no longer be opened?
Which version of HDF5 are you using and on which platform?
I'm a little fuzzy on what you are saying. Is the HDF5 file created by the program that you are "fuzzing"? In other words, you are asking for trouble? How are you fuzzing the program? Are you overwriting (HDF5) library-internal data structures?
As for the versions:
h5py: 3.6.0
h5tools: 1.10.6
platform: Windows 10 Professional
Is the HDF5 file created by the program that you are "fuzzing"? How are you fuzzing the program? Are you overwriting (HDF5) library-internal data structures?
In terms of file creation, the file itself was originally created as a saved deep learning model. I then wrote a fuzzer which appends random attributes, datasets, and groups (with random values inside them) to it.
It should look something like this:
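(A minimal sketch; the helper names, shapes, and value ranges here are illustrative stand-ins, not the actual generator.)

```python
import random
import string

import h5py
import numpy as np

def random_name(length=8):
    # The real generator draws from a larger character pool; plain
    # ASCII letters keep this sketch readable.
    return "".join(random.choices(string.ascii_letters, k=length))

def fuzz_file(path):
    # Open an existing, valid model file and append random groups,
    # datasets, and attributes, as described above.
    with h5py.File(path, "a") as f:
        grp = f.create_group(random_name())
        grp.create_dataset(random_name(), data=np.random.rand(4, 4))
        grp.attrs[random_name()] = random.randint(0, 2**31 - 1)

fuzz_file("model.h5")  # hypothetical input file
```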
In other words, you are asking for trouble?
Yes, I am. That is the main goal of this project: to create random values and see whether the file/program behaves differently or produces an error. On that note, I guess you could say that I did reach my goal, it's just that because the file is now inaccessible, I cannot see why or how I reached it (i.e., find which values broke the file, etc.).
I am not sure what you mean by overwriting random locations in memory, but as far as I know, my fuzzer only creates groups, datasets, and attributes within the file (using h5py's create_group(), create_dataset(), and .attrs).
Are you setting the link and attribute name encoding to UTF-8? (I don't remember if that's the default in h5py.) I think the HDF5 library doesn't check either way whether the byte string provided is a valid ASCII- or UTF-8-encoded string. With UTF-8 there are still plenty of byte sequences that don't represent Unicode strings, but the odds are a little higher than with ASCII.
The original fuzzer is supposed to encode to ASCII, but when I supplied the characters to h5py, they got automatically encoded as UTF-8. Given that, I would expect the characters supplied to the HDF5 file to be UTF-8 encoded.
Yes, that is the kind of fuzzing I am talking about, although there is a small difference: I am not generating HDF5 files from scratch, but rather take valid HDF5 files of the kind used for deep learning models and insert fuzzed inputs into them.
At first, the main goal of my project was to see whether TensorFlow/Keras would behave incorrectly due to these fuzzed values being inserted, but as you can see from the discussion, the HDF5 file itself became corrupt instead.
That's fine, but without explicitly specifying (H5Pset_char_encoding) that the supplied byte strings contain UTF-8 encoded material, the metadata in the file will be wrong, and subsequent tools, based on the metadata in the file, will think they are dealing with ASCII rather than UTF-8 encoded link and attribute names (or values). I'm sure you've accomplished your mission, but it'd be even more convincing and helpful if you could show that you played by the rules. (In most languages or frameworks there is a concept of undefined behavior, and that's a canonical source of mistakes, exploits, pranks, etc.)
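For reference, a minimal sketch of what that explicit declaration might look like through h5py's low-level wrappers (this assumes h5py's h5p.PropLCID.set_char_encoding, the analogue of the C call H5Pset_char_encoding; the file and group names are illustrative):

```python
import h5py
from h5py import h5g, h5p, h5t

with h5py.File("utf8-names.h5", "w") as f:
    # Build a link-creation property list and declare that the link
    # name bytes are UTF-8 encoded, rather than relying on defaults.
    lcpl = h5p.create(h5p.LINK_CREATE)
    lcpl.set_char_encoding(h5t.CSET_UTF8)
    h5g.create(f.id, "grüße".encode("utf-8"), lcpl=lcpl)
```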
On this note, does this mean that you believe the character encoding could be one of the main reasons why the file got corrupted? If so, do you think there could be other possible reasons as well?
I don't know. It's a possibility, and it'd be interesting to look at your generator. I think we also have to define what we mean by a "corrupted HDF5 file." Is a presumed HDF5 file corrupt because a tool can't read it? What if the tool is buggy?
A "corrupted HDF5 file" is a file whose binary layout does not conform to one of the versions of the HDF5 file format specification. Again, other definitions are possible, but more involved, and not necessarily equivalent. Chances are that the (de-)serializer portion of the HDF5 library is not bug-free, and even if it were, there would be no guarantee that the HDF5 library would produce valid HDF5 files, even on average. The main reason is that the (de-)serializer is only a small fraction (< 10%) of the code, the rest implementing all kinds of use cases, and that's where undefined behavior enters the picture.
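To make "conform" concrete at the most basic level, here is one trivial necessary (but nowhere near sufficient) condition that can be checked directly; the function name is mine, while the signature bytes and user-block offsets come from the format specification:

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte format signature

def has_hdf5_signature(path):
    # The signature must appear at offset 0, or at offset 512, 1024,
    # 2048, ... when the file carries a user block.
    with open(path, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            block = f.read(8)
            if len(block) < 8:
                return False  # ran past end of file: no signature
            if block == HDF5_SIGNATURE:
                return True
            offset = 512 if offset == 0 else offset * 2
```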
Before the introduction of checksummed metadata (pre 2.0), the space for subtle forms of corruption was vast, but even with checksummed metadata there's still plenty of opportunity. It'd be interesting to look into different categories of corruption for user- and file-level data and metadata. This would be a great topic for one or more profound MS & Ph.D. theses. (Talk to me if you'd be interested!)
A "corrupted HDF5 file" is a file whose binary layout does not conform to one of the versions of the HDF5 file format specification...
I see. This is precisely what I am currently trying to figure out. I will keep your words in mind as I investigate further.
This would be a great topic for one or more profound MS & Ph.D. theses. (Talk to me if you'd be interested!)
Indeed, it is! Although I have to admit that I am still rather new to this, so I will have to pass on the offer for now. Once I have enough experience and knowledge to conduct a more structured project, I shall contact you.
I forgot to mention that there's of course the possibility that the specification itself is inconsistent or ambiguous. I've tried a few times to find support for a more formal approach (formal specification + SerDe code auto-generation), but funding agencies weren't interested. I haven't given up, and I like what you are doing.
Unfortunate... I suggest you implement it in C or C++ on a POSIX OS (as opposed to Python, Java, ...), use RAII, and make direct calls to the HDF5 C library. Don't forget to follow the C-API contracts, and you may have to implement your own h5dump to verify your results.
Best is to pick up this thread here once your project is public.
A tool for visualizing the binary structure of (corrupted) HDF5 files would be nice. There are plenty of ideas on how to do this, including a customized version of this one.
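As a trivial starting point while no such tool exists, even a plain hex/ASCII view of the first bytes goes a long way; a minimal sketch (this is my own throwaway helper, not the tool referred to above):

```python
def hexdump(path, length=64, offset=0):
    # Print a small hex/ASCII view of a file region -- enough to
    # eyeball the superblock signature and version fields.
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    for i in range(0, len(data), 16):
        chunk = data[i:i + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print(f"{offset + i:08x}  {hexpart:<47}  {text}")

hexdump("broken-model.h5")  # hypothetical corrupted file from above
```

G.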
I've tried a few times to find support for a more formal approach (formal specification + SerDe code auto-generation), but funding agencies weren't interested. I haven't given up, and I like what you are doing.
That does seem very interesting, indeed. Please do let me know about the progress; I am quite curious about this project as well!
A tool for visualizing the binary structure of (corrupted) HDF5 files would be nice. There are plenty of ideas on how to do this, including a customized version of this one.
That seems great! I will try out the tool and see if that could help, thanks!