I could use some help parsing HDF5 files, in particular with Data Object Headers

elvis.stansvik · May 14, 2018, 10:40am

Andrey_Paramonov https://forum.hdfgroup.org/u/andrey_paramonov
May 14

Hello Elvis!

13.05.2018 22:18, Elvis Stansvik wrote:

BTW when you’re done with your library it would be great if you posted a
link.

Side note: From time to time I’ve thought about writing a minimal C library
to read a subset of HDF5 myself. The reason being that we are doing
multi-threaded reading of multiple HDF5 files in a GUI app (to avoid
blocking the UI), but the fact that the HDF5 library handles thread safety
by simply taking a global lock (so is not thread efficient) is becoming an
annoyance, since large reads can block small ones for quite some time. Our
requirements are not big, we simply need to be able to read chunked
compressed datasets, and we’re only using the “old” pre-1.10 format. So I
don’t think a minimal thread safe/thread efficient C library to do just
that would be that big an effort. We have no need for writing files, and we
have no need for HPC features like MPI etc.

If anyone else reading this has already written such a library (I’m
thinking maybe for embedded applications?), please shout out!

Indeed, there are several alternative implementations of HDF5 data
interface:

Forum thread by Markus Krug:
http://hdf-forum.184993.n3.nabble.com/HDF-lib-incompatible-with-HDF-file-spec-td4029881.html

libmysofa (for embedded devices):
GitHub - hoene/libmysofa: Reader for AES SOFA files to get better HRTFs

pyfive (pure Python HDF5 reader):
GitHub - jjhelmus/pyfive: A pure Python HDF5 file reader

Thanks Andrey, I did not know about libmysofa. Will have a look.

Elvis

I collect these links because I believe alternative implementations of a

dwarfland · May 14, 2018, 4:35pm

If I got the binary data from the raw data chunk, and the only filter present is a “deflate”, should I be able to pass this data straight into a GZIP decompress API to get the final results? Because I’m trying to do that using .NET’s System.IO.Compression.GZipStream, and it says the data is corrupted…

Am I missing another step?

elvis.stansvik · May 14, 2018, 5:56pm

I suspect the deflate filter saves the raw deflate stream, not GZip format (which is deflate stream + gzip header/trailer), but check the HDF5 source to be sure.

Not familiar with .NET myself, but System.IO.Compression.GZipStream sounds like a class that deals with GZip format (including header/trailer). Perhaps System.IO.Compression.DeflateStream is the class to use?
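
To make the difference between the three framings concrete, here is a minimal sketch using Python's standard zlib module (Python is used purely for illustration, it is not what the original poster is working in). The wbits argument selects the container: negative for a bare deflate stream, 15 for a zlib stream (RFC 1950), and 31 for gzip (RFC 1952). The payload and the decodes helper are made up for this example, and the sketch does not claim to show what the HDF5 deflate filter writes; it only shows that a decoder has to match the framing of the data it is given.

```python
import zlib

payload = b"HDF5 chunk data " * 64  # made-up sample data


def compress(data: bytes, wbits: int) -> bytes:
    """Compress data using the container format selected by wbits."""
    c = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return c.compress(data) + c.flush()


raw_deflate = compress(payload, -15)  # bare deflate stream, no header or trailer
zlib_stream = compress(payload, 15)   # 2-byte zlib header + Adler-32 trailer
gzip_stream = compress(payload, 31)   # gzip header + CRC-32 and length trailer


def decodes(data: bytes, wbits: int) -> bool:
    """Return True if data decompresses cleanly with the given wbits."""
    try:
        return zlib.decompress(data, wbits=wbits) == payload
    except zlib.error:
        return False


# Each stream decodes with its own framing...
print(decodes(raw_deflate, -15), decodes(zlib_stream, 15), decodes(gzip_stream, 31))
# ...but a gzip decoder or a raw-deflate decoder rejects a zlib stream.
print(decodes(zlib_stream, 31), decodes(zlib_stream, -15))
```

In .NET terms, GZipStream corresponds to the gzip framing and DeflateStream to the bare deflate stream, so neither would accept a zlib-framed stream unchanged.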

elvis.stansvik · May 14, 2018, 6:08pm

Though I now found this SO answer suggesting DeflateStream is not in fact compatible with ZLIB’s deflate algorithm: https://stackoverflow.com/a/70658/252857 So it may be that you are out of luck

dwarfland · May 14, 2018, 9:21pm

yeah, i did try both classes before posting, same result. guess i need to write or port my own ;). this project keeps on giving…

dwarfland · May 15, 2018, 1:21pm

Tried a different library, same/similar error. I think I’m missing something else and this isn’t actually raw/valid GZIP data I’m looking at. Will investigate more when I have time…

elvis.stansvik · May 16, 2018, 5:50am

Maybe try printing the first few bytes of the chunk data in HDF5’s own H5Zdeflate.c and compare with what you have, to make sure you’re reading from the right spot?

Elvis

elvis.stansvik · May 16, 2018, 5:58am

BTW I was reading the spec a little, and I’m curious: When you traverse the v1 b-tree to find a chunk, what is the comparison used between chunks? Because to me I think that’s one of the things that looks a little under-specced. The spec mentions that chunks are ordered in the tree by their index into the dataset they belong to (that’s obvious), but it doesn’t mention if the comparison used first compares by the index in the slowest changing dimension, then the index in the next to slowest dimension and so on, or if it’s the other way around. Did you discover this yourself, or was it obvious to you after reading the spec?

In short, what definition of “less than” is used for chunks in the b-tree?

The place in the spec I’m talking about is https://support.hdfgroup.org/HDF5/doc/H5.format.html#V1Btrees , in the description of the Key field, and also in the text right below the field description table.
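
To illustrate what is being asked, here is a small sketch of one candidate ordering: lexicographic comparison of chunk offsets with dimension 0 (the slowest changing one) as the most significant. This ordering is an assumption, since the thread does not confirm it, and the names compare_chunk_offsets, find_child and left_keys are hypothetical. The 360x360 chunk shape is also invented for the example.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

ChunkOffset = Tuple[int, ...]


def compare_chunk_offsets(a: ChunkOffset, b: ChunkOffset) -> int:
    """Three-way compare, dimension 0 first (ASSUMED ordering, not confirmed)."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    return 0


def find_child(left_keys: Sequence[ChunkOffset], target: ChunkOffset) -> int:
    """Index of the last child whose left key is <= target, or -1 if none.

    Python tuples already compare lexicographically, so bisect applies
    the same ordering as compare_chunk_offsets.
    """
    return bisect_right(left_keys, target) - 1


# Hypothetical node: a 720x720 dataset split into 360x360 chunks.
left_keys = [(0, 0), (0, 360), (360, 0), (360, 360)]
print(find_child(left_keys, (360, 0)))
print(compare_chunk_offsets((0, 360), (360, 0)))
```

If the opposite convention were in use (fastest changing dimension first), (360, 0) would sort before (0, 360), and a reader built on the wrong assumption would descend into the wrong child.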

Elvis

elvis.stansvik · May 16, 2018, 6:01am

And also, when v1 btrees are used for keeping group children, I guess the children are ordered lexicographically in the tree? (but I can’t find this specified explicitly either).

Elvis

paramon · May 16, 2018, 7:01am

Hi Elvis!

16.05.2018 9:06, Elvis Stansvik wrote:

And also, when v1 btrees are used for keeping group children, I guess
the children are ordered lexicographically in the tree? (but I can’t
find this specified explicitly either).

I think it’s guided by “link creation order”:
https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_link_creation_order.htm
(and see also
https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_attr_creation_order.htm
)

Best wishes,
Andrey Paramonov

miller86 · May 16, 2018, 3:24pm

I think the documentation on the old H5Giterate and the newer H5Literate discusses this issue somewhat. But only insofar as the API itself is specified, not the actual internal storage implementation used by the library. And I think that is the best you can rely upon, as the lib internals are not part of the API specification.

elvis.stansvik · May 16, 2018, 5:19pm

Thanks, but that seems to just control whether a separate index is also created for tracking the creation order. I’m pretty sure the primary index is ordered lexicographically by link name? (to allow fast lookup of paths).

I just wondered whether this was specified anywhere in the spec. Even if it’s kind of obvious in the case of group links, I think the spec is incomplete without specifying by which criterion the indices are ordered.

I was more interested in the ordering criterion used for the chunk index, since I think that is less obvious. I can’t find that in the spec either.

Best regards,
Elvis

elvis.stansvik · May 16, 2018, 5:23pm

Thanks, but I was talking from the standpoint of a third party implementor of the spec, not as user of the HDF5 library. To be able to correctly implement a reader, one must know by which criterion the indices are ordered, and it seems this info is left out of the spec?

miller86 · May 16, 2018, 5:52pm

Oh, really? I hadn’t understood that was your aim. So, you mean to achieve a bytes-on-disk arrangement that matches what HDF5 lib expects without using HDF5 lib implementation? I guess I should have read the whole thread before commenting

dwarfland · May 16, 2018, 6:27pm

the data starts in my file at 0xD18, and looks like this:

789CECBDF777E2D89680AB1FDE5A6FBDB9333775
DF4E151D48069C730E6424A104220703CE95ABABA
AF39D3B71CDCCFFDC6F6F09BBB08D6DB08123C4
FEBACB0113E4A38FED7DD2D6EFBF13DD85EB04D
6074B102DE8C861929BB026DDD198C426D8D22B8
...

it’s the beginning of a block of data that looks “random” (preceded by lots of zeros and more structured “HDF5-looking” bytes), so I’m tempted to believe it’s the correct address. But maybe the raw chunk doesn’t contain the GZIP data right away, and has another preamble or header? I’m not sure what a GZIP compressed stream of data should look like, but it fails right on the first byte(s) it reads (with the bit-more-precise error “Message: Bad state (invalid stored block lengths)”, when using the open source gzip implementation).

elvis.stansvik · May 16, 2018, 6:28pm

Well, a hypothetical standpoint for my own part (so not a direct aim at this point). Though I have had thoughts about writing a simple reader for a subset of the format. Marc, the original poster, is the one actually working on an implementation. I’m just an interested bystander.

What I was thinking was reading, not writing, HDF5 files (and is what Marc is doing).

dwarfland · May 16, 2018, 6:28pm

in my case, the opposite — I don’t care about writing files, just reading them.

elvis.stansvik · May 16, 2018, 6:45pm

If I read RFC 1950 correctly, a ZLIB stream in “deflate” format (which I think is what H5Zdeflate.c filter uses, by using the compress2 function, someone correct me if I’m wrong) should start with the nibble 1000b = 0x8, while your data starts with 0111b = 0x7.
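
One way to double-check the nibble order is to decode the two header bytes mechanically. RFC 1950 puts the compression method (CM) in bits 0 to 3 of the first byte, i.e. the low nibble, and the window size field (CINFO) in bits 4 to 7. Under that reading the first byte 0x78 from the dump gives CM = 8 and CINFO = 7, and the two-byte check passes, so the bytes 78 9C would be consistent with a zlib header. The sketch below is Python for illustration only; parse_zlib_header is a hypothetical helper, and the second half uses a synthetic stream, not the poster's file.

```python
import zlib


def parse_zlib_header(first_two: bytes) -> dict:
    """Split the 2-byte zlib header (CMF, FLG) into its RFC 1950 fields."""
    cmf, flg = first_two[0], first_two[1]
    return {
        "cm": cmf & 0x0F,                           # low nibble, 8 means deflate
        "cinfo": cmf >> 4,                          # high nibble, log2(window) - 8
        "fcheck_ok": (cmf * 256 + flg) % 31 == 0,   # header checksum rule
        "fdict": bool(flg & 0x20),                  # preset dictionary flag
        "flevel": flg >> 6,                         # compression level hint
    }


header = parse_zlib_header(bytes.fromhex("789C"))  # first two bytes of the dump
print(header)

# Synthetic zlib stream, to see how a raw-deflate decoder reacts to it.
sample = zlib.compress(b"\x00" * 4096)
try:
    zlib.decompressobj(wbits=-15).decompress(sample)
    raw_error = None
except zlib.error as exc:
    raw_error = str(exc)
print(raw_error)

# Skipping the 2-byte header lets a raw-deflate decoder read the body;
# the 4-byte Adler-32 trailer is left over in unused_data.
inflater = zlib.decompressobj(wbits=-15)
recovered = inflater.decompress(sample[2:])
print(len(recovered), len(inflater.unused_data))
```

A raw-deflate decoder handed a stream that begins with 0x78 reads the low three bits as a non-final stored block and then fails the length check, which zlib typically reports as "invalid stored block lengths". If that is what happened here, a zlib-aware decoder, or a raw-deflate decoder started two bytes later, might be worth trying.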

dwarfland · May 16, 2018, 7:07pm

Ok. then what could I be missing, given this is the address that my DataLayout message points to? Am I getting the wrong address, or is it the right address, but the data is in a different format?

my DataLayoutObjectHeaderMessage is at 0x0BC0 and points to a tree at 0x0CA0

54524545 //sig
01000100 // flags & co
FFFFFFFF //sibling
FFFFFFFF // sibling
F85B0000 // size of chunk
00000000 // filtermask (i.e. none are skipped)
00000000 // dim1 (64 bits)
00000000 // dim1 (64 bits)
00000000 // dim2 (64 bits)
00000000 // dim2 (64 bits)
00000000 // dim extra (64 bits)
00000000 // dim extra (64 bits)
180D0000 // address => 0d18

looks correct to me.
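
As a cross-check of the hand decoding above, the fields can be unpacked mechanically. The sketch below (Python, for illustration) takes the hex groups exactly as listed in the dump and reads them as little-endian. The split of the first eight bytes into signature, node type, node level and entries used follows my reading of the v1 B-tree node layout in the format spec, so treat those field names as an assumption; le is a hypothetical helper.

```python
import struct

# First eight bytes of the node as listed: "54524545" then "01000100".
node_header = bytes.fromhex("54524545" "01000100")
signature, node_type, node_level, entries_used = struct.unpack("<4sBBH", node_header)
print(signature, node_type, node_level, entries_used)


def le(hex_text: str) -> int:
    """Interpret a hex string as a little-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_text), "little")


chunk_size = le("F85B0000")      # "size of chunk" field from the dump
chunk_address = le("180D0000")   # "address" field as listed in the dump
print(chunk_size, hex(chunk_address))
```

This gives the signature TREE, one entry at level 0, a stored chunk size of 23544 bytes and the address 0xD18, which matches the values read by hand.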

my messages are

<DataSpaceObjectHeaderMessage @B58: <2 Dimension(s), [720x720]>>
<DataTypeObjectHeaderMessage @B70: <FixedPoint Size 1 0/8>>
<DataStorageFillValueObjectHeaderMessage @B88: >
<DataStorageFilterPipelineObjectHeaderMessage @B98: Deflate>
<DataLayoutObjectHeaderMessage @BC0: 2 Dimension(s), [], Size 1>
<ObjectHeaderContinuationObjectHeaderMessage @BE0, >
<NilObjectHeaderMessage @BF0>

so the only relevant data transformation should be the Deflate filter, right?

—marc

elvis.stansvik · May 16, 2018, 7:32pm

Looks correct to me, so don’t know really

