I could use some help parsing HDF5 files, in particular with Data Object Headers

I suspect the deflate filter saves the raw deflate stream, not GZip format (which is deflate stream + gzip header/trailer), but check the HDF5 source to be sure.

Not familiar with .NET myself, but System.IO.Compression.GZipStream sounds like a class that deals with GZip format (including header/trailer). Perhaps System.IO.Compression.DeflateStream is the class to use?
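
If the filter output should turn out to be a zlib stream (a 2-byte header plus raw deflate plus an Adler-32 trailer) rather than a bare deflate stream, maybe skipping those two header bytes and handing the rest to DeflateStream would work? A minimal sketch of what I mean, with a made-up helper name:

```csharp
using System.IO;
using System.IO.Compression;

static class ChunkInflater
{
    // Hypothetical helper: decompress one filtered chunk, assuming the chunk
    // bytes are a zlib stream (2-byte zlib header + raw deflate + Adler-32).
    // DeflateStream only understands the raw deflate part, so the header is
    // skipped and the 4-byte trailer is simply never read.
    public static byte[] Inflate(byte[] chunk)
    {
        using var input = new MemoryStream(chunk, 2, chunk.Length - 2);
        using var deflate = new DeflateStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        deflate.CopyTo(output);
        return output.ToArray();
    }
}
```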

Though I’ve now found this SO answer suggesting DeflateStream is in fact not compatible with ZLIB’s deflate output: https://stackoverflow.com/a/70658/252857. So it may be that you are out of luck :frowning:

Yeah, I did try both classes before posting, same result. Guess I need to write or port my own ;). This project keeps on giving… :crazy_face:

Tried a different library, same/similar error. I think I’m missing something else and this isn’t actually raw/valid GZIP data I’m looking at. Will investigate more when I have time…

Maybe try printing the first few bytes of the chunk data in HDF5’s own H5Zdeflate.c and comparing them with what you have, to make sure you’re reading from the right spot?
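
On the reader side, something as simple as this would do for the comparison (a hypothetical helper, nothing HDF5-specific):

```csharp
using System;

static class HexDump
{
    // Hypothetical debugging helper: print the first few bytes of a buffer
    // as hex, to compare against what H5Zdeflate.c sees on the C side.
    public static void Print(byte[] data, int count = 16)
    {
        count = Math.Min(count, data.Length);
        Console.WriteLine(BitConverter.ToString(data, 0, count).Replace("-", ""));
    }
}
```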

Elvis

BTW, I was reading the spec a little, and I’m curious: when you traverse the v1 B-tree to find a chunk, what comparison is used between chunks? To me that seems like one of the things that are a little under-specced. The spec mentions that chunks are ordered in the tree by their index into the dataset they belong to (that’s obvious), but it doesn’t say whether the comparison first compares the index in the slowest-changing dimension, then the index in the next-to-slowest dimension and so on, or if it’s the other way around. Did you discover this yourself, or was it obvious to you after reading the spec?

In short, what definition of “less than” is used for chunks in the b-tree?

The place in the spec I’m talking about is https://support.hdfgroup.org/HDF5/doc/H5.format.html#V1Btrees, in the description of the Key field, and also in the text right below the field description table.
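
To make the question concrete, the kind of comparison I have in mind looks roughly like the sketch below; the name is made up, and whether element 0 of the offsets is the slowest- or the fastest-changing dimension is exactly what I’m asking:

```csharp
static class ChunkKeyOrder
{
    // Sketch: compare two chunk keys by their offsets, element by element,
    // in whatever order the offsets are stored in the key. Which dimension
    // comes first in that order is the open question above.
    public static int Compare(ulong[] offsetsA, ulong[] offsetsB)
    {
        for (int i = 0; i < offsetsA.Length; i++)
        {
            if (offsetsA[i] < offsetsB[i]) return -1;
            if (offsetsA[i] > offsetsB[i]) return 1;
        }
        return 0; // all offsets equal: same chunk
    }
}
```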

Elvis

And also, when v1 btrees are used for keeping group children, I guess the children are ordered lexicographically in the tree? (but I can’t find this specified explicitly either).

Elvis

Hi Elvis!

On 16.05.2018 9:06, Elvis Stansvik wrote:

And also, when v1 btrees are used for keeping group children, I guess
the children are ordered lexicographically in the tree? (but I can’t
find this specified explicitly either).

I think it’s guided by “link creation order”:
https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_link_creation_order.htm
(and see also https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_attr_creation_order.htm)

Best wishes,
Andrey Paramonov

I think the documentation on the old H5Giterate and the newer H5Literate discusses this issue somewhat, but only insofar as the API itself is specified, not the actual internal storage implementation used by the library. And I think that is the best you can rely upon, as the library internals are not part of the API specification.

Thanks, but that seems to just control whether a separate index is also created for tracking the creation order. I’m pretty sure the primary index is ordered lexicographically by link name? (to allow fast lookup of paths).
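
For concreteness, the ordering I’d expect is a plain byte-wise comparison of the stored link names, roughly like this sketch (purely my assumption, not something I’ve found in the spec):

```csharp
using System;

static class LinkNameOrder
{
    // Assumed ordering: unsigned byte-wise comparison of the raw link-name
    // bytes (the order a C strcmp on the heap strings would give).
    public static int Compare(byte[] nameA, byte[] nameB)
    {
        int n = Math.Min(nameA.Length, nameB.Length);
        for (int i = 0; i < n; i++)
            if (nameA[i] != nameB[i])
                return nameA[i].CompareTo(nameB[i]);
        return nameA.Length.CompareTo(nameB.Length);
    }
}
```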

I just wondered whether this was specified anywhere in the spec. Even if it’s kind of obvious in the case of group links, I think the spec is incomplete without specifying by which criterion the indices are ordered.

I was more interested in the ordering criterion used for the chunk index, since I think that is less obvious. I can’t find that in the spec either.

Best regards,
Elvis

Thanks, but I was talking from the standpoint of a third-party implementor of the spec, not as a user of the HDF5 library. To be able to correctly implement a reader, one must know by which criterion the indices are ordered, and it seems this info is left out of the spec?

Oh, really? I hadn’t understood that was your aim. So you mean to achieve a bytes-on-disk arrangement that matches what the HDF5 lib expects, without using the HDF5 lib implementation? I guess I should have read the whole thread before commenting :wink:

The data starts in my file at 0xD18, and looks like this:

789CECBDF777E2D89680AB1FDE5A6FBDB9333775
DF4E151D48069C730E6424A104220703CE95ABABA
AF39D3B71CDCCFFDC6F6F09BBB08D6DB08123C4
FEBACB0113E4A38FED7DD2D6EFBF13DD85EB04D
6074B102DE8C861929BB026DDD198C426D8D22B8
...

It’s the beginning of a block of data that looks “random” (preceded by lots of zeros and more structured, “HDF5-looking” bytes), so I’m tempted to believe it’s the correct address. But maybe the raw chunk doesn’t contain the GZIP data right away and has another preamble or header? I’m not sure what a GZIP-compressed stream of data should look like, but it fails right on the first byte(s) it reads (with the slightly more precise error “Message: Bad state (invalid stored block lengths)” when using the open-source gzip implementation).

Well, a hypothetical standpoint for my own part :slight_smile: (so not a direct aim at this point). Though I have had thoughts about writing a simple reader for a subset of the format. Marc, the original poster, is the one actually working on an implementation. I’m just an interested bystander.

What I was thinking of was reading, not writing, HDF5 files (which is what Marc is doing).

In my case, the opposite: I don’t care about writing files, just reading them.

If I read RFC 1950 correctly, a ZLIB stream in “deflate” format (which I think is what the H5Zdeflate.c filter produces, via the compress2 function; someone correct me if I’m wrong) should start with the nibble 1000b = 0x8, while your data starts with 0111b = 0x7.
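
For reference, a mechanical check of the first two chunk bytes against RFC 1950 could look like the sketch below (ZlibHeader is just a made-up helper name):

```csharp
static class ZlibHeader
{
    // RFC 1950: byte 0 (CMF) carries the compression method in its low nibble
    // (8 = deflate) and the window size in its high nibble; byte 1 (FLG) is
    // chosen so that CMF*256 + FLG is a multiple of 31.
    public static bool LooksLikeZlib(byte cmf, byte flg)
    {
        bool methodIsDeflate = (cmf & 0x0F) == 8;
        bool checkBitsOk = ((cmf << 8) | flg) % 31 == 0;
        return methodIsDeflate && checkBitsOk;
    }
}
```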

OK, then what could I be missing, given that this is the address my DataLayout message points to? Am I getting the wrong address, or is it the right address but the data is in a different format?

My DataLayoutObjectHeaderMessage is at 0x0BC0 and points to a tree at 0x0CA0:

<DataLayoutObjectHeaderMessage @BC0: 2 Dimension(s), [], Size 1>
<Tree @CA0, 0 children, siblings (-, -)>

54524545 // signature ("TREE")
01000100 // node type 1 (chunked raw data), node level 0, entries used = 1
FFFFFFFF // left sibling (undefined)
FFFFFFFF // right sibling (undefined)
F85B0000 // size of chunk (0x5BF8)
00000000 // filter mask (i.e. no filters are skipped)
00000000 // dim 1 offset (64 bits)
00000000 // dim 1 offset (64 bits)
00000000 // dim 2 offset (64 bits)
00000000 // dim 2 offset (64 bits)
00000000 // extra dim offset (64 bits)
00000000 // extra dim offset (64 bits)
180D0000 // address => 0x0D18

Looks correct to me.
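
A node like this could be decoded along the lines of the sketch below, assuming 4-byte addresses and 8-byte chunk offsets as in the dump, with made-up names and only the first key/child pair read:

```csharp
using System.IO;

// Sketch of decoding a v1 B-tree node of type 1 (chunked raw data), matching
// the dump above: 4-byte addresses, 8-byte chunk offsets, rank-2 dataset.
sealed class ChunkBTreeNode
{
    public byte NodeType, NodeLevel;
    public ushort EntriesUsed;
    public uint LeftSibling, RightSibling;
    public uint ChunkSize, FilterMask;
    public ulong[] Offsets;   // one per dimension, plus one trailing element
    public uint ChildAddress;

    public static ChunkBTreeNode Read(BinaryReader r, int rank)
    {
        var node = new ChunkBTreeNode();
        r.ReadBytes(4);                     // "TREE" signature
        node.NodeType = r.ReadByte();       // 1 = chunked raw data
        node.NodeLevel = r.ReadByte();      // 0 = leaf
        node.EntriesUsed = r.ReadUInt16();
        node.LeftSibling = r.ReadUInt32();  // 0xFFFFFFFF = undefined
        node.RightSibling = r.ReadUInt32();
        node.ChunkSize = r.ReadUInt32();    // first key: size of chunk in bytes
        node.FilterMask = r.ReadUInt32();
        node.Offsets = new ulong[rank + 1];
        for (int i = 0; i <= rank; i++)
            node.Offsets[i] = r.ReadUInt64();
        node.ChildAddress = r.ReadUInt32(); // chunk address (0x0D18 in the dump)
        return node;
    }
}
```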

My messages are:

<DataSpaceObjectHeaderMessage @B58: <2 Dimension(s), [720x720]>>
<DataTypeObjectHeaderMessage @B70: <FixedPoint Size 1 0/8>>
<DataStorageFillValueObjectHeaderMessage @B88: >
<DataStorageFilterPipelineObjectHeaderMessage @B98: Deflate>
<DataLayoutObjectHeaderMessage @BC0: 2 Dimension(s), [], Size 1>
<ObjectHeaderContinuationObjectHeaderMessage @BE0, >
<NilObjectHeaderMessage @BF0>

So the only relevant data transformation should be the Deflate filter, right?
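
For what it’s worth, the way I picture the filter step on read is roughly the sketch below (made-up names; the deflate decoder would be something like the DeflateStream sketch earlier in the thread):

```csharp
using System;
using System.Collections.Generic;

static class FilterPipeline
{
    // Sketch: on read, filters are undone in reverse pipeline order, and a set
    // bit i in the chunk's filter mask means filter i was skipped for this chunk.
    // Only deflate (filter id 1) is handled here.
    public static byte[] Decode(byte[] chunk, IReadOnlyList<ushort> filterIds, uint filterMask)
    {
        for (int i = filterIds.Count - 1; i >= 0; i--)
        {
            if ((filterMask & (1u << i)) != 0)
                continue;                         // filter was skipped at write time
            if (filterIds[i] == 1)                // 1 = H5Z_FILTER_DEFLATE
                chunk = ChunkInflater.Inflate(chunk);
            else
                throw new NotSupportedException($"filter {filterIds[i]} not implemented");
        }
        return chunk;
    }
}
```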

—marc

Looks correct to me too, so I don’t really know :frowning:


:frowning: @koziol, any ideas?

Bumped up as a new thread: Stuck trying to figure out how to read GZIP data from HDF5