Stuck trying to figure out how to read GZIP data from HDF5

dwarfland · May 28, 2018, 12:52pm

I’m pulling this out to a new thread from here…

I’m still stuck a bit with my HDF5 reader implementation. I’m trying to read the actual data block in this file (dropbox link), which according to the messages is (supposedly) GZIP-deflated.

My DataLayoutObjectHeaderMessage is at 0x0BC0 and points to a tree at 0x0CA0

<DataLayoutObjectHeaderMessage @BC0: 2 Dimension(s), [], Size 1>
<Tree @CA0, 0 children, siblings (-, -)>

54524545 //sig
01000100 // flags & co
FFFFFFFF //sibling
FFFFFFFF // sibling
F85B0000 // size of chunk
00000000 // filter mask (i.e. none are skipped)
00000000 // dim1 (64 bits)
00000000 // dim1 (64 bits)
00000000 // dim2 (64 bits)
00000000 // dim2 (64 bits)
00000000 // dim extra (64 bits)
00000000 // dim extra (64 bits)
180D0000 // address => 0d18
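
For anyone following along, the fields in the dump above can be decoded with a few lines of Python. This is only a sketch: it assumes little-endian fields, 8-byte addresses (as the "64 bits" notes in the dump suggest) and a node with a single entry. `parse_chunk_btree_node` and the dictionary keys are made-up names, and the sample bytes are rebuilt from the values in the dump with the two sibling addresses written out as full 8-byte fields.

```python
import struct

def parse_chunk_btree_node(buf, ndims):
    """Decode the header and first (key, child) pair of a v1 B-tree chunk node.

    ndims is the number of offsets stored per key (dataset rank + 1).
    Assumes 8-byte addresses; a real reader would take this from the superblock.
    """
    sig, node_type, level, entries_used = struct.unpack_from("<4sBBH", buf, 0)
    if sig != b"TREE":
        raise ValueError("missing TREE signature")
    left_sibling, right_sibling = struct.unpack_from("<QQ", buf, 8)
    pos = 24
    chunk_size, filter_mask = struct.unpack_from("<II", buf, pos)
    pos += 8
    offsets = struct.unpack_from("<%dQ" % ndims, buf, pos)
    pos += 8 * ndims
    (child_address,) = struct.unpack_from("<Q", buf, pos)
    return {
        "node_type": node_type,
        "level": level,
        "entries_used": entries_used,
        "left_sibling": left_sibling,
        "right_sibling": right_sibling,
        "chunk_size": chunk_size,
        "filter_mask": filter_mask,
        "offsets": offsets,
        "child_address": child_address,
    }

# Sample node rebuilt from the dump (hypothetical reconstruction, not read from the file)
sample = (
    bytes.fromhex("54524545" "01000100")      # "TREE", type 1, level 0, 1 entry
    + b"\xff" * 16                            # left and right sibling: undefined
    + bytes.fromhex("F85B0000" "00000000")    # chunk size, filter mask
    + bytes(24)                               # three 64-bit offsets, all zero
    + bytes.fromhex("180D000000000000")       # child (chunk) address
)
node = parse_chunk_btree_node(sample, ndims=3)
print(hex(node["child_address"]), node["chunk_size"])   # 0xd18 23544
```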

which looks correct to me. the full list of messages on the object is:

<DataSpaceObjectHeaderMessage @B58: <2 Dimension(s), [720x720]>>
<DataTypeObjectHeaderMessage @B70: <FixedPoint Size 1 0/8>>
<DataStorageFillValueObjectHeaderMessage @B88: >
<DataStorageFilterPipelineObjectHeaderMessage @B98: Deflate> // GZIP!
<DataLayoutObjectHeaderMessage @BC0: 2 Dimension(s), [], Size 1>
<ObjectHeaderContinuationObjectHeaderMessage @BE0, >
<NilObjectHeaderMessage @BF0>

so the only relevant data transformation seems to be the GZIP Deflate filter. Yet the data does not seem to be a valid GZIP stream; I tried several different implementations and they all refuse the data. The data starts in my file at 0xD18, and looks like this:

789CECBDF777E2D89680AB1FDE5A6FBDB9333775
DF4E151D48069C730E6424A104220703CE95ABABA
AF39D3B71CDCCFFDC6F6F09BBB08D6DB08123C4
FEBACB0113E4A38FED7DD2D6EFBF13DD85EB04D
6074B102DE8C861929BB026DDD198C426D8D22B8
...

it’s the beginning of a block of data that looks “random” (preceded by lots of zeros and more structured “HDF5-looking” bytes), so I’m tempted to believe it’s the correct address. But maybe the raw chunk doesn’t contain the GZIP data right away, and has another preamble or header? I’m not sure what a GZIP compressed stream of data should look like, but decoding fails right on the first byte(s) it reads (with the slightly more precise error “Message: Bad state (invalid stored block lengths)” when using the open source gzip implementation).

Unfortunately, the spec just says “Filters supported by The HDF Group are documented immediately below.” but then does not really define how the filters work in more detail, aside from stating that id “01” means “GZIP deflate compression”…

I’m sure I’m missing something. Does the filter add some sort of preamble or other processing around the actual GZIP data?

Any help would be greatly appreciated, as I need to move forward with this project… (I’ll open source the reader implementation when done.)

thanx,
marc

epourmal · May 28, 2018, 3:48pm

Hi Marc,

Data address is correct. You can use h5debug tool to confirm it:

% h5debug Curacao180420162506.PPI8166.h5 2888

Reading signature at address 2888 (rel)
…
Message 4…
Message ID (sequence number): 0x0008 `layout' (0)
Dirty: FALSE
Message flags:
Chunk number: 0
Raw message data (offset, size) in chunk: (128, 24) bytes
Message Information:
  Version:                                     3
  Type:                                        Chunked
  Number of dimensions:                        3
  Size:                                        {720, 720, 1}
  Index Type:                                  v1 B-tree
  Index address:                               3232
…

% h5debug Curacao180420162506.PPI8166.h5 3232 3 720 720 1

Reading signature at address 3232 (rel)
Tree type ID: H5B_CHUNK_ID
Size of node: 120
Size of raw (disk) key: 32
Dirty flag: False
Level: 0
Address of left sibling: UNDEF
Address of right sibling: UNDEF
Number of children (max): 1 (2)
Child 0…
Address: 3352
Left Key:
  Chunk size:                                  23544 bytes
  Filter mask:                                 0x00000000
  Logical offset:                              {0, 0, 0}
Right Key:
  Chunk size:                                  0 bytes
  Filter mask:                                 0x00000000
  Logical offset:                              {720, 720, 1}

Address of the chunk is 3352 or 0xD18

Filter mask is 0 meaning the filter was applied, and of course it was since the size of compressed data is 23544.
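
The reasoning here can be checked with a little arithmetic. The sketch below assumes one byte per element (the datatype message reports a fixed-point type of size 1) and a single filter in the pipeline; `applied_filters` is a hypothetical helper, not an HDF5 API. In the filter mask, a set bit means the filter at that position in the pipeline was skipped for the chunk.

```python
def applied_filters(filter_mask, n_filters):
    """Return the pipeline positions of filters that were NOT skipped."""
    return [i for i in range(n_filters) if not (filter_mask >> i) & 1]

# Chunk dimensions from the layout message; the last entry is the element size in bytes
chunk_dims = (720, 720, 1)
raw_size = 1
for d in chunk_dims:
    raw_size *= d

stored_size = 23544   # chunk size taken from the B-tree key

print(applied_filters(0x00000000, 1))   # [0] -> the only filter (deflate) ran
print(raw_size, stored_size)            # 518400 23544
```

A stored size far below the 518400 bytes of raw data is what one would expect if deflate was applied.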

Data in the chunk shouldn’t have any other headers, etc., i.e., GZIP should work.

Have you tried (just for sanity checking) to read the compressed chunk with the H5DOread_chunk function to get the compressed data (you will need to provide the buffer size; see the example on the page) and compare it with what you are seeing in the file?

Hopefully someone else will chime in and will point to something I am missing…

Thank you!

Elena

dwarfland · May 28, 2018, 4:12pm

Elena,

thanx a lot for your reply!

in the other thread, @Elvis_Stansvik pointed out that “If I read RFC 1950 correctly, a ZLIB stream in “deflate” format (which I think is what H5Zdeflate.c filter uses, by using the compress2 function, someone correct me if I’m wrong) should start with the nibble 1000b = 0x8, while your data starts with 0111b = 0x7”

So it seems the data is simply not just a plain ZLIB stream. While I can’t say I fully comprehend the (third party/open source) lib implementation I tried, with the little debugging I did through it, I can confirm it does fail right on the very first byte it reads, with “invalid state”.

I’ll see if I can get that into a shape I can use (I’m on Mac, and not really that deeply familiar with Windows C++ build chains to have attempted to build that code on my own, yet ;). I’ll give that a try.

dwarfland · May 28, 2018, 4:18pm

In fact, the different library I tried right now fails very cleanly on:

if (header[0] != 0x1F || header[1] != 0x8B || header[2] != 8)
    throw new ZlibException("Bad GZIP header.");

where the bytes it’s checking are of course “78 9C EC” which clearly does not match what this expects. In fact I cannot find 1F8B anywhere in the file…

elvis.stansvik · May 28, 2018, 6:48pm

Though, that seems to be from a library that reads GZip format (including
the GZip header). Like I mentioned in the other thread, the chunks in HDF5
are compressed just using raw ZLIB deflate (no GZip header). So you’ll have
to find a library that decompresses just plain ZLIB deflate, not GZip
(perhaps the ones you’ve tried do support that, just through different
functions than the ones you’ve used?).
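
To illustrate the difference between the two containers, here is a small Python sketch using the standard `zlib` and `gzip` modules. The payload is synthetic and is not data from the file in question. The point is only that a stream written in the zlib container starts with 0x78 and is rejected by a decoder that insists on a GZip header, while the same library accepts it when asked for the zlib container.

```python
import gzip
import zlib

payload = bytes(range(256)) * 64          # synthetic test data

zlib_stream = zlib.compress(payload)      # zlib container (RFC 1950)
gzip_stream = gzip.compress(payload)      # gzip container (RFC 1952)

print(zlib_stream[:1].hex())              # 78
print(gzip_stream[:2].hex())              # 1f8b

# The default decoder expects the zlib container and accepts the zlib stream.
roundtrip = zlib.decompress(zlib_stream)

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header instead.
gzip_only = 16 + zlib.MAX_WBITS
from_gzip = zlib.decompress(gzip_stream, gzip_only)

try:
    zlib.decompress(zlib_stream, gzip_only)
    rejected = False
except zlib.error:
    rejected = True                        # a gzip-only decoder refuses zlib data

print(roundtrip == payload, from_gzip == payload, rejected)   # True True True
```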

But in any case, if you look at RFC 1950 [1], section 2.2 Data format, it
does say that the first nibble should be 8. So either

1. I’m misunderstanding what algorithm the compress2 from zlib uses (which
   is what H5Zdeflate.c uses),
2. The data is preceded by something (undocumented?) like you mentioned, or
3. You don’t have the right data after all

Elvis

[1] https://tools.ietf.org/html/rfc1950

elvis.stansvik · May 28, 2018, 6:58pm

In fact, I’ve misunderstood how they order the bits in that RFC.

The beginning of your data looks good. The 7 nibble is the CINFO field and
indicates 32K window size, the 8 nibble that follows is the compression
method (deflate).
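
The header fields can be decoded by hand. Below is a minimal sketch of the RFC 1950 header layout, applied to the first two bytes of the chunk quoted earlier in the thread (78 9C). `decode_zlib_header` and the dictionary keys are just illustrative names.

```python
def decode_zlib_header(cmf, flg):
    """Split the two zlib header bytes (CMF, FLG) into their RFC 1950 fields."""
    return {
        "method": cmf & 0x0F,                      # CM: low nibble, 8 means deflate
        "window_size": 1 << ((cmf >> 4) + 8),      # CINFO: high nibble, log2(window) - 8
        "check_ok": ((cmf << 8) | flg) % 31 == 0,  # FCHECK makes the 16-bit value a multiple of 31
        "preset_dict": bool(flg & 0x20),           # FDICT
        "level_hint": flg >> 6,                    # FLEVEL
    }

header = decode_zlib_header(0x78, 0x9C)
print(header)
# {'method': 8, 'window_size': 32768, 'check_ok': True, 'preset_dict': False, 'level_hint': 2}
```

So in 0x78 the compression method sits in the low nibble and the window size in the high nibble, which is why the byte reads "7 then 8" in a hex dump.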

So I think it’s simply a matter of finding a decompression library that
supports plain ZLIB deflate streams (not the original old deflate
algorithm, which I think is slightly different).

Elvis

dwarfland · May 28, 2018, 7:58pm

ah, yeah, so that was a red herring then. but either way, the data still seems wrong, even if this check wasn’t the right one…

I did use the plain ZLIB stream first, that too fails (with the more generic “invalid state”) on the first byte.

although, now that I think of it: what’s the definition of “first” nibble? for 78, the 8 is the lower nibble, and COULD be argued to be the first.

yup. guess I just have to go digging for additional zlib options, or debug deeper as to why the ones I tried don’t like this even though it looks correct…

I’ll keep you updated.

thanx!
marc

dwarfland · May 28, 2018, 8:45pm

Wohoo! ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream works!
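
For readers not on .NET, roughly the same thing can be sketched in Python with `zlib.decompressobj`, which inflates incrementally much like a stream class does. The "file" here is a synthetic in-memory stand-in with a compressed block placed at the chunk address discussed above; `read_chunk` is a hypothetical helper, not part of any HDF5 library.

```python
import io
import zlib

def read_chunk(f, address, stored_size, block=4096):
    """Seek to a stored chunk and inflate it piece by piece."""
    f.seek(address)
    inflater = zlib.decompressobj()      # expects the zlib container by default
    out = bytearray()
    remaining = stored_size
    while remaining > 0:
        piece = f.read(min(block, remaining))
        if not piece:
            raise EOFError("file ended inside the chunk")
        out += inflater.decompress(piece)
        remaining -= len(piece)
    out += inflater.flush()
    return bytes(out)

# Synthetic stand-in: a 720x720 block of one-byte zeros, compressed and placed at 0xD18
raw = bytes(720 * 720)
compressed = zlib.compress(raw)
address = 0xD18
fake_file = io.BytesIO(bytes(address) + compressed + b"trailing bytes")

data = read_chunk(fake_file, address, len(compressed))
print(len(data))   # 518400
```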

epourmal · May 28, 2018, 8:57pm

Marc,

Congratulations!

If you need to use it from C, please see the deflate filter implementation or the example I pointed you to. One uses the inflate function, the other uses uncompress. I guess you found an inflate equivalent.

Please keep us posted about your reader and any issues you find or suggestions you have for the File Format Spec.

Thank you!

Elena

dwarfland · May 28, 2018, 9:01pm

will do! to be honest, with this working I’ve covered everything I need for this concrete project that I’m tasked with/hired for. There’s lots of cases I don’t cover yet, only the ones hit by the file(s) I need to work with, which come from a specific source ;).

I can now work the actual logic upon that data. Once the project is done (and paid ;), I’ll look at making the HDF5 reading portion open. More work will be needed to make it more “universally” usable for any HDF5 file, of course.

thanx for all your help!
