BTW when you’re done with your library it would be great if you posted a
link.
Side note: From time to time I’ve thought about writing a minimal C library
to read a subset of HDF5 myself. The reason being that we are doing
multi-threaded reading of multiple HDF5 files in a GUI app (to avoid
blocking the UI), but the fact that the HDF5 library handles thread safety
by simply taking a global lock (so is not thread efficient) is becoming an
annoyance, since large reads can block small ones for quite some time. Our
requirements are not big, we simply need to be able to read chunked
compressed datasets, and we’re only using the “old” pre-1.10 format. So I
don’t think a minimal thread safe/thread efficient C library to do just
that would be that big an effort. We have no need for writing files, and we
have no need for HPC features like MPI etc.
If anyone else reading this has already written such a library (I’m
thinking maybe for embedded applications?), please shout out!
Indeed, there are several alternative implementations of the HDF5 data
interface:
If I got the binary data from the raw data chunk, and the only filter present is a “deflate”, should I be able to pass this data straight into a GZIP decompress API to get the final results? Because I’m trying to do that using .NET’s System.IO.Compression.GZipStream, and it says the data is corrupted…
I suspect the deflate filter saves the raw deflate stream, not GZip format (which is deflate stream + gzip header/trailer), but check the HDF5 source to be sure.
Not familiar with .NET myself, but System.IO.Compression.GZipStream sounds like a class that deals with GZip format (including header/trailer). Perhaps System.IO.Compression.DeflateStream is the class to use?
Though I have now found this SO answer suggesting DeflateStream is not in fact compatible with ZLIB’s deflate algorithm: https://stackoverflow.com/a/70658/252857 So it may be that you are out of luck.
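To make the three container formats discussed above concrete, here is a minimal sketch using Python's standard zlib and gzip modules (not .NET, and not the HDF5 library). It assumes the stream was produced without a preset dictionary, so the zlib wrapper is exactly a 2-byte header plus a 4-byte Adler-32 trailer around the raw deflate data. The `payload` bytes and the `try_decompress` helper are made up for illustration.

```python
import gzip
import zlib

payload = b"chunk bytes " * 64  # stand-in for real chunk contents

# zlib container (RFC 1950): 2-byte header + deflate data + Adler-32 trailer.
zlib_stream = zlib.compress(payload)

# gzip container (RFC 1952): gzip header + deflate data + CRC-32 and size.
gzip_stream = gzip.compress(payload)

# Raw deflate (RFC 1951): what remains once the zlib wrapper is stripped.
raw_stream = zlib_stream[2:-4]


def try_decompress(data: bytes, wbits: int) -> bool:
    """Return True if `data` decodes back to `payload` with this wbits value."""
    try:
        return zlib.decompress(data, wbits) == payload
    except zlib.error:
        return False


# wbits=15 expects zlib, wbits=31 expects gzip, wbits=-15 expects raw deflate.
results = {
    "zlib stream read as zlib": try_decompress(zlib_stream, 15),
    "zlib stream read as gzip": try_decompress(zlib_stream, 31),
    "zlib stream read as raw deflate": try_decompress(zlib_stream, -15),
    "stripped stream read as raw deflate": try_decompress(raw_stream, -15),
    "gzip stream read as gzip": try_decompress(gzip_stream, 31),
}

for label, ok in results.items():
    print(f"{label}: {'ok' if ok else 'fails'}")
```

If the chunk really is in zlib format, a raw-deflate-only decoder would need the first two bytes (and the trailer) skipped, while a gzip decoder would reject it outright.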
Tried a different library, same/similar error. I think I’m missing something else and this isn’t actually raw/valid GZIP data I’m looking at. Will investigate more when I have time…
Maybe try printing the first few bytes of the chunk data in HDF5’s own H5Zdeflate.c and compare with what you have, to make sure you’re reading from the right spot?
BTW I was reading the spec a little, and I’m curious: When you traverse the v1 b-tree to find a chunk, what is the comparison used between chunks? To me that’s one of the things that looks a little under-specced. The spec mentions that chunks are ordered in the tree by their index into the dataset they belong to (that’s obvious), but it doesn’t mention whether the comparison first compares by the index in the slowest changing dimension, then the index in the next to slowest dimension and so on, or if it’s the other way around. Did you discover this yourself, or was it obvious to you after reading the spec?
In short, what definition of “less than” is used for chunks in the b-tree?
And also, when v1 btrees are used for keeping group children, I guess the children are ordered lexicographically in the tree? (but I can’t find this specified explicitly either).
I think the documentation on the old H5Giterate and newer H5Literate discusses this issue somewhat. But only insofar as the API itself is specified, not the actual internal storage implementation used by the library. And I think that is the best you can rely upon, as the lib internals are not part of the API specification.
Thanks, but that seems to just control whether a separate index is also created for tracking the creation order. I’m pretty sure the primary index is ordered lexicographically by link name? (to allow fast lookup of paths).
I just wondered whether this was specified anywhere in the spec. Even if it’s kind of obvious in the case of group links, I think the spec is incomplete without specifying by which criterion the indices are ordered.
I was more interested in the ordering criterion used for the chunk index, since I think that is less obvious. I can’t find that in the spec either.
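To illustrate why the ordering criterion matters for a reader, here is a small sketch of a chunk lookup. It assumes the keys are compared lexicographically on the chunk offsets with dimension 0 (the slowest changing) compared first. That is an assumption for this example only; the thread does not confirm it, so check it against the HDF5 source before relying on it. The names `chunk_origin` and `find_chunk`, and the addresses, are hypothetical.

```python
from bisect import bisect_left


def chunk_origin(coords, chunk_shape):
    """Offset of the chunk that contains the element at `coords`."""
    return tuple((c // s) * s for c, s in zip(coords, chunk_shape))


def find_chunk(sorted_keys, addresses, coords, chunk_shape):
    """Binary-search a sorted key list; return the chunk address or None."""
    key = chunk_origin(coords, chunk_shape)
    i = bisect_left(sorted_keys, key)  # tuple comparison: dimension 0 first
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return addresses[i]
    return None  # chunk was never allocated


# A 4x6 dataset stored as 2x3 chunks has four chunk offsets.
# Sorting tuples orders them by dimension 0 first, then dimension 1.
keys = sorted([(2, 0), (0, 3), (0, 0), (2, 3)])
addresses = [0x1000, 0x2000, 0x3000, 0x4000]  # made-up file addresses

print(keys)
print(hex(find_chunk(keys, addresses, (3, 4), (2, 3))))
```

If the real ordering compared the fastest changing dimension first, the same binary search would descend into the wrong subtree, which is why the question is not academic for a third-party reader.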
Thanks, but I was talking from the standpoint of a third party implementor of the spec, not as user of the HDF5 library. To be able to correctly implement a reader, one must know by which criterion the indices are ordered, and it seems this info is left out of the spec?
Oh, really? I hadn’t understood that was your aim. So, you mean to achieve a bytes-on-disk arrangement that matches what the HDF5 lib expects without using the HDF5 lib implementation? I guess I should have read the whole thread before commenting.
It’s the beginning of a block of data that looks “random” (preceded by lots of zeros and more structured “HDF5-looking” bytes), so I’m tempted to believe it’s the correct address. But maybe the raw chunk doesn’t contain the GZIP data right away, and has another preamble or header? I’m not sure what a GZIP compressed stream of data should look like, but it fails right on the first byte(s) it reads (with the bit-more-precise error “Message: Bad state (invalid stored block lengths)”, when using the open source gzip implementation).
Well, a hypothetical standpoint for my own part (so not a direct aim at this point). Though I have had thoughts about writing a simple reader for a subset of the format. Marc, the original poster, is the one actually working on an implementation. I’m just an interested bystander.
What I was thinking of was reading, not writing, HDF5 files (which is also what Marc is doing).
If I read RFC 1950 correctly, a ZLIB stream in “deflate” format (which I think is what H5Zdeflate.c filter uses, by using the compress2 function, someone correct me if I’m wrong) should start with the nibble 1000b = 0x8, while your data starts with 0111b = 0x7.
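For anyone who wants to test this on their own bytes, here is a minimal sketch of the RFC 1950 header check in Python. One detail worth keeping in mind: RFC 1950 puts the compression method (CM = 8 for deflate) in the low four bits of the first byte and the window size (CINFO) in the high four bits, so a typical zlib stream begins with the byte 0x78. A leading hex digit of 7 is therefore not by itself a sign of the wrong address. If the first byte is in fact 0x78, a raw-deflate decoder would read it as a stored block, which would be consistent with an "invalid stored block lengths" error, though that is an inference and not something confirmed in this thread. The function name is made up for illustration.

```python
import zlib


def looks_like_zlib_header(data: bytes) -> bool:
    """Check the two-byte zlib header described in RFC 1950."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method = cmf & 0x0F            # CM: low nibble, 8 means deflate
    window_bits = (cmf >> 4) + 8   # CINFO: high nibble is log2(window) - 8
    check_ok = ((cmf << 8) | flg) % 31 == 0  # FCHECK makes this a multiple of 31
    return method == 8 and window_bits <= 15 and check_ok


sample = zlib.compress(b"some chunk data")
print(sample[:2].hex(), looks_like_zlib_header(sample))
print(looks_like_zlib_header(b"\x1f\x8b"))  # gzip magic bytes, not zlib
```

Running this on the first two bytes at the suspected chunk address is a cheap way to tell a zlib stream apart from gzip data or from an unrelated file structure.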
OK, then what could I be missing, given this is the address that my DataLayout message points to? Am I getting the wrong address, or is it the right address, but the data is in a different format?
My DataLayoutObjectHeaderMessage is at 0x0BC0 and points to a tree at 0x0CA0.
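Since the layout message points at a tree and not at the chunk itself, one sanity check is to parse the node header at that address before following any child pointer. The sketch below reads the fixed part of a v1 B-tree node as laid out in the file format spec: the "TREE" signature, node type (0 for group nodes, 1 for raw data chunk nodes), node level, entries used, and the two sibling addresses. It assumes little-endian 8-byte file offsets; the real "size of offsets" comes from the superblock, so treat that default as an assumption. The function name and the sample bytes are made up.

```python
import struct


def parse_v1_btree_header(buf: bytes, offset_size: int = 8) -> dict:
    """Parse the fixed header of a v1 B-tree node from `buf`."""
    if buf[:4] != b"TREE":
        raise ValueError("no v1 B-tree signature at this address")
    node_type, level, entries_used = struct.unpack_from("<BBH", buf, 4)
    left = int.from_bytes(buf[8:8 + offset_size], "little")
    right = int.from_bytes(buf[8 + offset_size:8 + 2 * offset_size], "little")
    return {
        "node_type": node_type,        # 1 means the node indexes raw data chunks
        "level": level,                # 0 means children are the chunks themselves
        "entries_used": entries_used,
        "left_sibling": left,
        "right_sibling": right,
        "keys_start": 8 + 2 * offset_size,  # first key follows the header
    }


# Synthetic leaf node header: chunk node, level 0, one entry, no siblings.
undefined = (0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
node = b"TREE" + struct.pack("<BBH", 1, 0, 1) + undefined + undefined
header = parse_v1_btree_header(node)
print(header)
```

If the bytes at 0x0CA0 parse as a level-0 node of type 1, the compressed data should start at the child address stored after the first key, not at the tree address itself.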