Quick way to use h5xxx tool to test if LIBVER_LATEST in effect

miller86 May 21, 2018, 8:59pm 1

I have a colleague writing a 1.5 gigabyte HDF5 file with tons of tiny datasets in it. I asked him to use H5P_LIBVER_LATEST and it reduced file size by only about 10%. However, even after eye-balling his Fortran code, I am not convinced it is really taking effect. I am looking for a quick-look tool that can tell me if the file, or at least some objects in it, is using LIBVER_LATEST. I tried h5stat -F but a) it seems to take forever to produce any output for this file and b) even after getting the output, I couldn’t make heads or tails of what it means in terms of internal data structure versions.

dave.allured May 22, 2018, 12:17am 2

I like h5dump -BH for quickly checking the superblock version, which is the first answer you requested. It slows down for files of many GB, but should be quick for your 1.5 GB file. This does NOT inventory all objects in the file, only certain top level ones. Here is a live example from a file created with LIBVER_EARLIEST.

h5dump -BH air_temp.ens901.monthly.nc | grep -i vers
SUPERBLOCK_VERSION 0
FREELIST_VERSION 0
SYMBOLTABLE_VERSION 0
OBJECTHEADER_VERSION 0

… and a couple hits on some attribute names.

–Dave

miller86 May 23, 2018, 12:24am 3

Ok, that is useful. Thx. But, it really does take forever. And, I honestly don’t know how to interpret the output. What does SUPERBLOCK_VERSION 0 mean? Or what do SUPERBLOCK_VERSION 2 and SYMBOLTABLE_VERSION 0 mean? I feel like there needs to be a way of asking, was this file and everything in it produced by application(s) that set LIBVER_LATEST?

epourmal May 23, 2018, 1:21am 4

All,

Please see this document that explains changes to the H5Pset_libver_bounds and which features/APIs trigger a specific version of the object header messages including the version of the superblock (see section 5).

Thank you!

Elena

epourmal May 23, 2018, 1:27am 5

Hi Mark,

I am wondering if using a split file driver (it separates metadata from raw data and creates two files) will help to quickly see the sizes of the internal structures?

We will be very interested in reproducing the issue here. Would it be possible for you to outline the set of APIs that create the file?

Also, which version of HDF5 is used? Is the group hierarchy flat? Is the H5Pset_libver_bounds function called with the earliest version set to the latest file format? From the current description it is really hard to interpret the result.

Thank you!

Elena

miller86 May 24, 2018, 4:16am 6

Thanks for the doc reference. After explicitly searching for the strings “SUPERBLOCK_VERSION”, “FREELIST_VERSION”, “SYMBOLTABLE_VERSION” and “OBJECTHEADER_VERSION” (produced by, I think, the version 1.8 h5dump tool) in that document, I get only one hit and that is for SUPERBLOCK_VERSION. I read the parts of the document where I got hits for that and surmise (though not 100% confidently) that a ‘2’ for that means libver latest in 1.8 and 1.10. Good. But, it would be good if the tool printed something like “latest available in version 1.8 of the library the file was created with” (if something like that makes sense). And, it would be good if the nomenclature printed by tools like h5dump matched the documentation and, AFAICT, it doesn’t seem to.

Next, if all these version designators are available near the front (head or superblock) of the file, then why does one need to scan all 1.5 gigabytes of the file to obtain them? I mean, isn’t there a tool (or way of invoking a tool) that doesn’t stream through the entire file to print information like this? Is it possible to maybe terminate on first encounter (maybe like diff -q does when indicating if files are different or not)?
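For the superblock version alone, a few bytes at the front of the file are enough. The sketch below is not an HDF5 tool; it is a minimal Python probe that relies on the file format specification: the 8-byte signature sits at offset 0, 512, 1024, 2048 and so on (to allow for a user block), and the byte right after it is the superblock version. The function name, the max_offset cutoff and the short descriptions in the table are my own assumptions, and the probe says nothing about the versions of individual object headers.

```python
import io
import sys

# 8-byte format signature from the HDF5 file format specification.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Rough reading of the version byte (my summary, not tool output).
SUPERBLOCK_NOTES = {
    0: "original superblock layout",
    1: "like 0, with an extra B-tree parameter field",
    2: "compact layout introduced with the 1.8 'latest' format",
    3: "like 2, introduced with 1.10",
}

def superblock_version(stream, max_offset=1 << 20):
    """Return the superblock version byte, or None if no signature is found.

    `stream` must be a seekable binary file object. `max_offset` is an
    arbitrary cutoff (an assumption) so that a non-HDF5 file is rejected
    after a handful of reads instead of being probed to its end.
    """
    offset = 0
    while offset <= max_offset:
        stream.seek(offset)
        header = stream.read(9)          # signature (8 bytes) + version (1 byte)
        if len(header) < 9:
            return None                  # ran off the end of the file
        if header[:8] == HDF5_SIGNATURE:
            return header[8]
        offset = 512 if offset == 0 else offset * 2
    return None

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        version = superblock_version(f)
    print(version, SUPERBLOCK_NOTES.get(version, "unknown or not an HDF5 file"))
```

Saved under a hypothetical name such as probe.py, it would be run as python probe.py myfile.h5 and reads only a handful of bytes however large the file is.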

Finally, since I had my file gzip compressed (from 1.5 Gigs down to ~70 megs), I wanted to do something like…

gunzip < myfile.h5.gz | h5dump -BH - | grep -i vers

But, it doesn’t look like chaining of HDF5 tools like this is possible yet. I would recommend adding support for the standard unix I/O redirection operations where appropriate including the dash as a file designator. Could be very helpful in situations like this.
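Until the tools accept a dash, the superblock version at least can be read from the compressed file without unpacking all of it. The following is a sketch under assumptions, using only Python's standard gzip module and no HDF5 library: it decompresses just far enough to find the 8-byte HDF5 signature (expected at offset 0, 512, 1024, ... per the file format specification) and returns the byte that follows it. The function name and the max_bytes cutoff are hypothetical choices of mine.

```python
import gzip
import io

# 8-byte format signature from the HDF5 file format specification.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def superblock_version_from_gzip(raw_stream, max_bytes=1 << 20):
    """Return the superblock version byte from a gzip-compressed HDF5 stream.

    Reads strictly forward, so `raw_stream` does not need to be seekable.
    Gives up (returns None) if no signature is found at a candidate offset
    up to `max_bytes` of decompressed data, an arbitrary cutoff.
    """
    with gzip.GzipFile(fileobj=raw_stream, mode="rb") as gz:
        buffer = b""
        probe = 0
        while probe <= max_bytes:
            needed = probe + 9           # signature (8 bytes) + version (1 byte)
            while len(buffer) < needed:
                chunk = gz.read(needed - len(buffer))
                if not chunk:
                    return None          # decompressed data ended early
                buffer += chunk
            if buffer[probe:probe + 8] == HDF5_SIGNATURE:
                return buffer[probe + 8]
            probe = 512 if probe == 0 else probe * 2
    return None
```

With an ordinary file on disk, something like superblock_version_from_gzip(open("myfile.h5.gz", "rb")) would be the call. Passing sys.stdin.buffer should in principle allow shell redirection as well, though I have not verified that against a real 1.5 gigabyte file.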

miller86 May 24, 2018, 4:19am 7

The split file driver is a good idea to help narrow down what is going on. I will contact my colleague and try that next week. @koziol has a copy of the file and I am hoping to hear from him regarding what he thinks.

This was done with version 1.8.16 I think. Group hierarchy is ~8 layers and pretty wide fanout in spots. I also had this user write all datasets < 10 ints (or doubles or floats or whatever) as attributes instead. Still, only ~10% reduction in size and I was expecting more like 3-5x reduction.