I have a colleague writing a 1.5 gigabyte HDF5 file with tons of tiny datasets in it. I asked him to use H5P_LIBVER_LATEST and it reduced file size by only about 10%. However, even after eye-balling his Fortran code, I am not convinced it is really taking effect. I am looking for a quick-look tool that can tell me if it, or at least some objects in it, are using LIBVER_LATEST. I tried h5stat -F but a) it seems to take forever to produce an output for this file and b) even after getting the output, I couldn’t make heads or tails about what it means in terms of internal data structure versions.
I like h5dump -BH for quickly checking the superblock version number, which is the first answer you requested. It slows down for files of many Gbytes, but should be quick for your 1.5 GB file. This does NOT inventory all objects in the file, only certain top-level ones. Here is a live example from a file created with LIBVER_EARLIEST.
h5dump -BH air_temp.ens901.monthly.nc | grep -i vers
… and a couple hits on some attribute names.
Ok, that is useful. Thx. But, it really does take forever. And, I honestly don’t know how to interpret the output. What does SUPERBLOCK_VERSION 0 mean? Or what does SUPERBLOCK_VERSION 2 and SYMBOLTABLE_VERSION 0 mean? I feel like there needs to be a way of asking, was this file and everything in it produced by application(s) that set LIBVER_LATEST?
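One rough way to get at the "everything in it" part of that question without any HDF5 tool is to count the 4-byte block signatures from the HDF5 file format specification in the raw bytes. Version 2 object headers start with "OHDR" (version 1 object headers carry no signature), old-style groups use symbol table nodes marked "SNOD", version 1 B-tree nodes are marked "TREE", and the newer fractal heaps and version 2 B-trees are marked "FRHP" and "BTHD". The sketch below is only a heuristic: the function name and chunk size are made up for illustration, raw dataset bytes can contain these patterns by chance, and it still reads the whole file once (though without parsing any metadata).

```python
import io

# Block signatures taken from the HDF5 file format specification.
SIGNATURES = (b"OHDR", b"SNOD", b"TREE", b"FRHP", b"BTHD")


def count_signatures(stream, chunk_size=1 << 20):
    """Count raw occurrences of each signature in a binary stream.

    Hypothetical helper: a byte-level heuristic, not a format parser.
    """
    counts = {sig: 0 for sig in SIGNATURES}
    tail = b""  # last 3 bytes of the previous buffer, to catch split matches
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        for sig in SIGNATURES:
            counts[sig] += buf.count(sig)
        # A 4-byte signature cannot fit entirely inside a 3-byte tail,
        # so nothing is counted twice.
        tail = buf[-3:]
    return counts


# Synthetic demo bytes (not a real HDF5 file).
demo = io.BytesIO(b"xxOHDRxxSNODxOHDR")
print(count_signatures(demo, chunk_size=5))
```

On a real file one would open it with open(path, "rb") and pass the file object. Many "SNOD" hits next to few or no "OHDR" hits would suggest old-style groups and version 1 object headers, i.e. that the latest format was probably not in effect for those objects.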
Please see this document that explains the changes to H5Pset_libver_bounds and which features/APIs trigger a specific version of the object header messages, including the version of the superblock (see section 5).
I am wondering if using the split file driver would help (it separates metadata from raw data and creates two files) to quickly see the sizes of the internal structures?
We will be very interested in reproducing the issue here. Would it be possible for you to outline the set of APIs that create the file?
Also, which version of HDF5 is used? Is the group hierarchy flat? Is the H5Pset_libver_bounds function called with the earliest version set to the latest file format? From the current description it is really hard to interpret the result.
Thanks for the doc reference. After explicitly searching for the strings “SUPERBLOCK_VERSION”, “FREELIST_VERSION”, “SYMBOLTABLE_VERSION” and “OBJECTHEADER_VERSION” (produced by, I think, the version 1.8 h5dump tool) in that document, I get only one hit and that is for SUPERBLOCK_VERSION. I read the parts of the document where I got hits for that and surmise (though not 100% confidently) that a ‘2’ for that means libver latest in 1.8 and 1.10. Good. But, it would be good if the tool printed something like “latest available in version 1.8 of the library the file was created with” (if something like that makes sense). And, it would be good if the nomenclature printed by tools like h5dump matched the documentation and AFAICT, it doesn’t seem to.
Next, if all these version designators are available near the front (head or superblock) of the file, then why does one need to scan all 1.5 gigabytes of the file to obtain them? I mean, isn’t there a tool (or way of invoking a tool) that doesn’t stream through the entire file to print information like this? Is it possible to maybe terminate on first encounter (maybe like diff -q does when indicating if files are different or not)?
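For the superblock version at least, a full scan should not be needed. Per the HDF5 file format specification, the superblock starts with the 8-byte signature \x89HDF\r\n\x1a\n at byte offset 0, 512, 1024, 2048 and so on (doubling), and the byte right after the signature is the superblock version. A minimal sketch of such a peek, assuming a seekable binary stream (the function name and the max_offset cutoff are illustrative choices, not part of any HDF5 tool):

```python
import io

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def superblock_version(stream, max_offset=1 << 20):
    """Return (offset, version) of the first superblock found, else None.

    Hypothetical helper; reads only a few bytes at each candidate offset.
    """
    wanted = len(HDF5_SIGNATURE) + 1  # signature plus the version byte
    offset = 0
    while offset <= max_offset:
        stream.seek(offset)
        header = stream.read(wanted)
        if len(header) < wanted:
            return None  # ran past the end of the data
        if header[:-1] == HDF5_SIGNATURE:
            return offset, header[-1]
        # Candidate offsets are 0, 512, 1024, 2048, ...
        offset = 512 if offset == 0 else offset * 2
    return None


# Synthetic demo bytes (not a real file): signature followed by version 2.
fake = io.BytesIO(HDF5_SIGNATURE + bytes([2]) + b"\x00" * 64)
print(superblock_version(fake))  # (0, 2)
```

On a real file this would be called as superblock_version(open(path, "rb")). It only answers the superblock question; it says nothing about the versions of individual object headers deeper in the file.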
Finally, I had my file gzip compressed (from 1.5 Gigs down to ~70 megs), I wanted to do something like…
gunzip < myfile.h5.gz | h5dump -BH - | grep -i vers
But, it doesn’t look like chaining of HDF5 tools like this is possible yet. I would recommend adding support for the standard unix I/O redirection operations where appropriate including the dash as a file designator. Could be very helpful in situations like this.
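Until something like that exists in the tools, the superblock version can be read from a gzip-compressed file without unpacking it to disk, because only the first few hundred decompressed bytes are usually needed. This is a sketch using Python's standard gzip module, relying on the format specification's rule that the 8-byte HDF5 signature sits at offset 0, 512, 1024, ... and is followed by the version byte. The function name and max_offset cutoff are my own illustrative choices.

```python
import gzip
import io

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def gz_superblock_version(source, max_offset=1 << 20):
    """Return (offset, version) from a gzip-compressed HDF5 file, else None.

    `source` is a path or a binary file object. Hypothetical helper that
    decompresses sequentially and stops at the first superblock found.
    """
    wanted = len(HDF5_SIGNATURE) + 1  # signature plus the version byte
    with gzip.open(source, "rb") as stream:
        position = 0  # decompressed bytes consumed so far
        offset = 0    # next candidate superblock offset
        while offset <= max_offset:
            skip = offset - position
            if len(stream.read(skip)) < skip:
                return None  # data ended before the next candidate
            header = stream.read(wanted)
            position = offset + len(header)
            if len(header) < wanted:
                return None
            if header[:-1] == HDF5_SIGNATURE:
                return offset, header[-1]
            offset = 512 if offset == 0 else offset * 2
    return None


# Synthetic demo: a fake superblock at offset 512, gzip-compressed in memory.
payload = b"\x00" * 512 + HDF5_SIGNATURE + bytes([2]) + b"\x00" * 100
print(gz_superblock_version(io.BytesIO(gzip.compress(payload))))  # (512, 2)
```

For the case above the call would be gz_superblock_version("myfile.h5.gz"). As with h5dump -BH, this reports only the superblock version, not the versions of the other objects in the file.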
The split file driver is a good idea to help narrow in on what is going on. I will contact my colleague and try that next week. @koziol has a copy of the file and I am hoping to hear from him regarding what he thinks.
This was done with version 1.8.16 I think. Group hierarchy is ~8 layers and pretty wide fanout in spots. I also had this user write all datasets < 10 ints (or doubles or floats or whatever) as attributes instead. Still, only ~10% reduction in size and I was expecting more like 3-5x reduction.