At my job we’re trying to optimize our read times for particular datasets. We are exploring various data layouts and chunking options. A constraint is that our files should not get bigger and writes shouldn’t get significantly slower.
Our first naïve attempt, using a constant chunk size and data layout for every dataset (around 200 of them in one file), showed an across-the-board reduction in read times with negligible impact on writes. Certain instances of our files, however, showed an increase in file size.
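For context, the uniform policy amounts to something like the following h5py sketch (the chunk shape and gzip filter are placeholders, not our actual settings):

```python
import h5py
import numpy as np

CHUNK_SHAPE = (1024,)   # hypothetical uniform chunk shape applied to every dataset
COMPRESSION = "gzip"    # hypothetical filter; our real settings differ

def write_uniform(path, arrays):
    """Write every dataset with the same chunk shape and compression filter."""
    with h5py.File(path, "w") as f:
        for name, data in arrays.items():
            f.create_dataset(
                name,
                data=data,
                chunks=CHUNK_SHAPE,
                compression=COMPRESSION,
            )

write_uniform("example.h5", {"dset_a": np.arange(1_000_000, dtype="f8")})
```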
We used h5stat to reveal that we picked up an extra 25 MB of overhead from metadata while keeping the same raw data size. So the file got bigger, but the actual compressed data stayed roughly the same size.
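The closest we can get to that split from h5py today is a file-level estimate like the sketch below, which only sums the allocated raw-data storage per dataset and attributes everything left over to metadata and free space:

```python
import os
import h5py

def storage_breakdown(path):
    """Roughly split a file into raw-data storage vs. everything else.

    Dataset.id.get_storage_size() reports the space allocated for the
    (compressed) raw data of each dataset; it does not include object
    headers or the chunk index, so file_size - raw_total approximates
    the overall metadata / free-space overhead that h5stat reports.
    """
    raw_total = 0

    def visit(name, obj):
        nonlocal raw_total
        if isinstance(obj, h5py.Dataset):
            raw_total += obj.id.get_storage_size()

    with h5py.File(path, "r") as f:
        f.visititems(visit)

    file_size = os.path.getsize(path)
    return file_size, raw_total, file_size - raw_total

size, raw, overhead = storage_breakdown("example.h5")
print(f"file={size}  raw_data={raw}  overhead~{overhead}")
```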
We see no option in h5stat or h5dump to see how much metadata is used per dataset for compression. Looking through the h5stat code, we see this struct being used to obtain the metadata for the overall file, but it doesn’t report it per dataset.
We would like to see that value per dataset, so we can determine which datasets are the problem. Looking at the compression ratio provided by h5dump, we were able to back-calculate that number and verify that it does not account for overhead from the chunked storage. h5dump also shows that all of our datasets are compressing well; the compression ratio is >> 1 for most datasets.
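A per-dataset version of that back-calculation looks roughly like the sketch below (the ratio here, like h5dump's, ignores the chunk-index overhead we actually care about; get_num_chunks() needs a recent h5py built against HDF5 1.10.5 or newer):

```python
import h5py

def dataset_report(path):
    """Per-dataset logical size, allocated storage, compression ratio, chunk count.

    get_storage_size() only covers the (compressed) raw chunks, so the
    ratio matches what h5dump reports and still hides the chunk-index
    (B-tree) overhead we are trying to pin down.
    """
    rows = []

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return
        logical = obj.size * obj.dtype.itemsize
        stored = obj.id.get_storage_size()
        ratio = logical / stored if stored else float("inf")
        # get_num_chunks() only applies to chunked layouts
        nchunks = obj.id.get_num_chunks() if obj.chunks else 0
        rows.append((name, logical, stored, ratio, nchunks))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return rows

for name, logical, stored, ratio, nchunks in dataset_report("example.h5"):
    print(f"{name}: {logical} -> {stored} bytes (ratio {ratio:.1f}, {nchunks} chunks)")
```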
Our path forward is to investigate a non-uniform layout, i.e. a custom layout per dataset with custom chunking rules, maybe even making certain datasets compact (all information stored in the object header) if the overhead from chunking metadata is too high and the data is small enough.
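For the compact case we would have to drop down to h5py's low-level API, since as far as we can tell the high-level create_dataset() does not expose a compact layout. A rough sketch of what we have in mind, with the usual HDF5 limit that compact raw data must stay under 64 KiB:

```python
import h5py
import numpy as np

def create_compact_dataset(group, name, data):
    """Create a dataset with a COMPACT layout (data stored in the object
    header, so no chunk index and no separate raw-data allocation)."""
    data = np.ascontiguousarray(data)
    if data.nbytes >= 64 * 1024:
        raise ValueError("too large for compact storage")

    space = h5py.h5s.create_simple(data.shape)
    tid = h5py.h5t.py_create(data.dtype, logical=True)
    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    dcpl.set_layout(h5py.h5d.COMPACT)

    dsid = h5py.h5d.create(group.id, name.encode(), tid, space, dcpl=dcpl)
    dset = h5py.Dataset(dsid)
    dset[...] = data
    return dset

with h5py.File("example.h5", "a") as f:
    create_compact_dataset(f, "small_table", np.arange(100, dtype="f8"))
```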
Is there a built-in method in h5stat or another tool to obtain such a granular breakdown? I wrote an issue over at h5py about getting the information through their API. We are willing to do the PR over there, but we are not intimately familiar with HDF5 technology and want to make sure that the struct we are looking at in the h5stat source code would provide all the necessary information to obtain the storage overhead of chunked datasets.
EDIT:
We are not too keen on post-processing our generated HDF5 files and splitting the datasets into one file per dataset. These files are generated straight from our simulation, and we produce a few million of them, ranging in size from 25 MB up to 2.5 GB. We would like a tool that shows this breakdown so we can find the best fit across all the variation in our few million HDF5 files.