Obtain storage overhead usage per group/dataset

At my job we’re trying to optimize read times for particular datasets. We are exploring various data layouts and chunking options. The constraints are that our files should not get bigger and writes shouldn’t get significantly worse.

Our first naïve attempt, using a constant chunk size and data layout for every dataset (around 200 of them in one file), showed an across-the-board reduction in read times with negligible impact on writes. Certain instances of our files, however, showed an increase in file size.

Using h5stat, we found that we picked up an extra 25 MB of metadata overhead while the raw data size stayed the same. So the file got bigger, but the actual compressed data was roughly the same size.

We see no option in either h5stat or h5dump to report how much metadata each dataset uses for chunked/compressed storage. Looking through the h5stat code, we see this struct being used to obtain the metadata totals for the overall file, but it doesn’t report them per dataset.

We would like to see that value per dataset so we can determine which datasets are the problem. Looking at the compression ratio provided by h5dump, we were able to back-calculate that number and verify that it does not account for overhead from chunked storage.

Looking at h5dump, we see that all of our datasets are compressing well; the compression ratio is >> 1 for most datasets.

Our path forward is to investigate a non-uniform layout, i.e., a custom layout per dataset with custom chunking rules, and maybe even making certain datasets compact (all information stored in the header) if the overhead from chunking metadata is too high and the data is small enough.

Is there a built-in method in h5stat or another tool to obtain such a granular breakdown? I opened an issue over at h5py about getting this information through their API. We are willing to do the PR over there, but we are not intimately familiar with HDF5 internals and want to make sure that the struct we are looking at in the h5stat source code would provide all the information necessary to obtain the storage overhead of chunked datasets.

EDIT:

We are not too keen on post-processing our generated HDF5 files and splitting them into one file per dataset. These files come straight out of our simulations, and we generate a few million of them, ranging in size from 25 MB up to 2.5 GB. We would like a tool that shows this breakdown so we can find the best fit across all the variation in our few million HDF5 files.
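In the meantime, here is a rough sketch of the per-dataset breakdown we can already get from h5py (the file name is made up, and if I remember right the chunk-query call needs h5py ≥ 2.10 built against HDF5 ≥ 1.10.5). Note that it still does not report the chunk-index metadata itself, which is exactly the number we are missing:

```python
import h5py

def storage_report(path):
    """Print logical vs. allocated size for every dataset in a file.

    Caveat: the chunk-index (b-tree) metadata is not exposed through
    this API, which is exactly the number we are missing.
    """
    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return
        logical = obj.size * obj.dtype.itemsize    # in-memory size from the NumPy dtype
        stored = obj.id.get_storage_size()         # allocated raw-data storage in the file
        nchunks = obj.id.get_num_chunks() if obj.chunks is not None else 0
        ratio = logical / stored if stored else float("nan")
        print(f"{name}: shape={obj.shape} chunks={obj.chunks} "
              f"logical={logical} stored={stored} written_chunks={nchunks} ratio={ratio:.2f}")

    with h5py.File(path, "r") as f:
        f.visititems(visit)

storage_report("simulation_output.h5")   # hypothetical file name
```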

Can you give us an idea of your datasets’ datatype (fixed or variable size?), the typical extent (dimensions), and your chunk sizes? Which version(s) of the library are you using?

Aside from user-level metadata (a.k.a. attributes), there aren’t many candidates for file-level metadata other than the chunk index. A trivial example where you’d notice metadata overhead would be a small dataset with a single chunk that compresses well. Depending on the superblock level, you might have a b-tree or array-based chunk index, but this would all be overhead compared to a small raw data payload. You’d be penalized on reading and writing via code path length, complexity, and the low throughput of the filter pipeline. (As you said, a compact layout typically would perform much better.)
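As an illustration, here is a minimal sketch of creating a compact dataset through h5py’s low-level interface (the file name, dataset name, and sizes are made up; as far as I can tell the high-level create_dataset call does not expose the compact layout directly):

```python
import h5py
import numpy as np

# Hypothetical example: a small array stored with a COMPACT layout so its
# values live in the object header instead of going through chunked storage.
data = np.arange(64, dtype=np.float64)    # 512 bytes, well under the 64 KiB object-header limit

with h5py.File("compact_demo.h5", "w") as f:
    space = h5py.h5s.create_simple(data.shape)
    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    dcpl.set_layout(h5py.h5d.COMPACT)     # raw data goes into the object header, no chunk index
    dsid = h5py.h5d.create(f.id, b"small_dataset", h5py.h5t.NATIVE_DOUBLE, space, dcpl)
    dsid.write(h5py.h5s.ALL, h5py.h5s.ALL, data)

    dset = h5py.Dataset(dsid)             # wrap the low-level id for high-level access
    print(dset[:5], dset.id.get_storage_size())
```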

Yes, you are looking at the right library structures, but I think the overhead question can be settled without digging too deeply into the format.

G.

Hi Gerd,

Thank you for the reply.

I will get you a full dump and attach it as an Excel file.

The file will contain: uncompressed size, compressed size, compression ratio, shape, chunk size, max dimensions, dataset layout, gzip compression level, and other filters used.

munged_hybridc2_info.xlsx (720.5 KB)

I pulled this from our current test implementation.

Any advice on improvements would be great; all of the information was gathered with h5py.

One question I have is: why do the compact datasets have a compression ratio of less than 1? I realize they are not compressed, so it doesn’t make sense for them to have one at all, but I just calculate it for every dataset.

What I am confused about is why it isn’t exactly 1. The storage_size reported through this interface and the size in memory computed from the NumPy datatype differ. Maybe that’s it: the NumPy representation is slightly bigger than the raw data itself.
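For reference, the ratio is calculated along these lines (a sketch of the calculation; the in-memory size comes from the NumPy dtype as described above):

```python
def compression_ratio(dset):
    """In-memory size (from the NumPy dtype) divided by allocated storage.

    For an 'O' (variable-length) column, dtype.itemsize only counts the
    pointer, so the numerator is not the true size of the data.
    """
    in_memory = dset.size * dset.dtype.itemsize
    on_disk = dset.id.get_storage_size()
    return in_memory / on_disk if on_disk else float("nan")
```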

Thanks for the spreadsheet. I need to take a closer look.

Take, for example, /table_244:

[('column_0', '<f8'), ('column_1', '<f8'), ('column_2', '<i4'), ('column_3', '<f8'), ('column_4', '<f8'), ('column_5', '<f8'), ('column_6', '<f8'), ('column_7', '<f8'), ('column_8', '<i4'), ('column_9', 'O'), ('column_10', '<f8'), ('column_11', '<f8'), ('column_12', '<f8'), ('column_13', '<f8'), ('column_14', '<f8'), ('column_15', '<f8'), ('column_16', '<f8'), ('column_17', '<f8'), ('column_18', '<f8'), ('column_19', '<f8'), ('column_20', '<f8'), ('column_21', '<f8'), ('column_22', '<f8'), ('column_23', '<f8'), ('column_24', '<f8'), ('column_25', '<f8'), ('column_26', '<f8'), ('column_27', '<f8'), ('column_28', '<f8'), ('column_29', '<f8'), ('column_30', '<f8'), ('column_31', '<f8'), ('column_32', '<f8'), ('column_33', '<f8')]

column_9 is of NumPy type O (Python object). What kind of Python object is that & how is that represented in HDF5? Can you send us the h5dump output (datatype definition) for that dataset?

G.

I can answer this quickly.

Fixed-length strings come in as NumPy dtype 'S'. If you see an 'O' NumPy dtype, it’s because the HDF5 type is a variable-length string and it’s being read in as a Python bytes instance.

To read the dtypes:

'<f8' is endianness-datatype-num_bytes, i.e., a little-endian ('<'), 64-bit (8-byte) float.

| character |                                datatype |
|----------:|----------------------------------------:|
| '?'       | boolean                                 |
| 'b'       | (signed) byte                           |
| 'B'       | unsigned byte                           |
| 'i'       | (signed) integer                        |
| 'u'       | unsigned integer                        |
| 'f'       | floating-point                          |
| 'c'       | complex-floating point                  |
| 'm'       | timedelta                               |
| 'M'       | datetime                                |
| 'O'       | (Python) objects                        |
| 'S', 'a'  | zero-terminated bytes (not recommended) |
| 'U'       | Unicode string                          |
| 'V'       | raw data (void)                         |
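A quick way to check how NumPy reads one of these strings (just an illustration, using a shortened version of the /table_244 dtype):

```python
import numpy as np

dt = np.dtype('<f8')
print(dt.str, dt.kind, dt.itemsize)          # '<f8': kind 'f' (float), 8 bytes, little-endian

table_dt = np.dtype([('column_0', '<f8'), ('column_9', 'O')])   # shortened /table_244 dtype
for name in table_dt.names:
    field = table_dt[name]
    print(name, field.kind, field.itemsize)  # column_9: kind 'O', itemsize is just the pointer size
```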

If O is stored as a variable-length string, that might explain the difference. Variable-length strings are stored in a global heap. The value (the array) of a dataset with a compact storage layout is stored in the object header only if it’s of a fixed-size datatype. If you look at the value of an O field stored in the object header, you’ll see just the heap ID rather than the value (the string) itself. Since the heap ID is of fixed size, the size of the value in the object header will typically underestimate the true size (i.e., the string will be larger than a heap ID). Does that make sense?

G.

That makes perfect sense. The data in the header is a pointer to the string in the heap, not the string itself, and that pointer can be smaller than the string.
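To make this concrete, here is a small sketch (file and dataset names made up) comparing what get_storage_size() reports for the same strings stored fixed-length vs. variable-length. If I understand H5Dget_storage_size correctly, the variable-length figure covers only the stored heap IDs, not the string bytes sitting in the global heap:

```python
import h5py
import numpy as np

# Hypothetical file: the same 100 strings stored fixed-length and variable-length.
strings = [b"a string that is clearly longer than a global heap ID"] * 100

with h5py.File("vlen_demo.h5", "w") as f:
    fixed = f.create_dataset("fixed", data=np.array(strings, dtype="S64"))
    vlen = f.create_dataset("vlen", data=np.array(strings, dtype=object),
                            dtype=h5py.string_dtype())

    print("fixed:", fixed.id.get_storage_size())   # roughly 100 * 64 bytes: the characters themselves
    print("vlen: ", vlen.id.get_storage_size())    # much smaller: heap IDs only, strings live in the global heap
```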