Access to 'miscellaneous dataset information'/get size of compressed dataset


#1

Hi all,
is there a way to get access to the information of a dataset which is displayed in the ‘General Object Info’ tab?
Especially I need information about the compression of a compressed dataset, e.g. with GZIP. In ‘Miscellaneous Dataset Information’ I can see the compression ratio and the storage size.
But if I try to get the size of the dataset with <dataset>.size, I only get the uncompressed value. Is this behavior correct?
I’m using h5py=3.2.1, hdf5=1.12.0 and Python 3.9.5.

Thank you for your support!
Jan


#2

Hi,
I don't work with Python, but the filter call has this information and should pass it to the chunk write. I am typing from a phone, so let's see what others have to say. Or you can dig around in the C API filters, look at the function prototype, and then see how it is used in Python.

Steve


#3

Hi Steven,
thank you for your suggestions. I have looked at where the filters are called in Python, but even there the only available size is that of the uncompressed data. I’m sorry, but I’m not familiar with C; I think this is why I didn’t find anything in the C API.
Do you have any idea how I can get the information I want?

Jan


#4

Hi Jan,
I am not certain I pointed you in the right direction; if not, my apologies. This is what the callback should look like; you are interested in the nbytes passed to the function and the returned value. What makes it complex is that filters are usually implemented as shared objects, loaded by the library on demand, so you may not have direct access to them.

typedef size_t (*H5Z_func_t) (unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[], size_t nbytes, size_t *buf_size, void **buf);

If successful, the filter operation callback function returns the number of valid bytes of data contained in buf. In the case of failure, the return value is 0 (zero) and all pointer arguments are left unchanged.

ratio = nbytes / ret_val
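To make the ratio concrete, here is a minimal Python sketch that uses zlib.compress as a stand-in for the gzip (deflate) filter. It does not go through the HDF5 filter pipeline at all; the helper name compression_ratio, the compression level, and the buffer contents are assumptions made only for illustration.

```python
import zlib


def compression_ratio(raw: bytes, level: int = 4) -> float:
    """Return nbytes / compressed size for one buffer (hypothetical helper)."""
    nbytes = len(raw)                          # bytes handed to the "filter"
    ret_val = len(zlib.compress(raw, level))   # valid bytes after compression
    return nbytes / ret_val


# A highly repetitive buffer, standing in for one chunk of dataset data.
chunk = b"\x00" * 1_000_000
print(f"ratio = {compression_ratio(chunk):.1f}")
```

A ratio above 1 means the buffer shrank; random or already-compressed data would give a ratio close to (or slightly below) 1.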

My apologies if this led you in the wrong direction. I am only familiar with the C API calls.

best: steve


#5

Hi Steve,
thanks again for your suggestions, and no worries, they were a great help. Meanwhile I’ve tried the C API again, and I can get the correct storage size with H5Dget_storage_size(dset), which returns an hsize_t. That’s perfect; now I’ll check whether this function is also available in Python.
In C, I don’t know how to get the size of the uncompressed dataset (so exactly the opposite of the situation in Python).
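For reference, the uncompressed size is the number of elements multiplied by the size of one element; in C that would presumably be H5Sget_simple_extent_npoints together with H5Tget_size. Below is a sketch of the same calculation through h5py's low-level wrappers. The file name, dataset name, shape, and chunking are made up for the example, and it assumes a fixed-size datatype.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.zeros((100, 100), dtype="f8"),
                     chunks=(10, 100), compression="gzip")

with h5py.File(path, "r") as f:
    dsid = f["data"].id                                      # h5py.h5d.DatasetID
    npoints = dsid.get_space().get_simple_extent_npoints()   # number of elements
    elem_size = dsid.get_type().get_size()                   # bytes per element
    logical_size = npoints * elem_size                       # uncompressed bytes

print(f"uncompressed size: {logical_size} bytes")
```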

Jan


#6

I would recommend looking at the h5py documentation (https://docs.h5py.org/en/stable/high/dataset.html for datasets). Most of the information about an HDF5 dataset is available as an attribute, e.g., h5py.File('filename.h5')['path/to/field'].compression should return the name of the compression filter, or None if none was applied.
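As a small sketch of what those attributes look like, the snippet below creates a throwaway in-memory file and reads back a few filter-related properties. The file name, dataset path, and settings are assumptions for the example, not anything from the original question.

```python
import h5py
import numpy as np

# In-memory file (nothing is written to disk); all names are hypothetical.
with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("path/to/field", data=np.arange(1000),
                            chunks=(100,), compression="gzip",
                            compression_opts=4)
    info = {
        "compression": dset.compression,            # filter name or None
        "compression_opts": dset.compression_opts,  # gzip level here
        "chunks": dset.chunks,                      # chunk shape or None
        "shuffle": dset.shuffle,                    # shuffle filter on/off
    }

print(info)
```

Note that none of these attributes report how many bytes the compressed data actually occupies in the file.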


#7

Hi Steve,

I’ve found the solution in the low-level API of h5py.
The dataset’s id object (class h5py.h5d.DatasetID) owns a method called get_storage_size which returns the storage size used by a dataset.
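A minimal sketch of that call, using a throwaway file so it runs on its own (the file name, dataset name, and shape are made up, and the file is reopened before reading so that all chunks have been written out):

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "storage_demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.zeros((1000, 1000), dtype="f8"),
                     chunks=(100, 1000), compression="gzip")

# Reopen read-only so the compressed chunks are definitely in the file.
with h5py.File(path, "r") as f:
    dset = f["data"]
    stored = dset.id.get_storage_size()   # bytes allocated in the file
    logical = dset.nbytes                 # uncompressed size in bytes

print(f"stored: {stored} bytes, uncompressed: {logical} bytes")
```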
Thank you for your help!

All best
Jan


#8

I apologize for skimming the question too quickly. I thought you were asking for more basic information about datasets, which you obviously already knew.


#9

Hi @jan!

Yes, you are on the right track. Below is a Python example that prints the compression ratio of every chunked, filtered HDF5 dataset in a file:

import h5py


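def comp_ratio(name, obj):
    """Print the compression ratio of a chunked, filtered dataset."""
    # Only chunked datasets can have filters (e.g. compression) applied.
    if isinstance(obj, h5py.Dataset) and obj.chunks is not None:
        dcpl = obj.id.get_create_plist()
        # Skip datasets that have no filters in their pipeline.
        if dcpl.get_nfilters():
            stor_size = obj.id.get_storage_size()
            # Avoid dividing by zero when no storage has been allocated.
            if stor_size != 0:
                ratio = float(obj.nbytes) / float(stor_size)
                print(f'Compression ratio for "{obj.name}": {ratio}')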


fname = 'example.h5'
with h5py.File(fname, mode='r') as h5f:
    h5f.visititems(comp_ratio)

-Aleksandar


#10

Hi @ajelenak!

Thank you very much for your code example. It works fine, and it’s exactly what I was looking for!

All best
Jan