Getting all keys from HDF5 takes some time (H5PY) (SOLVED)

torbjorns · November 14, 2022, 2:40pm

Hi,
I have a lot of h5 files (size 1.5-3GB), each with typically 3K keys at the root.
Using the keys() function (my_file.keys()) to extract all these keys takes about 5 seconds, which in itself isn’t that bad.
However, I need to do this on potentially 10K files every day, which is not optimal.
In the key list there is one Group and the rest are datasets. I only need the key name of the Group.
Is there a way to access this group (without knowing its name) other than using keys() in Python (H5PY)?

Thanks.
(I apologize if this is posted in the wrong place)

gheber · November 14, 2022, 3:36pm

In HDF5 files, groups are implemented as collections of (named) links. I’m not an h5py expert (not even an amateur!), but I imagine the keys function is implemented as a traversal over such a collection and, perhaps, acquiring a bit of information about each link along the way. Unfortunately, from just looking at a link, one cannot tell if the destination is a dataset, a group, something else, or nothing at all. All you can do is try to limit the “damage,” i.e., iterate over the group via h5py.h5g.iterate and in the accompanying callback, determine if the destination is a group, and stop iterating if you’ve found one. Any other heuristics, e.g., about the link name, might also be used in this callback. Best case: the link to the group is first in the collection. Worst case: the link to the group in question is last in the collection.
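
The early-exit iteration described above could look roughly like the sketch below. It is only an illustration: first_group_name is a hypothetical helper, the type check is assumed to go through h5py.h5o.get_info, and a dangling soft link in the group would make that call raise.

```python
import h5py


def first_group_name(loc):
    """Return the name (bytes) of the first link under `loc` that resolves
    to a group, or None if there is none.

    `loc` is a low-level GroupID; a FileID also works because it refers to
    the root group.
    """

    def visit(name):
        # Returning anything other than None stops the iteration early.
        info = h5py.h5o.get_info(loc, name)
        return name if info.type == h5py.h5o.TYPE_GROUP else None

    return h5py.h5g.iterate(loc, visit)
```

With the high-level API the same helper can be called as first_group_name(f.id) on an open h5py.File f. The cost still depends on where the group link sits in the iteration order.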

OK? G.

hyoklee · November 14, 2022, 3:47pm

Hi, @torbjorns!

Thank you for posting an interesting problem.
Do you have to use h5py?
My suggestion is to use CLI for this purpose.

Here’s an example that illustrates why.

bash-3.2$ time h5ls  ~/data/ATL11_051911_0313_005_01.h5 
METADATA                 Group
ancillary_data           Group
orbit_info               Group
pt1                      Group
pt2                      Group
pt3                      Group
quality_assessment       Group

real	0m0.007s

Python equivalent (your case):

bash-3.2$ time python keys.py
<KeysViewHDF5 ['METADATA', 'ancillary_data', 'orbit_info', 'pt1', 'pt2', 'pt3', 'quality_assessment']>

real	0m0.539s

My recommendation is to use CLI and feed the output to ElasticSearch.
Overall, I think your data management team needs consulting from us, because doing this on potentially 10K files every day doesn’t sound right.
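
For anyone who wants to drive the CLI from a script, here is a minimal sketch. It assumes h5ls is on the PATH and prints the two-column layout shown above; group_names and list_root_groups are hypothetical helper names, and forwarding the result to ElasticSearch is left out.

```python
import re
import subprocess

# Matches h5ls lines such as "pt1                      Group".
_GROUP_LINE = re.compile(r"^(?P<name>.+?)\s+Group\s*$")


def group_names(h5ls_output):
    """Return the names of all entries that h5ls reports as groups."""
    names = []
    for line in h5ls_output.splitlines():
        match = _GROUP_LINE.match(line)
        if match:
            names.append(match.group("name"))
    return names


def list_root_groups(path):
    """Run h5ls on `path` and return the root-level group names."""
    result = subprocess.run(
        ["h5ls", path], capture_output=True, text=True, check=True
    )
    return group_names(result.stdout)
```

Note that h5ls still lists every link in the root group, so this only helps if starting Python and h5py is the expensive part.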

torbjorns · November 15, 2022, 12:02pm

Hi @gheber

Thank you for your reply.
I tried out your iteration suggestion and it seems to be a better way, as the group (name) I’m looking for is always the first.
However, I also learned that the waiting time I attributed to getting the keys actually comes from getting the HDF5 file object (the file is stored in the cloud) using the h5py.File function.

Anyways, thank you.

torbjorns · November 15, 2022, 12:02pm

Hi @hyoklee

Thank you for your reply.
I tried out your suggestion but it seems to take even longer. Maybe I’m measuring wrong?
Bash:
$ time h5ls Z:/databases/746924132/hdf5/1047397987.h5 >/dev/null (to avoid output)

real    0m32.043s
user    0m0.000s
sys     0m0.015s

PowerShell

Measure-Command {h5ls Z:\databases\746924132\hdf5\1047397987.h5}
…
Seconds : 33
Milliseconds : 227
Ticks : 332277584
TotalDays : 0.000384580537037037
TotalHours : 0.00922993288888889
TotalMinutes : 0.553795973333333
TotalSeconds : 33.2277584
TotalMilliseconds : 33227.7584

Running the PowerShell command a second time finishes almost instantly, so I guess some caching is involved.
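
One way to make the cold-versus-warm difference visible from Python is to time the same call several times in one process. This is only a sketch; time_repeated is a hypothetical helper and the callable you pass in is up to you.

```python
import time


def time_repeated(fn, repeats=3):
    """Call fn() `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations
```

For example, time_repeated(lambda: h5py.File(path, "r").close()) would time just the file open. If the first entry is much larger than the rest, the cost is most likely in fetching the file rather than in listing its keys.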

With regard to the number of files that we read every day, I think it may be hard to get around, as each file corresponds to one signal sensor. The files are also provided by another company, so it is nothing we control.

Anyways, thanks for your input.

gheber · November 15, 2022, 12:45pm

You might wanna peek at the implementation of h5py.File. Perhaps it’s trying to be helpful by scanning the links in the root group. You can use the h5py.h5f.open low-level function to get around this, which should be more or less a straight call to H5Fopen. So a combo of h5py.h5f.open and h5py.h5g.iterate should be as good as it gets.

G.

hyoklee · November 15, 2022, 1:00pm

Thanks for trying on Windows!
Can you share the file?
Is the Z: drive a mounted cloud store like AWS S3?

contact · November 15, 2022, 4:18pm

Hi @torbjorns,

The current HDF5 C API (and, consequently, most of its existing wrappers) does not provide a proper mechanism to create indexes that can be used to (greatly) speed up querying the structure of an HDF5 file, e.g. searching for a certain object (e.g. a group) amongst (tens of) thousands of other objects within an HDF5 file. This means that unless users develop their own solutions (e.g. populating a side persistence technology, such as MongoDB, with the structure of an HDF5 file to enable performant queries afterwards), they are left with the (rather cumbersome) option of traversing all the objects stored within an HDF5 file every time they need to search for something of interest. This approach may raise issues, the lack of performance being the most prominent one (as you described in this post).
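
As an illustration of such a user-built side store, the sketch below swaps MongoDB for SQLite from the Python standard library. Everything in it is an assumption made for the example: the helper names open_index and cached_group_name, the root_group table, and the rule that a file with unchanged size and modification time still has the same group name. The resolve argument stands for whatever expensive lookup you already have.

```python
import os
import sqlite3


def open_index(db_path=":memory:"):
    """Open (or create) the side index that maps a file to its group name."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS root_group ("
        " path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, name TEXT)"
    )
    return conn


def cached_group_name(conn, path, resolve):
    """Return the group name for `path`, calling resolve(path) only when the
    file is new or its size/modification time has changed."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT name FROM root_group"
        " WHERE path = ? AND size = ? AND mtime_ns = ?",
        (path, st.st_size, st.st_mtime_ns),
    ).fetchone()
    if row is not None:
        return row[0]
    name = resolve(path)  # the expensive HDF5 lookup happens only here
    conn.execute(
        "INSERT OR REPLACE INTO root_group VALUES (?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns, name),
    )
    conn.commit()
    return name
```

This only pays off when the same files are queried more than once; for files that are read a single time the lookup cost is unchanged.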

It is on the HDFql roadmap to introduce indexing capabilities in the “world” of HDF5. (FYI, HDFql is a high-level (declarative) language that abstracts users from HDF5 low-level details.) Basically, what we are envisioning is to enable users to create indexes containing information about the structure of an HDF5 file; indexing HDF5 (meta)data itself is out of the scope of this roadmap though. Canonically speaking, users could create indexes as follows (in HDFql):

CREATE [TRUNCATE] [INTERNAL | EXTERNAL] INDEX [FROM [USE FILE | hdf5_file_name]]

Depending on what users would specify, these indexes are either stored 1) internally in the HDF5 file in question and in a well-known dataset or 2) externally in a well-known side file managed by HDFql. After creating indexes, executing (HDFql) operations such as this one:

SHOW LIKE **/my_group

… would make HDFql check if indexes for that particular HDF5 file exist (by first checking internally and then externally if the former does not exist). If indexes exist, these are used to speed up queries; if indexes do not exist, HDFql would just fall back on traversing all objects like it does currently (and like all other HDF5 APIs do for that matter). Our aim is to have this (yet-to-be-developed) indexing mechanism in HDFql at least 10x faster than the traditional approach of traversing all objects using the HDF5 C API (as the reference).

Hope it helps & stay tuned!

hyoklee · November 15, 2022, 5:40pm

@contact, history repeats: Indexing and Fast-Query API? - #7 by Quincey_Koziol (2008)

“Our aim is to have this (yet-to-be-developed) indexing mechanism in HDFql at least 10x faster” (2022)

Please make sure that there will be enough market in 2036.

If you feel confident, start early.
It won’t be easy, but I want to see it happening before I die.

contact · November 16, 2022, 11:18am

Hi @hyoklee,

AFAWK, the post you have shared refers to indexing HDF5 (meta)data to speed-up filtering data according to user-defined criteria afterwards. What we are envisioning for HDFql is to only index the structure of the HDF5 file (i.e. the names of (nested) objects and their organization within the file) though - this should satisfy the original issue (i.e. lack of performance when trying to identify a group amongst thousands of datasets) raised by @torbjorns.

It seems that what is presented/discussed in this post has some kind of overlap with MIQS and, eventually, also with HDF5 H5Q/H5X APIs. It would be great if all these disparate efforts could be federated into one single approach/solution - this would greatly benefit the HDF5 community at large.

That said, we also see an overlap between what we envision and HDF5 H5Q/H5X APIs but only the part that touches HDF5 structure. Indexing the structure of an HDF5 file should be easier to implement (when compared with indexing HDF5 (meta)data) and, hopefully, feasible before 2036.

Hope it helps!

torbjorns · November 21, 2022, 2:18pm

Hi all,
To help others working with the low-level HDF5 API, I’m posting my results here.
By using the low-level h5py API I was able to reduce the extraction of data from 5-30 seconds to less than a second, saving me a lot of time.

import h5py
import numpy as np
from datetime import datetime


# function used to get the first item of an iterator
def call_me(a):
    return a


if __name__ == '__main__':
    start = datetime.now()
    my_file_obj = h5py.h5f.open(b'Z:\\databases\\746924132\\hdf5\\276465213.h5', flags=0, fapl=None)
    print(f"Load file object: {(datetime.now() - start).total_seconds()}")

    start = datetime.now()
    group_id = h5py.h5g.iterate(my_file_obj, call_me)
    print(f"Get first group id from root: {(datetime.now() - start).total_seconds()}")
    print(group_id)

    start = datetime.now()
    grp = h5py.h5g.open(my_file_obj, group_id)
    print(f"Load group from root: {(datetime.now() - start).total_seconds()}")

    # open dataset to read (I build this string manually, but there are several datasets directly on the root
    # and '276465213_-2144474365_1541280351001' is one of them.)
    _dset = h5py.h5d.open(my_file_obj, b'276465213_-2144474365_1541280351001')
    # getting the attributes for the dataset
    _num_attrs = h5py.h5a.get_num_attrs(_dset)
    _attrs = {}
    for i in range(_num_attrs):
        _attr = h5py.h5a.open(_dset, index=i)
        num = np.empty(shape=_attr.shape, dtype=_attr.dtype)
        _attr.read(num)
        value = num[0]
        # storing the attribute name and value in a dictionary
        _attrs[_attr.get_name().decode("utf-8")] = value

    start = datetime.now()
    # getting the id of one of my dataset. This dataset is in one of the root groups
    dset_id = h5py.h5g.iterate(grp, call_me)
    print(f"Get first dataset id: {(datetime.now() - start).total_seconds()}")
    print(dset_id)

    start = datetime.now()
    # get dataset object
    dset = h5py.h5d.open(grp, dset_id)
    print(f"Get first dataset: {(datetime.now() - start).total_seconds()}")
    # create numpy array to store data in
    num = np.empty(shape=dset.shape, dtype=dset.dtype)
    # load data into numpy array
    dset.read(dset.get_space(), dset.get_space(), num)

Hopefully this can help others save time.
Have a nice day.
(let me know if there are any bugs in the code, it’s not my final code)

gheber · November 21, 2022, 3:23pm

If you know that it’s always the first (or n-th) link, you might even get away with the get_objname_by_idx method of the group ID, i.e.,

dset = h5py.h5d.open(grp, grp.get_objname_by_idx(0))

but I wouldn’t call this “defensive programming.”
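
A slightly more defensive variant of the by-index lookup could check the member count and the member type first. This is a sketch under assumptions: it relies on the low-level GroupID methods get_num_objs, get_objtype_by_idx and get_objname_by_idx, and open_first_dataset is a hypothetical helper name.

```python
import h5py


def open_first_dataset(grp):
    """Open member 0 of `grp` (a low-level GroupID) as a dataset, after
    checking that the group is non-empty and that member 0 is a dataset."""
    if grp.get_num_objs() == 0:
        raise LookupError("group has no members")
    if grp.get_objtype_by_idx(0) != h5py.h5g.DATASET:
        raise TypeError("first member is not a dataset")
    return h5py.h5d.open(grp, grp.get_objname_by_idx(0))
```

The two extra calls cost little compared with opening the file, and they turn a silent wrong pick into an explicit error.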

G.
