Getting all keys from HDF5 takes some time (H5PY) (SOLVED)


#1

Hi,
I have a lot of h5 files (size 1.5-3GB) with typically 3K keys on the root.
Using the keys() function (my_file.keys()) to extract all these keys takes about 5 seconds which in itself isn’t that bad.
However, I need to do this on potentially 10K files every day, which is not optimal.
In the key list there is one Group and the rest are datasets. I only need the key name of the Group.
Is there a way to access this group (without knowing its name) other than using keys() in Python (H5PY)?

Thanks.
(I apologize if this is posted in the wrong place)


#2

In HDF5 files, groups are implemented as collections of (named) links. I’m not an h5py expert (not even an amateur!), but I imagine the keys function is implemented as a traversal over such a collection and, perhaps, acquiring a bit of information about each link along the way. Unfortunately, from just looking at a link, one cannot tell if the destination is a dataset, a group, something else, or nothing at all. All you can do is try to limit the “damage,” i.e., iterate over the group via h5py.h5g.iterate and in the accompanying callback, determine if the destination is a group, and stop iterating if you’ve found one. Any other heuristics, e.g., about the link name, might also be used in this callback. Best case: the link to the group is first in the collection. Worst case: the link to the group in question is last in the collection.
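
A minimal sketch of that callback approach, assuming h5py's low-level h5g.iterate stops as soon as the callback returns anything other than None, and using h5o.get_info to check what each link points to. The helper name first_group_name and the demo file contents are made up for illustration.

```python
import os
import tempfile

import h5py


def first_group_name(loc):
    """Return the name (bytes) of the first link under loc that is a group, else None."""

    def stop_at_group(name):
        # name arrives as bytes; look up what the link points to
        info = h5py.h5o.get_info(loc, name)
        if info.type == h5py.h5o.TYPE_GROUP:
            return name  # any non-None return value ends the iteration
        return None  # keep iterating

    return h5py.h5g.iterate(loc, stop_at_group)


# Hypothetical demo file: two datasets and one group on the root
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("a_dataset", data=[1, 2, 3])
    f.create_dataset("b_dataset", data=[4, 5, 6])
    f.create_group("sensor_group")

with h5py.File(demo_path, "r") as f:
    found = first_group_name(f.id)
print(found)
```

A dangling link would likely make the get_info call raise, so real code may want to guard it.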

OK? G.


#3

Hi, @torbjorns!

Thank you for posting an interesting problem.
Do you have to use h5py?
My suggestion is to use the CLI for this purpose.

Here’s an example that illustrates why.

bash-3.2$ time h5ls  ~/data/ATL11_051911_0313_005_01.h5 
METADATA                 Group
ancillary_data           Group
orbit_info               Group
pt1                      Group
pt2                      Group
pt3                      Group
quality_assessment       Group

real	0m0.007s

Python equivalent (your case):

bash-3.2$ time python keys.py
<KeysViewHDF5 ['METADATA', 'ancillary_data', 'orbit_info', 'pt1', 'pt2', 'pt3', 'quality_assessment']>

real	0m0.539s

My recommendation is to use the CLI and feed the output to ElasticSearch.
Overall, I think your data management team needs consulting from us, because doing this on potentially 10K files every day doesn’t sound right.


#4

Hi @gheber

Thank you for your reply.
I tried out your iteration suggestion and it seems to be a better way as the group (name) I’m looking for is always the first.
However, I also learned that the waiting time I expected to come from getting the keys actually comes from getting the HDF5 file object (which is stored in the cloud) using the h5py.File function.

Anyways, thank you.


#5

Hi @hyoklee

Thank you for your reply.
I tried out your suggestion, but it seems to take even longer. Maybe I’m measuring wrong?
Bash:
$ time h5ls Z:/databases/746924132/hdf5/1047397987.h5 >/dev/null (to avoid output)

real    0m32.043s
user    0m0.000s
sys     0m0.015s

PowerShell

Measure-Command {h5ls Z:\databases\746924132\hdf5\1047397987.h5}

Seconds : 33
Milliseconds : 227
Ticks : 332277584
TotalDays : 0.000384580537037037
TotalHours : 0.00922993288888889
TotalMinutes : 0.553795973333333
TotalSeconds : 33.2277584
TotalMilliseconds : 33227.7584

Running the powershell script a second time is done in no time, so I guess there is some caching present.

With regard to the number of files that we read every day, I think it may be hard to get around, as each file corresponds to one signal sensor. The files are also provided by another company, so it is nothing we control.

Anyways, thanks for your input.


#6

You might wanna peek at the implementation of h5py.File. Perhaps it’s trying to be helpful by scanning the links in the root group. You can use the h5py.h5f.open low-level function to get around this, which should be more or less a straight call to H5Fopen. So a combo of h5py.h5f.open and h5py.h5g.iterate should be as good as it gets.
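
A rough sketch of that combo, assuming h5py.h5f.open takes the file name as bytes and that the file only needs to be read. The helper name first_link_name and the demo file are hypothetical.

```python
import os
import tempfile

import h5py


def first_link_name(path):
    """Open path via h5f.open and return the first root link name (bytes), or None if empty."""
    fid = h5py.h5f.open(os.fsencode(path), flags=h5py.h5f.ACC_RDONLY)
    try:
        # Returning the name from the callback stops after the first link
        return h5py.h5g.iterate(fid, lambda name: name)
    finally:
        fid.close()


# Hypothetical demo file: one group and one dataset on the root
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_group("group_first")
    f.create_dataset("zz_data", data=[1.0, 2.0])

print(first_link_name(demo_path))
```

This only returns the first link by name order; to be sure it is a group, the callback would also have to check the object type.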

G.


#7

Thanks for trying on Windows!
Can you share the file?
Is the Z: drive a mounted cloud store such as AWS S3?


#8

Hi @torbjorns,

The current HDF5 C API (and, consequently, most of its existing wrappers) does not provide a proper mechanism to create indexes that can be used to (greatly) speed up querying the structure of an HDF5 file, e.g. searching for a certain object (such as a group) amongst (tens of) thousands of other objects within an HDF5 file. This means that unless users develop their own solutions (e.g. populating a side persistence technology such as MongoDB with the structure of an HDF5 file to enable performant queries afterwards), they are left with the (rather cumbersome) option of traversing all the objects stored within an HDF5 file every time they need to search for something of interest. This approach may raise issues, the lack of performance being the most prominent one (as you described in this post).
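
As a rough illustration of such a home-grown side index, here is a sketch that caches each file's group name in SQLite (swapped in for MongoDB only because it ships with Python). The table layout, the function cached_group_name, and the find_group callback are all assumptions made for this example, not part of HDF5 or h5py.

```python
import os
import sqlite3
import tempfile


def cached_group_name(db, path, find_group):
    """Return the group name for path, calling find_group(path) only on a cache miss."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS group_index "
        "(path TEXT PRIMARY KEY, mtime REAL, group_name TEXT)"
    )
    mtime = os.path.getmtime(path)
    row = db.execute(
        "SELECT group_name FROM group_index WHERE path = ? AND mtime = ?",
        (path, mtime),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: the HDF5 file is not opened at all
    name = find_group(path)  # cache miss: the slow traversal happens here
    db.execute(
        "INSERT OR REPLACE INTO group_index VALUES (?, ?, ?)",
        (path, mtime, name),
    )
    db.commit()
    return name


# Stand-in for the real (slow) HDF5 lookup, so the sketch runs anywhere
calls = []


def fake_find_group(path):
    calls.append(path)
    return "sensor_group"


db = sqlite3.connect(":memory:")
with tempfile.NamedTemporaryFile(suffix=".h5", delete=False) as tmp:
    demo_path = tmp.name

first = cached_group_name(db, demo_path, fake_find_group)
second = cached_group_name(db, demo_path, fake_find_group)
print(first, second, len(calls))
```

The modification time is stored so that a file that changes on disk is traversed again instead of being served from a stale entry.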

It is in HDFql roadmap to introduce indexing capabilities in the “world” of HDF5. (FYI, HDFql is a high-level (declarative) language that abstracts users from HDF5 low-level details.) Basically, what we are envisioning is to enable users to create indexes containing information about the structure of an HDF5 file - indexing HDF5 (meta)data itself is out of the scope of this roadmap though. Canonically speaking, users could create indexes as follows (in HDFql):

CREATE [TRUNCATE] [INTERNAL | EXTERNAL] INDEX [FROM [USE FILE | hdf5_file_name]]

Depending on what users would specify, these indexes are either stored 1) internally in the HDF5 file in question and in a well-known dataset or 2) externally in a well-known side file managed by HDFql. After creating indexes, executing (HDFql) operations such as this one:

SHOW LIKE **/my_group

… would make HDFql check if indexes for that particular HDF5 file exist (by first checking internally and then externally if the former does not exist). If indexes exist, these are used to speed up queries; if indexes do not exist, HDFql would just fall back on traversing all objects like it does currently (and like all other HDF5 APIs do for that matter). Our aim is to have this (yet-to-be-developed) indexing mechanism in HDFql at least 10x faster than the traditional approach of traversing all objects using the HDF5 C API (as the reference).

Hope it helps & stay tuned!


#9

@contact, history repeats: Indexing and Fast-Query API? (2008)

"Our aim is to have this (yet-to-be-developed) indexing mechanism in HDFql at least 10x faster" (2022)

Please make sure that there will be enough market in 2036. :slight_smile:

If you feel confident, start early. :muscle:
It won’t be easy but I want to see it happening before I die. :pray:


#12

Hi @hyoklee,

AFAWK, the post you have shared refers to indexing HDF5 (meta)data to speed-up filtering data according to user-defined criteria afterwards. What we are envisioning for HDFql is to only index the structure of the HDF5 file (i.e. the names of (nested) objects and their organization within the file) though - this should satisfy the original issue (i.e. lack of performance when trying to identify a group amongst thousands of datasets) raised by @torbjorns.

It seems that what is presented/discussed in this post has some kind of overlap with MIQS and, eventually, also with HDF5 H5Q/H5X APIs. It would be great if all these disparate efforts could be federated into one single approach/solution - this would greatly benefit the HDF5 community at large.

That said, we also see an overlap between what we envision and HDF5 H5Q/H5X APIs but only the part that touches HDF5 structure. Indexing the structure of an HDF5 file should be easier to implement (when compared with indexing HDF5 (meta)data) and, hopefully, feasible before 2036 :slight_smile:

Hope it helps!


#13

Hi all,
To help anyone else working with the low-level HDF5 API in h5py, I’m posting my results here.
By using the low-level API I was able to reduce the extraction of data from 5-30 seconds to less than a second, saving me a lot of time.

import h5py
import numpy as np
from datetime import datetime


# callback for h5g.iterate: returning a non-None value stops the iteration,
# so passing this function yields the name of the first link only
def call_me(a):
    return a


if __name__ == '__main__':
    start = datetime.now()
    # open read-only (ACC_RDONLY is the named constant for flags=0)
    my_file_obj = h5py.h5f.open(b'Z:\\databases\\746924132\\hdf5\\276465213.h5', flags=h5py.h5f.ACC_RDONLY, fapl=None)
    print(f"Load file object: {(datetime.now() - start).total_seconds()}")

    start = datetime.now()
    group_id = h5py.h5g.iterate(my_file_obj, call_me)
    print(f"Get first group id from root: {(datetime.now() - start).total_seconds()}")
    print(group_id)

    start = datetime.now()
    grp = h5py.h5g.open(my_file_obj, group_id)
    print(f"Load group from root: {(datetime.now() - start).total_seconds()}")

    # open dataset to read (I build this string manually, but there are several datasets directly on the root
    # and '276465213_-2144474365_1541280351001' is one of them.)
    _dset = h5py.h5d.open(my_file_obj, b'276465213_-2144474365_1541280351001')
    # getting the attributes for the dataset
    _num_attrs = h5py.h5a.get_num_attrs(_dset)
    _attrs = {}
    for i in range(_num_attrs):
        _attr = h5py.h5a.open(_dset, index=i)
        num = np.empty(shape=_attr.shape, dtype=_attr.dtype)
        _attr.read(num)
        # scalar attributes have shape (), so num[0] would fail for them
        value = num[()] if num.ndim == 0 else num[0]
        # storing the attribute name and value in a dictionary
        _attrs[_attr.get_name().decode("utf-8")] = value

    start = datetime.now()
    # getting the id of one of my dataset. This dataset is in one of the root groups
    dset_id = h5py.h5g.iterate(grp, call_me)
    print(f"Get first dataset id: {(datetime.now() - start).total_seconds()}")
    print(dset_id)

    start = datetime.now()
    # get dataset object
    dset = h5py.h5d.open(grp, dset_id)
    print(f"Get first dataset: {(datetime.now() - start).total_seconds()}")
    # create numpy array to store data in
    num = np.empty(shape=dset.shape, dtype=dset.dtype)
    # load data into numpy array
    dset.read(dset.get_space(), dset.get_space(), num)

Hopefully this can help others save time.
Have a nice day.
(let me know if there are any bugs in the code, it’s not my final code)


#14

If you know that it’s always the first (or n-th) link you might even get away with the low-level group method get_objname_by_idx, i.e.,

dset = h5py.h5d.open(grp, grp.get_objname_by_idx(0))

but I wouldn’t call this “defensive programming.”
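
A slightly more defensive variant might look like the sketch below, assuming get_objname_by_idx, get_objtype_by_idx and get_num_objs are called as methods on the low-level group id. The helper name open_first_dataset and the demo file are hypothetical.

```python
import os
import tempfile

import h5py


def open_first_dataset(grp):
    """Open the link at index 0 of grp, but only if it really is a dataset."""
    if grp.get_num_objs() == 0:
        raise LookupError("group has no links")
    if grp.get_objtype_by_idx(0) != h5py.h5g.DATASET:
        raise LookupError("first link is not a dataset")
    return h5py.h5d.open(grp, grp.get_objname_by_idx(0))


# Hypothetical demo file: a group holding a single dataset
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_group("signals").create_dataset("values", data=[0, 1, 2, 3, 4])

with h5py.File(demo_path, "r") as f:
    dset = open_first_dataset(f["signals"].id)
    first_shape = dset.shape
print(first_shape)
```

The two checks cost little compared with opening the file, and they turn a silent wrong read into an explicit error.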

G.