H5I_get_name call is very slow for HDF5 file > 5 GB

spenshel · March 4, 2021, 5:26pm

I’m reading an HDF5 file larger than 5 GB. Given a link that is assumed to be a hard link, I call H5Iget_name to determine the path that the link targets. The results are correct, but each call takes 10-30 seconds, which is unacceptable because there are many thousands of links to examine. Is there a way to make the H5Iget_name call faster, or an alternative routine for obtaining the target path of a hard link? If you can help, thanks.

gheber · March 4, 2021, 9:41pm

Unless your file contains mostly metadata, the performance of H5Iget_name doesn’t have much to do with the file size.

Just for clarification (and to be pedantic), you are providing an object handle (the argument to the obj_id parameter) from which you are trying to determine a path.

Can you tell us something about your environment? Which version of the HDF5 library are you using? Is the file (files?) you are trying to access stored on a local file system or on some kind of network share/NFS? Can you send us the output of h5stat?

I would expect to see poor H5Iget_name performance in a networked setup because you will feel every bit of latency unless your requests can be serviced out of one of the caches. Have you timed your calls or looked at the gperftools output? I’d be surprised if you were spending all that time in the HDF5 library.

Best, G.

spenshel · March 5, 2021, 3:11pm

Thanks for your response. I am using version 1.12 of the HDF5 library through the Java interface on Windows, with a copy of the large data file on my local hard drive. With a small file the call to H5Iget_name takes no time at all, but not so with the > 5 GB file.

Here is the output when I run h5stat on the file:

# of unique groups: 318422
# of unique datasets: 1292978
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 905

Superblock: 96
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 12736880/0
Datasets (exclude compact data): 351690016/185049968
Datatypes: 0/0
Groups:
B-tree/list: 304512648
Datatypes: 0/0
Attributes:
B-tree/list: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared messages:
Header: 0
B-tree/list: 0
Heap: 0
Free-space managers:
Header: 0
Amount of free space: 0
Small groups (with 0 to 9 links):
# of groups with 2 link(s): 44906
…
Total # of small groups: 263187
Group bins:
# of groups with 1-9 links: 263187
# of groups with 10-99 links: 55120
# of groups with 100-999 links: 115
Dataset dimension information:
Max. rank of datasets: 1
Dataset ranks:
# of dataset with rank 0: 669362
# of dataset with rank 1: 623616
1-D Dataset information:
Max. dimension size of 1-D datasets: 168537
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 214146
…
Total # of small datasets: 386965
1-D Dataset dimension bins:
# of datasets with dimension size 1-9: 386965
…
Total # of datasets: 623616
Dataset storage information:
Total raw data size: 5035777471
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 1292978
Dataset layout counts[CHUNKED]: 0
Dataset layout counts[VIRTUAL]: 0
Number of external files: 0
Dataset filters information:
Number of datasets with:
NO filter: 1292978
GZIP filter: 0
…
USER_DEFINED filter: 0
Dataset datatype information:
#of unique datatypes used by datasets: 14
Dataset datatype #0:
Count (total/named) = (333684/0)
Size (desc./elmt) = (22/8)
Dataset datatype #1:
Count (total/named) = (813260/0)
Size (desc./elmt) = (14/8)
Dataset datatype #2:
Count (total/named) = (102699/0)
Size (desc./elmt) = (10/10)
Dataset datatype #3:
Count (total/named) = (741/0)
Size (desc./elmt) = (10/4)
Dataset datatype #4:
Count (total/named) = (1/0)
Size (desc./elmt) = (10/9)
Dataset datatype #5:
Count (total/named) = (1/0)
Size (desc./elmt) = (10/80)
Dataset datatype #6:
Count (total/named) = (1/0)
Size (desc./elmt) = (10/11)
Dataset datatype #7:
Count (total/named) = (41541/0)
Size (desc./elmt) = (10/8)
Dataset datatype #8:
Count (total/named) = (2/0)
Size (desc./elmt) = (10/29)
Dataset datatype #9:
Count (total/named) = (1/0)
Size (desc./elmt) = (10/50)
Dataset datatype #10:
Count (total/named) = (1/0)
Size (desc./elmt) = (10/40)
Dataset datatype #11:
Count (total/named) = (1/0)
Size (desc./elmt) = (10/27)
Dataset datatype #12:
Count (total/named) = (591/0)
Size (desc./elmt) = (10/20)
Dataset datatype #13:
Count (total/named) = (454/0)
Size (desc./elmt) = (14/1)
Total dataset datatype count: 1292978
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small #of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Free-space persist: FALSE
Free-space section threshold: 1 bytes
Small size free-space sections (< 10 bytes):
Total # of small size sections: 0
Free-space section bins:
Total # of sections: 0
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
File metadata: 724493320 bytes
Raw data: 5035777471 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 2263753 bytes
Total space: 5762534544 bytes
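
For a sense of scale, the summary figures above can be turned into a few ratios. This is only a back-of-the-envelope sketch: the numbers are copied from the h5stat output, and the variable names are made up for illustration.

```python
# Figures copied from the h5stat summary above.
file_metadata_bytes = 724_493_320
total_bytes = 5_762_534_544
unique_groups = 318_422
unique_datasets = 1_292_978

# Every group and dataset carries its own object header and link entries.
object_count = unique_groups + unique_datasets
metadata_share = file_metadata_bytes / total_bytes
metadata_per_object = file_metadata_bytes / object_count

print(f"objects: {object_count}")
print(f"metadata share of file: {metadata_share:.1%}")
print(f"metadata per object: {metadata_per_object:.0f} bytes")
```

In other words, the file holds roughly 1.6 million objects, and about an eighth of its bytes are metadata rather than raw data.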

gheber · March 5, 2021, 11:15pm

OK, we’ve got quite a bit of metadata. There are several lines of inquiry, and maybe we start with the ones that require the smallest effort.

1. Could you outline the workflow around how H5Iget_name gets invoked? You are calling it 100s or 1,000s of times, right?
2. We need to understand if the JNI overhead is negligible or a factor. Doing a run w/ gperftools and visualizing in kcachegrind should give us a few clues.
3. Which version of the file format are you using? If you are using the defaults, you’re probably running the oldest version. To explore the effect of optimizations for large object numbers in newer versions, you can use h5repack and create a repackaged version of your file with the latest file format. (No point in changing your program w/o tangible benefit.)
4. Are you using any HDF5 1.12 or even 1.10 specific features? If not, can you build and run your app with HDF5 1.8.22?

Let’s maybe start with that?

There are plenty more options (e.g., we can look at the metadata cache size/configuration/performance), but those will require a bit more effort.

Best, G.

spenshel · March 6, 2021, 2:10am

1. I am developing a program for viewing/editing the contents of an HDF5 file, similar to HDFview. As it reads a link from a source file, it calls H5Iget_name to determine if the link refers to an object in the current group, or is a hard link referring to an object located somewhere else.
2. I’m guessing gperftools and kcachegrind are Linux-only programs.
3. I’ll definitely try this.
4. I like some of the new features of 1.12, including H5Oopen_by_token.

gheber · March 6, 2021, 3:10am

To be honest, I don’t understand what you are trying to achieve. Maybe the source of confusion is this:

Groups do not contain objects. If they contain anything they contain links. This might seem pedantic but has a few unexpected consequences:

A given HDF5 object can be linked to multiple groups. In other words, the concept of a parent or predecessor of an HDF5 object is undefined (in general). With respect to H5Iget_name that means that whatever pathname it returns, it could be one of several.

Not only can an object be linked to multiple groups, but it can also be linked, under different names, multiple times to the same group.

Finally, if the object in question is a group, it can even be linked to itself.

I’m saying all that only to caution you against reading too much into what H5Iget_name returns.
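
The three points above can be illustrated with a toy model. This is a sketch only: it does not use the HDF5 library, the integer "tokens" and the link table are invented for illustration, and `paths_to` is a hypothetical helper. It shows that one object can be reachable under several pathnames, including via a group that links to itself.

```python
# Toy model, not the HDF5 API: each group token maps to a list of
# (link_name, destination_token) pairs. Tokens are made-up integers.
ROOT = 0
links = {
    0: [("a", 1), ("b", 2)],          # root group links to groups 1 and 2
    1: [("data", 3), ("alias", 3)],   # same dataset linked twice in one group
    2: [("shared", 3), ("self", 2)],  # linked again elsewhere; group 2 links to itself
}

def paths_to(target, links, root=ROOT):
    """Return every cycle-free path from root whose last link resolves to target."""
    found = []

    def walk(group, prefix, on_path):
        for name, dest in links.get(group, []):
            path = f"{prefix}/{name}"
            if dest == target:
                found.append(path)
            # Descend only into groups not already on the current path,
            # otherwise a self-link would recurse forever.
            if dest in links and dest not in on_path:
                walk(dest, path, on_path | {dest})

    walk(root, "", {root})
    return sorted(found)

print(paths_to(3, links))  # three different pathnames for one dataset
print(paths_to(2, links))  # a group reachable directly and through its self-link
```

Any one of these pathnames would be a valid answer to "what is this object's name?", which is why a single returned name should not be treated as the object's location.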

Under certain circumstances, for example when your HDF5 (multi-)graph structure is a tree, then, of course, there is exactly one path to each object. But in that case, calling H5Iget_name is overkill, unless you are in a corner of your application where all you have is a handle (hid_t) and you would like to recover a pathname.
It’s overkill, because, as you know already, it’s a very expensive operation, and to catalog all pathnames in the file you could just use H5Ovisit and conveniently pick up the pathnames along the way.
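
The idea of picking up pathnames during a single traversal can be sketched with the same kind of toy model. Again, this does not call the HDF5 library: the link table, the integer tokens, and `catalog_paths` are assumptions made for illustration. In the real library the H5Ovisit callback is handed each object's name relative to the starting object, so the bookkeeping below would live in that callback.

```python
# Toy model, not the HDF5 API: group token -> [(link_name, destination_token)].
tree_links = {
    0: [("g1", 1), ("g2", 2)],
    1: [("x", 3), ("y", 4)],
    2: [("z", 5)],
}

def catalog_paths(links, root=0):
    """Walk the link table once and record one path per object token."""
    catalog = {root: "/"}
    stack = [(root, "")]
    while stack:
        group, prefix = stack.pop()
        for name, dest in links.get(group, []):
            if dest in catalog:
                continue  # already recorded; each object is visited once
            catalog[dest] = f"{prefix}/{name}"
            if dest in links:
                stack.append((dest, catalog[dest]))
    return catalog

catalog = catalog_paths(tree_links)
print(catalog[4])
```

After one pass, looking up a pathname is a dictionary access by token rather than a separate expensive query per object.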

The phrase “… object in the current group …” is a (potentially) misleading way of saying “… object linked to the current group …”.

On the other hand, given an object via a handle hnd, it is legitimate to ask if the corresponding object is linked (and how many times) to a given group. The only way to determine that is to retrieve the object’s address or token and examine the destinations (resoluble to addresses or tokens) of all links in the given group. H5Iget_name won’t help, unless you can make additional assumptions about the global structure.
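
The membership check described above reduces to comparing the target's token against the destination of each link in the group. A minimal sketch of that logic follows, assuming the link destinations have already been resolved to tokens; the data and the helper `links_to_object` are hypothetical, and in HDF5 1.12 the comparison itself would presumably go through something like H5Otoken_cmp rather than `==`.

```python
# Toy model, not the HDF5 API: links of one group as
# (link_name, destination_token) pairs with made-up integer tokens.
group_links = [("data", 3), ("alias", 3), ("other", 7)]

def links_to_object(group_links, target_token):
    """Return the names of links in one group whose destination is target_token."""
    return [name for name, dest in group_links if dest == target_token]

matches = links_to_object(group_links, 3)
print(len(matches), matches)
```

The length of the result answers "how many times is the object linked to this group", and an empty list means it is not linked there at all.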

OK?

G.

spenshel · March 8, 2021, 9:13pm

OK, I see I was looking at hard links the wrong way. It’s more convenient to compare one object’s token to another’s to determine whether the two refer to the same object in the file. Thanks for all of your help.
