Data Structure Indexes and Querying

dseide · February 21, 2023, 2:13pm

Hi all,

I have browsed the web already but I am still confused about this.

As I understand it, HDF5 does not provide any mechanisms for indexing and querying data.
I am thinking about querying a word in an array of strings or just iterating over an array in a specific sort order.

I could index the data myself and then use basic algorithms such as binary search etc. but I want to use existing tools and standards.
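To make the do-it-yourself route concrete, here is a minimal sketch of what I mean, assuming the String array has already been read from the file into memory. The class name `SortedStringIndex` and its methods are hypothetical, and nothing here is part of any HDF5 API. It builds a sorted permutation of positions, which covers both iterating in sort order and exact lookup by binary search.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of any HDF5 library: a sorted permutation over
// an in-memory String array (assumed to have been read from the dataset already).
final class SortedStringIndex {
    private final String[] data;
    private final Integer[] order; // order[k] = position in data of the k-th smallest value

    SortedStringIndex(String[] data) {
        this.data = data;
        this.order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            order[i] = i;
        }
        // Sort the positions by the values they point at; the data itself is not moved.
        Arrays.sort(order, Comparator.comparing((Integer i) -> data[i]));
    }

    /** Returns the original positions, ordered by ascending value. */
    int[] positionsInSortOrder() {
        return Arrays.stream(order).mapToInt(Integer::intValue).toArray();
    }

    /** Binary search over the permutation; returns the original position of one match, or -1. */
    int find(String key) {
        int lo = 0;
        int hi = order.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = data[order[mid]].compareTo(key);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return order[mid];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] values = {"pear", "apple", "fig", "banana"};
        SortedStringIndex index = new SortedStringIndex(values);
        for (int pos : index.positionsInSortOrder()) {
            System.out.println(pos + " " + values[pos]);
        }
        System.out.println(index.find("fig"));
        System.out.println(index.find("kiwi"));
    }
}
```

The permutation could be stored as an extra integer dataset next to the strings, but that is exactly the kind of home-grown convention I would like to avoid.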

This project is a Java application.

Can anyone help me with that or point me in the right direction?

Thanks,
David

gheber · February 21, 2023, 2:48pm

Correct, if by ‘HDF5’ you mean ‘the HDF5 library.’ There are community projects, such as PyTables, that support querying and indexing, and a few research projects, but the “native” selection mechanisms supported by the HDF5 library are based on the position of array elements and not on their value. See also Dataspaces and Data Transfer in the HDF5 User Guide.

G.

hyoklee · February 21, 2023, 5:17pm

@dseide, Elasticsearch is the way to go.

Earthdata Search is a production-grade example.

Apache Spark is another example of using Elasticsearch.

Some integration example scripts are available here:

It would be great if you can start working on Apache Beam for HDF5:

I hope I gave you enough pointers to start with.

dseide · February 22, 2023, 8:50am

Thanks for your reply.
Unfortunately, PyTables is a Python project, so it cannot be used in a Java application.

dseide · February 22, 2023, 8:55am

Thanks for your reply.

I have investigated all the links but I don’t see how Elasticsearch can be put on top of a local HDF5 file in a desktop Java application. The examples seem to be server-client and web-based.

My use-case is an HDF5 file that contains a String array and I would like to search in that String array with a Java API. If creating an index with that same library is necessary, that’s fine.

EDIT:
I was looking at Spark but I didn’t find any way to connect it to an HDF5 file.

Now, I was able to query an HDF5 file with Apache Drill, but I cannot query its datasets.
I followed this tutorial:
https://drill.apache.org/docs/hdf5-format-plugin

Do you know how to query the String arrays?

I wasn’t able to find anything about indexing with either Spark or Drill, which makes me believe this is not what I am looking for.

There must be a Java API to query a dataset inside an HDF5 file.

hyoklee · February 22, 2023, 2:49pm

Elasticsearch can be installed as standalone.

Since you don’t want a client-server solution and want 100% Java, you may want to look at Apache Lucene for indexing:

I hope you can write some simple Java code that reads the String arrays from HDF5 and feeds them to Apache Lucene.
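As a rough illustration of the indexing step only, here is a sketch that uses plain Java collections as a stand-in for Lucene, so no Lucene API is shown. It assumes the strings have already been read from the HDF5 dataset into a `String[]`; the class name `WordIndex` and the simple tokenization rule (lower-case, split on anything that is not a letter or digit) are my own assumptions. It maps each word to the array positions that contain it, which is the core idea behind an inverted index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a real search library: a tiny in-memory inverted index.
final class WordIndex {
    // word -> ascending list of array positions whose string contains that word
    private final Map<String, List<Integer>> postings = new HashMap<>();

    WordIndex(String[] values) {
        for (int pos = 0; pos < values.length; pos++) {
            String[] tokens = values[pos].toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split() can yield an empty leading token
                }
                List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each position once, even if the word repeats within the string.
                if (list.isEmpty() || list.get(list.size() - 1) != pos) {
                    list.add(pos);
                }
            }
        }
    }

    /** Returns the positions of all strings containing the word (case-insensitive). */
    List<Integer> positionsOf(String word) {
        List<Integer> hits = postings.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.emptyList());
        return Collections.unmodifiableList(hits);
    }

    public static void main(String[] args) {
        String[] values = {"red apple", "green pear", "Apple pie"};
        WordIndex index = new WordIndex(values);
        System.out.println(index.positionsOf("apple"));
        System.out.println(index.positionsOf("kiwi"));
    }
}
```

A real Lucene index adds persistence, analyzers, and ranked queries on top of this idea; the returned positions are what you would then use to read the matching elements back from the HDF5 dataset.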

Regarding the Apache Drill query, please submit the issue to the Apache Drill community. Someone will answer your question and make a patch for the HDF5 driver if necessary.

If you’re using netcdf-java and are still interested in Spark,
you may also want to look at the latest top-level project from the Apache community:

https://sedona.apache.org/latest-snapshot/

Finally, thank you so much for sharing a very interesting problem!
