I have browsed the web already but I am still confused about this.
As I understand it, HDF5 does not provide any mechanism for indexing and querying data.
For example, I would like to search for a word in an array of strings, or iterate over an array in a specific sort order.
I could index the data myself and then use basic algorithms such as binary search etc. but I want to use existing tools and standards.
This project is a Java application.
Can anyone help me with that or point me in the right direction?
Correct, if by ‘HDF5’ you mean ‘the HDF5 library.’ There are community projects, such as PyTables, that support querying and indexing, and a few research projects, but the “native” selection mechanisms supported by the HDF5 library are based on the position of array elements and not on their value. See also Dataspaces and Data Transfer in the HDF5 User Guide.
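Since the library selects by position, a value-based lookup needs a separate index that maps values back to positions. Here is a minimal sketch of that idea in plain Java, assuming the String array has already been read from the file into memory. The class SortedPositionIndex and its methods are hypothetical names for illustration, not part of any HDF5 API.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper for illustration; not part of the HDF5 library.
final class SortedPositionIndex {

    // Returns the element positions ordered by the natural (lexicographic) order of their values.
    static Integer[] build(String[] values) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparing((Integer i) -> values[i]));
        return order;
    }

    // Binary search through the index; returns a position in values that holds key, or -1.
    static int find(String[] values, Integer[] order, String key) {
        int lo = 0;
        int hi = order.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = values[order[mid]].compareTo(key);
            if (cmp == 0) {
                return order[mid];
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] values = {"pear", "apple", "fig"}; // stand-in for data read from the file
        Integer[] order = build(values);
        for (int pos : order) {
            System.out.println(pos + " " + values[pos]); // iterate in sorted order
        }
        System.out.println(find(values, order, "fig"));
    }
}
```

The positions returned this way could then be used with the position-based selections that the library does support. The index array could also be stored as an extra integer dataset next to the strings, so it does not have to be rebuilt on every run.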
I have investigated all the links but I don’t see how Elasticsearch can be put on top of a local HDF5 file in a desktop Java application. The examples seem to be server-client and web-based.
My use-case is an HDF5 file that contains a String array and I would like to search in that String array with a Java API. If creating an index with that same library is necessary, that’s fine.
EDIT:
I was looking at Spark but I didn’t find any way to connect it to an HDF5 file.
Since you don’t want a client-server solution and want 100% Java, you may want to look at Apache Lucene for indexing.
You would need to write some simple Java code that reads the String arrays from HDF5 and feeds them to Apache Lucene.
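To make the idea concrete, here is a rough sketch of the kind of word-to-position index that a library like Lucene maintains for you. It is deliberately plain Java and does not use the Lucene API. The class name WordIndex, the sample strings, and the simple tokenizer are all assumptions for illustration, and the String array is assumed to have been read from the HDF5 file already.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a real indexing library; for illustration only.
final class WordIndex {

    // Maps each lower-cased word to the array positions whose string contains it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    WordIndex(String[] values) {
        for (int pos = 0; pos < values.length; pos++) {
            // Very simple tokenizer: split on anything that is not an ASCII letter, digit, or underscore.
            for (String token : values[pos].toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each position only once per word.
                if (list.isEmpty() || list.get(list.size() - 1) != pos) {
                    list.add(pos);
                }
            }
        }
    }

    // Returns the positions of all strings containing the word, in ascending order.
    List<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] values = {"red apple", "green pear", "Apple pie"}; // stand-in for HDF5 data
        WordIndex index = new WordIndex(values);
        System.out.println(index.lookup("apple"));
    }
}
```

A real indexing library adds proper text analysis, on-disk storage, and richer queries on top of this, which is the reason to prefer it over a hand-rolled map once the data grows.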
Regarding the Apache Drill query, please submit the issue to the Apache Drill community. Someone there can answer your question and make a patch for the HDF5 driver if necessary.
If you’re using netcdf-java and are still interested in Spark, you may also want to look at the latest top-level project from the Apache community: