HDFql Batch-Processing Proposal

Hello HDF5 community,

In the past few months, we have been trying to identify use cases for how HDF5 data is typically (batch-)processed in different scenarios. We have also been thinking about how HDFql, a high-level (declarative) programming language for managing HDF5 data, could help with this task.

We now have a proposal (attached to this post) that we would like to share: it introduces an extension to HDFql’s SELECT operation. In essence, the extension allows the SELECT operation to read and (post-)process multiple datasets/attributes, potentially held across multiple HDF5 files. It is intended to lower the complexity of batch-processing HDF5 data by doing the work in a single HDFql operation, while preserving performance and access to HDF5 functionality.

We would now like to ask for feedback on the proposal and, if possible, for you to share the typical HDF5 (batch-)processing use cases your organization faces. Feel free to post your feedback/use cases here or to contact us through https://www.hdfql.com/#contact.

Hopefully, this post will trigger a wider discussion on (batch-)processing HDF5 data, a topic that does not seem to be discussed much, so that not only HDFql but also the wider HDF5 ecosystem may benefit.

We would like to deeply thank @gheber for his great support and feedback concerning this proposal!

Rick (for the HDFql Team)

hdfql_batch_processing_proposal.pdf (69.9 KB)

Good stuff! What about batch introspection? Let’s say I’m writing a query involving

SELECT FROM /data LIKE **/^test.h5$ ...

How do I determine which files were touched/found? In this instance, if I had access to the file system, I could do this by other means, but it’d be much harder for more general queries.

G.


Hi @gheber ,

Great question - glad you asked it! :)

The primary goal of the SELECT operation is to retrieve data stored in datasets or attributes, not to determine which files the operation touches/finds. For that purpose, you most likely need the SHOW FILE operation.

Starting with HDFql version 2.4.0 (i.e. the latest version), the SHOW FILE operation (which retrieves the files stored in the current working directory or in a specific directory, optionally recursively) was extended with the LIKE option. The syntax/semantics of this LIKE option are exactly the same as what we plan to implement in the SELECT operation (as part of its future batch-processing capabilities). This means that, taking the query you posted as an example, if you execute a statement such as this:

SHOW FILE /data LIKE **/^test.h5$

it returns all the files named test.h5 that are stored in any (sub)directory under the root directory /data. In other words, the files returned are exactly the same as the ones the SELECT operation (with batch-processing capabilities) would have touched/found. Your code could then combine the two operations (i.e. SHOW FILE and SELECT) to answer the question you raised.
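
To make the "list first, then process" idea concrete, here is a minimal plain-Python sketch of the two-step pattern. It does not call HDFql at all: `find_files` only emulates the behaviour described above for `LIKE **/^test.h5$`, under the assumption that the part after `**/` is applied as a regular expression to each file name, and `batch_with_introspection`, `process` and `demo` are hypothetical names introduced purely for illustration.

```python
import re
import tempfile
from pathlib import Path


def find_files(root, name_pattern):
    """Emulate the described LIKE **/<regex>: recurse under root, match file names."""
    regex = re.compile(name_pattern)
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and regex.search(p.name)
    )


def batch_with_introspection(root, name_pattern, process):
    """Run `process` on each matching file and report which files were touched."""
    touched = find_files(root, name_pattern)  # stand-in for SHOW FILE ... LIKE
    results = {rel: process(Path(root) / rel) for rel in touched}  # stand-in for SELECT
    return touched, results


def demo():
    """Build a throwaway directory tree and run the two-step pattern on it."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("test.h5", "a/test.h5", "a/mytest.h5", "a/b/other.h5"):
            path = Path(root) / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(b"placeholder")  # not a real HDF5 file
        # The per-file "processing" here is just the file size in bytes.
        return batch_with_introspection(root, r"^test.h5$", lambda p: p.stat().st_size)
```

Because the list of touched files is produced by the same pattern that drives the processing step, the caller always knows which files contributed to the result. Note that in a regular expression the unescaped dot in `^test.h5$` matches any character, not only a literal dot.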

Hope this clarifies the situation!


What about IN PARALLEL?

G.

Hi @gheber,

Yes, the IN PARALLEL option (of the SELECT operation) will support (post-)processing HDF5 data. (We left this option out of the canonical representation of the SELECT operation in the attached proposal only to keep it succinct.) This means that HDFql will be able to (post-)process HDF5 data in parallel within an MPI context. A simple example to illustrate this support:

SELECT FROM COUNT(dset, 10) IN PARALLEL

This reads dataset ‘dset’ in parallel (using MPI), counts the number of occurrences of 10 in the dataset, and returns this number to MPI rank #0.
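
For readers less familiar with this pattern, the count-then-reduce idea can be sketched in plain Python. This is only an emulation under stated assumptions: it uses neither MPI nor HDFql, the thread does not say how HDFql would partition the dataset across ranks (contiguous, nearly equal chunks are assumed here), and `split_into_chunks` and `parallel_count` are hypothetical names.

```python
def split_into_chunks(values, n_ranks):
    """Split values into n_ranks contiguous, nearly equal chunks (one per rank)."""
    base, extra = divmod(len(values), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional element each.
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


def parallel_count(values, target, n_ranks):
    """Emulate count-then-reduce: each rank counts locally, rank 0 gets the sum."""
    # In a real MPI program each rank would read and count only its own chunk.
    local_counts = [chunk.count(target) for chunk in split_into_chunks(values, n_ranks)]
    # Summing the partial counts plays the role of a sum-reduction onto rank 0.
    return sum(local_counts)
```

Whatever the number of ranks, the reduced total equals the count over the whole dataset, which is why a counting operation like this lends itself to parallel execution.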

Hope this helps clarify the situation!