pHDF5 1.12: trivial data transform function (expression = "x") and optimized parallel I/O routines

jan-willem.blokland · August 16, 2021, 7:56am

Hello,

Currently, we are working on building a library on top of the parallel version of the HDF5 library. In this library we make extensive use of the data transform functionality, which allows us to apply a unit conversion when, for example, writing a physical quantity to disk. For example, a user has the velocity in memory in meter / sec, while on disk he/she wants it to be in feet / sec.

While looking into the code, I got the impression that there is a clear distinction between the case where a data transform function is set and the case where it is not. When no data transform is set, the code makes use of optimized parallel I/O routines, assuming that the other conditions, like no data type conversion, are satisfied too. If I understand the code correctly, this means that for a trivial data transform function (expression = “x”) the code does not make use of these optimized parallel I/O routines.

So my question is: is there a way to make use of these optimized parallel I/O routines in the case the expression = “x”? If not, it may be worthwhile to extend the code for this case, for example by making H5Pset_data_transform() more sophisticated or by creating a function to unset a previously set data transform function. In my small test, I noticed a significant performance difference between the optimized routines and the non-optimized ones.
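To illustrate the kind of check I have in mind, here is a minimal sketch of a caller-side workaround: detect the trivial expression before it ever reaches the property list, so that the transfer property list stays without a transform. The helper name is_identity_transform is hypothetical and not part of HDF5; only H5Pset_data_transform (mentioned in the comment) is an actual HDF5 call. Whether skipping the call is enough to get the optimized routines depends on the other conditions being satisfied too.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (NOT part of HDF5): returns true when the
 * expression is just the variable "x", ignoring surrounding
 * whitespace and enclosing parentheses, e.g. "x", " x ", "((x))". */
bool is_identity_transform(const char *expr)
{
    if (expr == NULL)
        return false;

    const char *begin = expr;
    const char *end = expr + strlen(expr);

    for (;;) {
        /* Trim whitespace on both sides. */
        while (begin < end && isspace((unsigned char)*begin))
            begin++;
        while (end > begin && isspace((unsigned char)end[-1]))
            end--;

        /* Peel off one pair of enclosing parentheses, then repeat. */
        if (end - begin >= 2 && *begin == '(' && end[-1] == ')') {
            begin++;
            end--;
        } else {
            break;
        }
    }

    /* Only a single 'x' may remain. */
    return (end - begin == 1) && *begin == 'x';
}

/* Usage sketch (assumes hdf5.h and a dataset transfer property list dxpl):
 *
 *     if (!is_identity_transform(expr))
 *         H5Pset_data_transform(dxpl, expr);
 */
```

This only covers the case where the property list has not yet received a transform; a list that already carries one would need to be replaced by a fresh one.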

Best regards,
Jan-Willem

contact · August 16, 2021, 9:53am

Hi @jan-willem.blokland ,

Unfortunately, we don’t have much experience with the function H5Pset_data_transform - I hope someone can chime in and help clarify your issue though!

That said, would you mind describing what kind of (HDF5) data processing/transformation you are thinking of implementing at your organization? That would be great.

We are currently implementing a new (pre-/post-) processing engine in HDFql to allow users to process/transform (HDF5) data as they need, and your feedback could help steer this engine in the right direction. This processing engine is based on two kinds of functions: pre-defined and user-defined. To give a more precise idea:

pre-defined functions (these are implemented by HDFql, such as MIN, MAX, AVG, COUNT, and, whenever possible, automatically use all nodes and cores available to speed up computation - e.g. SELECT FROM MAX(my_dataset) -> this makes HDFql read dataset my_dataset and return its maximum value to the user)
user-defined functions (these are implemented by users as shared libraries that are dynamically loaded at runtime and used by HDFql to process/transform data in specific ways - e.g. SELECT FROM MYFUNC(my_dataset) -> this makes HDFql 1) load a shared library named HDFqlMYFUNC.{dll|so|dylib}, 2) pass the data of my_dataset to the user-defined function implemented in the library, and 3) return the result of calling the function to the user)

Thank you very much for the feedback!

jan-willem.blokland · August 16, 2021, 1:30pm

Hello,

We are using the function H5Pset_data_transform in the following setting.

Suppose we have 3 applications called A_writer, B_reader, C_qc. Each application has its own unit system.

Let's say A_writer computes a grid of velocity values. These values have the unit of meter / sec. These values are needed by application B_reader, and this application will be executed multiple times. However, B_reader expects these values to be in feet / sec instead of meter / sec. This is where the H5Pset_data_transform function comes into the picture. We write the velocity values computed by A_writer to HDF5 while using a data transform function such that the velocities are converted from meter / sec to feet / sec. In the HDF5 file we also store the unit system of the stored velocities.
Now, B_reader can just read these values and there is no unit conversion needed.
The application C_qc also reads the HDF5 file created by A_writer, but this application expects the velocities to be in meter / sec. Since we store the unit information of the velocities, C_qc can request to read these velocities and have them stored in memory as meter / sec. Again, we use the H5Pset_data_transform function to apply the unit conversion, this time from feet / sec to meter / sec.
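As a rough sketch of the two conversions, assuming the expressions are built as plain strings before being handed to H5Pset_data_transform: the helper names build_write_expr and build_read_expr are hypothetical, and the factor 0.3048 is the number of meters in one international foot.

```c
#include <stddef.h>
#include <stdio.h>

/* Meters per international foot (exact by definition). */
#define METERS_PER_FOOT 0.3048

/* Hypothetical helper: formats "x<op><factor>" into buf.
 * Returns 0 on success, -1 if buf is too small.
 * Assumes the default "C" numeric locale (decimal point '.'). */
static int build_scale_expr(char *buf, size_t size, char op, double factor)
{
    int n = snprintf(buf, size, "x%c%.10f", op, factor);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* A_writer: meter / sec in memory -> feet / sec on disk. */
int build_write_expr(char *buf, size_t size)
{
    return build_scale_expr(buf, size, '/', METERS_PER_FOOT);
}

/* C_qc: feet / sec on disk -> meter / sec in memory. */
int build_read_expr(char *buf, size_t size)
{
    return build_scale_expr(buf, size, '*', METERS_PER_FOOT);
}

/* Usage sketch (assumes hdf5.h, an open dataset dset, and a dataset
 * transfer property list dxpl):
 *
 *     char expr[64];
 *     if (build_write_expr(expr, sizeof expr) == 0) {
 *         H5Pset_data_transform(dxpl, expr);
 *         H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, data);
 *     }
 */
```

B_reader needs neither expression, since the values on disk are already in feet / sec.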

This is one of the features of the library which we are developing. In a sense we make a very clear distinction between data that is stored in HDF5 and in memory. Of course these unit conversions become trivial when the units in HDF5 are the same as the ones in memory.

Best regards,
Jan-Willem

gheber · August 16, 2021, 3:59pm

I think this is just a shortcut in the current implementation. The scope of data transformations under H5Pset_data_transform is so narrow (per-element, no datatype conversion, no dynamic allocations, etc.), that it might just work. But the code needs careful analysis to avoid any unintended consequences. On this end, unfortunately, we’ll have to wait until someone frees up to take a look.

Best, G.

gheber · August 16, 2021, 4:02pm

If you are already thinking about more complex transformations, this might be a good time to read up on HDF5-UDF.

G.

contact · August 16, 2021, 8:10pm

Thanks for the insightful feedback on how you are processing/transforming (HDF5) data at your organization @jan-willem.blokland!

jan-willem.blokland · August 17, 2021, 7:40am

@contact, you are welcome. This unit conversion between different applications is always tricky, therefore we are working on implementing a general solution for it.

@gheber, the data transform functionality works nicely for us, even with its current limitations. It is a powerful feature to have. Furthermore, I think I found a way to leverage the optimized parallel I/O routines in the case of a trivial data transform function (expression = “x”). If all the tests run fine, I will make a pull request for it.
