HDF5-UDF 2.0 released: UDF signing, trust profiles, Python bindings, and more!


#1

I’m thrilled to announce that a new major release of User-Defined Functions for HDF5 is now available. If you haven’t heard of it before, HDF5-UDF is a tool for embedding routines written in C/C++, Python, or Lua in HDF5 files so that those routines execute each time the dataset is read.

This is a major improvement over the previous version and comes with so many good things it’s hard to believe! Here are some of the good bits:

  • UDF signing: UDFs are now automatically signed when saved to HDF5. When a user reads that UDF dataset for the first time, we extract the public key behind that UDF and associate it with a trust profile that limits which system calls the UDF can execute and which file system paths it can access.

  • UDF library: HDF5-UDF is now available as a library so you no longer have to use the hdf5-udf command-line utility to attach your UDFs to HDF5 files.

  • Python bindings: it’s now possible to programmatically compile and store UDFs from Jupyter Notebooks and regular Python scripts.

  • Source code storage: by default, HDF5-UDF compiles the source code and stores its resulting bytecode on the HDF5 file. Now it’s also possible to include the source code so the UDF can be modified and recompiled in the future.

  • New build system: we now use Meson and Ninja to build and install the code from source.
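To make the signing and trust profile idea a bit more concrete, here is a minimal sketch in plain Python. It is not HDF5-UDF code and does not use its API: the `TrustProfile` and `TrustStore` classes, the profile names, and the syscall and path lists are all hypothetical, chosen only to illustrate how a public key seen for the first time could be mapped to a profile that restricts system calls and file system paths.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class TrustProfile:
    """Hypothetical profile: which syscalls and paths a UDF may use."""
    name: str
    allowed_syscalls: frozenset = frozenset()
    allowed_paths: tuple = ()

    def allows_syscall(self, syscall):
        return syscall in self.allowed_syscalls

    def allows_path(self, path):
        # Simplification: a purely lexical check that does not resolve
        # ".." components or symbolic links.
        target = PurePosixPath(path)
        for root in self.allowed_paths:
            root = PurePosixPath(root)
            if target == root or root in target.parents:
                return True
        return False


# Hypothetical profiles; names and contents are illustrative only.
DENY = TrustProfile("deny")
DEFAULT = TrustProfile("default", frozenset({"read", "write", "close"}), ("/tmp",))


class TrustStore:
    """Maps a signer's public key to a trust profile."""

    def __init__(self, first_seen_profile):
        self._first_seen = first_seen_profile
        self._by_key = {}

    def profile_for(self, public_key):
        # On first read, an unknown key is associated with the
        # first-seen profile; later reads reuse that association.
        return self._by_key.setdefault(public_key, self._first_seen)

    def assign(self, public_key, profile):
        # The user can later move a key to a different profile.
        self._by_key[public_key] = profile
```

With this sketch, `TrustStore(DEFAULT).profile_for("some-key")` returns the default profile the first time a key is seen, and a later call to `assign` can move that key to a stricter or more permissive profile.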

Please refer to the project page on GitHub for download instructions and more details on the project. Feedback is welcome as usual!


#2

This sounds interesting. How portable is this? It sounds like it depends on Linux security features and on Python. How far is this functionality from being C/C++ only and usable under Windows? Is the purpose to generate datasets procedurally or to execute “embedded programs” stored in HDF5? It sounds like the latter, so I wonder what use cases are under consideration?


#3

Thanks! Porting HDF5-UDF to other operating systems should not be too difficult. Given that it’s possible to disable sandboxing, one should be able to get basic functionality and then progressively incorporate security-related features.

The codebase is primarily written in C++. Python is one of the programming languages you can use to write UDFs (the other two are Lua and C/C++) and, starting with this release, the first one to provide bindings for HDF5-UDF’s main library.

On the purpose of UDFs: they are meant to extend HDF5 by letting one generate datasets procedurally. There are several use cases:

  • Data virtualization: use HDF5 as an interface for files in other formats such as CSV and GeoTIFF

  • Gateway for IoT devices: embed the logic to retrieve live data from sensors and arrange them as if they were static HDF5 datasets

  • Storage and network bandwidth savings: if you have a dataset C that’s produced by combining datasets A and B, you can attach that logic as a compiled UDF instead of storing C itself, which grows the HDF5 file by just a few KB

  • Process data where data lives: UDFs bring HDF5 one step closer to computational storage

  • Process data when it’s needed: some data ingestion pipelines preprocess data in advance in the hope that the output will be used at some point. When such preprocessing scripts are attached as UDFs, data is only processed if the application requests it (i.e., when the UDF dataset is read)

  • Keep your scripts next to the data they process: never lose track of which scripts produced a given dataset
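As a rough illustration of the "dataset C from A and B" and "process data when it’s needed" cases, here is a small sketch in plain Python. It does not use HDF5 or the HDF5-UDF API: the `LazyDataset` class and the in-memory lists `A` and `B` are hypothetical stand-ins that only show the idea of storing the logic instead of the values and running it on each read.

```python
class LazyDataset:
    """Stand-in for a UDF-backed dataset: values are produced on read."""

    def __init__(self, shape, compute):
        self.shape = shape
        self._compute = compute  # the "UDF": logic stored instead of data
        self.reads = 0

    def read(self):
        # Nothing is computed until the application asks for the data.
        self.reads += 1
        values = self._compute()
        if len(values) != self.shape[0]:
            raise ValueError("UDF output does not match the declared shape")
        return values


# Hypothetical input datasets, kept in memory for the example.
A = [1.0, 2.0, 3.0]
B = [10.0, 20.0, 30.0]

# C stores only the rule "A + B", not the resulting values.
C = LazyDataset((3,), lambda: [a + b for a, b in zip(A, B)])
```

Because the values are recomputed on every `read()`, a change to `A` or `B` shows up the next time `C` is read, which is the same property that makes the live sensor use case possible.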

These are just some examples that should give you an idea of the power of UDFs. Please let me know if you have any more questions.