Announcement: programmable datasets with Lua

On behalf of my team, I’m happy to announce the first release of HDF5-UDF: user-defined functions for HDF5. The project makes it possible to embed Lua scripts in HDF5 files so that users can programmatically define a dataset whose data is generated on the fly each time that dataset is read.

The primary motivation for this project is to dramatically reduce the disk space used by datasets that are variations of existing data. We have successfully used HDF5-UDF to virtually eliminate the impact of derived data in a number of use cases; grids that used to take a few gigabytes on disk, uncompressed, now require just a couple of kilobytes.

Under the hood, the Lua source code is converted to a bytecode representation that LuaJIT executes when the application reads the dataset. Thanks to just-in-time compilation, the overhead of this virtualization is barely noticeable: producing grids that have no dependency on existing datasets can be an order of magnitude faster than reading equivalent data from disk.

We invite everyone to try it out and to open pull requests. We hope you find it as useful as we do.

Thanks,
Lucas


Nice work, Lucas! An obvious use case might be testing. How would you read/evaluate a dataset w/ “UDF layout” via the C-API? (Of course, you can call Lua from C…). What are the potential safety issues and what kind of protective mechanisms can be put in place? Intrigued, G.

Hi, Gheber!

HDF5-UDF ships as a filter, so you use the same API as usual to retrieve datasets.

Once compiled, the UDF is written to disk by the filter callback, along with its metadata (data type, size, and dataset dependencies). When H5Dread is called, the filter retrieves the bytecode and the metadata, allocates a buffer for the output dataset, and passes that buffer to the Lua engine, which executes the bytecode and populates it.
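
To answer your C API question more concretely, here is a minimal sketch based on the sine-wave example that ships with the project. The only extra requirement is that HDF5 can locate the filter plugin, e.g. through HDF5_PLUGIN_PATH (error checking omitted for brevity):

```c
/*
 * Reading a UDF-backed dataset is just a regular H5Dread(); the filter
 * kicks in transparently.  The file name, dataset name and shape below
 * come from the sine-wave example (SineWave:100x10:float).
 */
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    float data[100][10];

    hid_t file = H5Fopen("sine_wave.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen(file, "SineWave", H5P_DEFAULT);

    /* The HDF5-UDF filter runs here: the embedded bytecode is executed
     * and this buffer is filled with the values it produces. */
    H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    printf("data[0][0] = %f\n", data[0][0]);

    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```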

We configure LuaJIT’s sandbox so that only a few core modules can be accessed by the UDF script: math, string, table, and a handful of others. We try to keep that list as small as possible to reduce the likelihood of a malicious UDF script escaping the sandbox. On the other hand, we understand that some users may want to run UDFs in controlled environments where the security rules could be relaxed; we’re actively working on that topic at the moment.
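
To illustrate the general idea (this is just a sketch of the technique, not our actual implementation): instead of calling luaL_openlibs() and exposing every standard library, the Lua state can be created with only the whitelisted modules loaded:

```c
/*
 * Illustrative sketch only: one way to restrict a Lua/LuaJIT state to a
 * whitelist of standard modules is to open just those libraries rather
 * than calling luaL_openlibs().
 */
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>

static lua_State *new_sandboxed_state(void)
{
    lua_State *L = luaL_newstate();

    /* Whitelist: base, math, string and table.  io, os, package, ffi
     * and the rest are simply never loaded. */
    static const luaL_Reg allowed[] = {
        { "",              luaopen_base   },
        { LUA_MATHLIBNAME, luaopen_math   },
        { LUA_STRLIBNAME,  luaopen_string },
        { LUA_TABLIBNAME,  luaopen_table  },
        { NULL,            NULL           }
    };

    for (const luaL_Reg *lib = allowed; lib->func != NULL; lib++) {
        lua_pushcfunction(L, lib->func);
        lua_pushstring(L, lib->name);
        lua_call(L, 1, 0);
    }

    /* A real sandbox also removes dangerous base functions such as
     * dofile and loadfile. */
    return L;
}
```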

One particular feature that’s currently disabled due to security concerns is access to the network module. Even though it enables really cool things (such as a dataset that, under the hood, connects to a weather forecasting service and retrieves the most recent temperature or precipitation grids), we wanted to keep this initial version stricter.

Thanks for your feedback!
Lucas

Lucas, I was able to follow your instructions under Debian 10 with the latest HDF5 development branch. Very nice!

I think there’s a minor error in the documentation. If I run hdf5-udf sine_wave.h5 sine_wave.lua SineWave:100x10:float32, it complains that

Datatype 'float32' is not supported
Failed to parse string 'SineWave:100x10:float32'

However, hdf5-udf sine_wave.h5 sine_wave.lua SineWave:100x10:float and hdf5-udf sine_wave.h5 sine_wave.lua SineWave:100x10:double work just fine.

Best, G.

It looks like I mixed up the float and int32 naming conventions when writing the example. Thanks for the fix!

Best regards,
Lucas

Hello! Here’s a new major feature that’s worth announcing.

HDF5-UDF now supports the creation of dynamic datasets in C/C++. Just like with the LuaJIT backend, users can write a callback function in plain C/C++ that’s compiled into a shared library, embedded in the HDF5 file, and executed on the fly when that dataset is read by the application through the standard HDF5 APIs.
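
To give a feel for what that looks like, here is a rough sketch of such a callback. The entry-point and helper names below are placeholders rather than the exact interface, so please refer to the README and the bundled examples for the real API:

```c
/*
 * Rough sketch of a C UDF callback.  The entry-point and helper names
 * are placeholders, not necessarily the exact HDF5-UDF interface.
 */
#include <stddef.h>
#include <math.h>

/* Hypothetical helpers assumed to be provided by the HDF5-UDF runtime. */
extern float *udf_get_data(const char *dataset_name);        /* output buffer  */
extern const size_t *udf_get_dims(const char *dataset_name); /* e.g. {100, 10} */

/* Compiled into a shared library, embedded in the HDF5 file, and executed
 * every time the dataset is read through the standard HDF5 APIs. */
void dynamic_dataset(void)
{
    float *out = udf_get_data("SineWave");
    const size_t *dims = udf_get_dims("SineWave");

    /* Populate the output as a flat, row-major buffer. */
    for (size_t i = 0; i < dims[0]; ++i)
        for (size_t j = 0; j < dims[1]; ++j)
            out[i * dims[1] + j] = sinf((float) i / (float) dims[0]) * (float) j;
}
```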

One of the cool features that come with this is out-of-the-box support for network sockets. For instance, it is possible to use HDF5 datasets as a proxy for data served by web services or stored in other formats. Here’s an example that retrieves a bitmap from the internet and converts it into an HDF5 dataset.

Please refer to the README file for more technical details. Bug reports, pull requests and questions are welcome as usual. Thanks!


Hi, folks! For those of you interested, it is now possible to write dynamic datasets in Python. The user-defined function that generates the dataset is compiled into bytecode form (.pyc) and embedded in the HDF5 file. When the application requests to read that dataset, we load the bytecode, perform the required bindings, and execute the user-defined function, which populates the HDF5 dataset the application expects.

This initial implementation depends on Python 3 and on the CFFI module. I have a gut feeling that it’s possible to convert the HDF5 malloc()'d buffer into a NumPy array (and back), but I haven’t investigated it enough yet. For now, the user-defined function sets and gets data as if it were a one-dimensional array – please see this example for details.

If you have time and interest, please test and provide feedback. Thanks!
