Announcement: programmable datasets with Lua


#1

On behalf of my team, I’m happy to announce the first release of HDF5-UDF: user-defined functions for HDF5. The project enables the embedding of Lua scripts in HDF5 so that users can programmatically define a dataset whose data is generated on the fly each time that dataset is read.

The primary motivation for this project is to dramatically reduce the disk space used by datasets that are a variation of existing data. We have successfully used HDF5-UDF to virtually eliminate the impact of derived data in a number of use cases; grids that used to take a few gigabytes on disk, uncompressed, now require just a couple of kilobytes.

Underneath, the source code is converted to a bytecode representation that LuaJIT executes when the dataset is read by the application. Through Just-In-Time compilation the overhead of virtualization is barely noticed: outputting grids that have no dependency on existing datasets can be an order of magnitude faster than reaching out to disk for I/O.

We invite everyone to try it out and to open pull requests. We hope you find it as useful as we do.

Thanks,
Lucas


#2

Nice work, Lucas! An obvious use case might be testing. How would you read/evaluate a dataset w/ “UDF layout” via the C-API? (Of course, you can call Lua from C…). What are the potential safety issues and what kind of protective mechanisms can be put in place? Intrigued, G.


#3

Hi, Gheber!

HDF5-UDF ships as a filter, so you use the same API as usual to retrieve datasets.

Once compiled, the UDF is written to disk in the filter callback, together with metadata describing the output dataset (its data type, size, and dataset dependencies). When H5Dread is called, the filter retrieves the bytecode and the metadata. It then allocates the output dataset and passes it to the Lua engine, which executes the bytecode and populates that dataset.
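
To make that flow a bit more concrete, here is a rough model of it in plain Python. This is only a sketch of the idea, not the actual filter code: the real UDFs are Lua scripts run by LuaJIT, and the on-disk layout, the `store_udf`/`read_udf` helpers, the `dynamic_dataset` entry point, and the `output`/`dims` names are all assumptions made up for this illustration. Nothing in it is sandboxed.

```python
import json
import marshal
import math
from array import array

# Hypothetical UDF source. The real project takes Lua scripts, not Python.
UDF_SOURCE = """
def dynamic_dataset():
    rows, cols = dims
    for i in range(rows):
        for j in range(cols):
            output[i * cols + j] = math.sin(i / 10.0)
"""

def store_udf(source, name, dims, dtype):
    """Write path: compile the UDF and pack bytecode plus metadata into bytes."""
    code = compile(source, "<udf>", "exec")
    metadata = {"name": name, "dims": list(dims), "dtype": dtype, "dependencies": []}
    header = json.dumps(metadata).encode("utf-8")
    # Assumed layout: 4-byte header length, JSON metadata, then the bytecode.
    return len(header).to_bytes(4, "little") + header + marshal.dumps(code)

def read_udf(blob):
    """Read path: recover metadata and bytecode, allocate the output, run the UDF."""
    header_len = int.from_bytes(blob[:4], "little")
    metadata = json.loads(blob[4:4 + header_len].decode("utf-8"))
    code = marshal.loads(blob[4 + header_len:])

    # Allocate the output buffer from the stored dimensions and data type.
    count = 1
    for extent in metadata["dims"]:
        count *= extent
    typecode = {"float": "f", "double": "d"}[metadata["dtype"]]
    output = array(typecode, [0.0] * count)

    # Hand the buffer to the "engine" and let the UDF populate it.
    namespace = {"output": output, "dims": metadata["dims"], "math": math}
    exec(code, namespace)
    namespace["dynamic_dataset"]()
    return metadata, output

blob = store_udf(UDF_SOURCE, "SineWave", (100, 10), "double")
metadata, grid = read_udf(blob)
print(metadata["name"], len(grid), "values generated from", len(blob), "stored bytes")
```

The point of the model is that only the compiled script and a small metadata header are stored; the grid values exist only in the buffer that is filled at read time.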

We configure LuaJIT’s sandbox so that only a few core modules can be accessed by the UDF script: math, string, table, and others. We try to keep that list as small as possible to reduce the likelihood of malicious UDF scripts attempting to escape the sandbox. On the other hand, we understand that some users may want to use UDFs in controlled environments, where security rules could be more relaxed. We’re actively working on this topic at the moment.

One particular feature that’s currently disabled due to security concerns is access to the network module. Even though it allows really cool things (such as creating a dataset that, under the hood, connects to a weather forecasting service and retrieves the most recent temperature or precipitation grids), we wanted to keep this initial version stricter.

Thanks for your feedback!
Lucas


#4

Lucas, I was able to follow your instructions under Debian 10 with the latest HDF5 development branch. Very nice!

I think there’s a minor error in the documentation. If I run hdf5-udf sine_wave.h5 sine_wave.lua SineWave:100x10:float32, it complains that

Datatype 'float32' is not supported
Failed to parse string 'SineWave:100x10:float32'

However, hdf5-udf sine_wave.h5 sine_wave.lua SineWave:100x10:float and hdf5-udf sine_wave.h5 sine_wave.lua SineWave:100x10:double work just fine.
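
For reference, the behaviour above is consistent with a Name:dims:type parser along the following lines. This is a guess written in Python for illustration, not the tool's actual code: `parse_spec`, `SpecError`, and the `SUPPORTED_TYPES` set are hypothetical, and the set lists only the two type names confirmed to work in the commands above.

```python
# Only "float" and "double" are confirmed by the commands above; the real
# tool may accept more type names than this assumed set.
SUPPORTED_TYPES = {"float", "double"}

class SpecError(ValueError):
    """Raised when a dataset specification string cannot be parsed."""

def parse_spec(spec):
    """Split 'Name:AxB:type' into (name, dims, dtype), validating each part."""
    parts = spec.split(":")
    if len(parts) != 3 or not parts[0]:
        raise SpecError(f"Failed to parse string '{spec}'")
    name, dims_text, dtype = parts

    # Dimensions are written as extents joined by 'x', for example 100x10.
    try:
        dims = tuple(int(extent) for extent in dims_text.split("x"))
    except ValueError:
        raise SpecError(f"Failed to parse string '{spec}'") from None
    if any(extent <= 0 for extent in dims):
        raise SpecError(f"Failed to parse string '{spec}'")

    if dtype not in SUPPORTED_TYPES:
        raise SpecError(f"Datatype '{dtype}' is not supported")
    return name, dims, dtype

print(parse_spec("SineWave:100x10:float"))
```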

Best, G.


#5

It looks like I mixed up float and int32 when writing the example. Thanks for the fix!

Best regards,
Lucas