HDF5-UDF 1.0 released


#1

Greetings!

I’m happy to announce the first stable release of HDF5-UDF. HDF5-UDF is a tool that enables the procedural generation of datasets using user-defined functions (UDFs). It is possible to access the network and connect to sensors, web services, and more – and expose that data as HDF5 datasets that regular tools can readily consume.

This first release provides support for UDFs in Python, C/C++, and Lua – all of which execute in a sandboxed environment that restricts access to system resources. Moreover, HDF5-UDF is shipped as a filter, which means that no modifications are needed to applications that read from HDF5 files.

The project comes with several examples that should help you get started. It’s even possible to embed the classic Doom game as an HDF5 dataset, for instance!

Please refer to the installation page for details on how to install it from the binary package or to build it from the source code.

I hope you enjoy it!
Lucas


#2

This sounds interesting. Is it possible to access data stored in a compound type either directly via lib.getData or by indexing the variables representing the datasets?


#3

Thanks! I have a local branch with a work in progress that adds support for compound datasets. What I do, underneath, is look up the compound member names/types and convert them into a C-like structure that can be used to iterate over and access specific entries. For instance, the following compound:

HDF5 "example-compound.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE  H5T_COMPOUND {
         H5T_STD_I64LE "Serial number";
         H5T_IEEE_F64LE "Temperature (F)";
         H5T_IEEE_F64LE "Pressure (inHg)";
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
   }
}
}

is automatically converted into the following structure (mind the member name changes so that they become valid C variable names):

struct compound_ds1 {
  int64_t serial_number; 
  double temperature; 
  double pressure; 
};
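
To make the renaming concrete, here is a minimal sketch in Python of one rule that reproduces the names above. It is only an illustration: the helper name to_c_identifier and the exact rule (drop parenthesized units, lowercase, turn everything else into underscores) are assumptions, not necessarily what HDF5-UDF does internally.

```python
import re

def to_c_identifier(member_name: str) -> str:
    """Hypothetical helper: map a compound member name to a C-style identifier."""
    # Drop parenthesized units such as "(F)" or "(inHg)".
    name = re.sub(r"\s*\([^)]*\)", "", member_name)
    # Lowercase, then collapse every run of other characters into one underscore.
    name = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # C identifiers cannot start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name

for member in ("Serial number", "Temperature (F)", "Pressure (inHg)"):
    print(member, "->", to_c_identifier(member))
```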

It’s possible, then, to create a shiny new UDF that uses that structure. In the example below the UDF exports the Temperature member of that compound as a dynamic dataset:

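extern "C" void dynamic_dataset()
{
    // Input: the compound dataset, viewed through the generated structure
    auto compound = lib.getData<compound_ds1>("DS1");
    // Output: buffer and dimensions of the dataset produced by this UDF
    auto udf_data = lib.getData<float>("Temperature");
    auto udf_dims = lib.getDims("Temperature");

    // Copy the temperature member of each record into the output
    for (size_t i=0; i<udf_dims[0]; ++i)
        udf_data[i] = compound[i].temperature;
}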

There are some minor details that I’m still working on. For instance, HDF5 allows the memory layout to be different from the disk layout, and for the time being the code assumes them to be the same. I’ll look into fixing that some time today. HDF5-UDF also doesn’t deal with big endian byte ordering (as of yet). Lastly, there’s no support for nested compounds or for compound members that are declared as arrays. None of these are actual limitations in the code base; it’s more of a matter of implementing them.


#4

Just as an update, I’ve been able to complete handling the memory layout vs disk layout when converting the compound into a C structure. This means that, when needed, proper padding will be automatically introduced into the structure, as in this example:

struct compound_dataset1 {
  int64_t serial_number; 
  char _pad0[16];
  double temperature; 
  double pressure; 
};
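
To see how the padding shifts the member offsets, the same layout can be mirrored with Python's standard ctypes module. This is just a sketch for inspecting offsets; the class name CompoundDataset1 is made up for illustration and is not something HDF5-UDF generates.

```python
import ctypes

class CompoundDataset1(ctypes.Structure):
    # Mirrors the padded C structure above, member by member.
    _fields_ = [
        ("serial_number", ctypes.c_int64),
        ("_pad0", ctypes.c_char * 16),   # explicit 16-byte gap
        ("temperature", ctypes.c_double),
        ("pressure", ctypes.c_double),
    ]

# Byte offset of each member within the structure.
offsets = {name: getattr(CompoundDataset1, name).offset
           for name, _ in CompoundDataset1._fields_}
print(offsets)
print(ctypes.sizeof(CompoundDataset1))
```

With the 16-byte pad in place, temperature starts at byte 24 rather than byte 8, which is the kind of gap that appears when the compound's layout leaves unused space between members.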

This initial prototype has support for compounds when writing UDFs in C/C++. Python and Lua will be handled in a future changeset.

If you’re interested in trying it out, please refer to this Git branch. Instructions on how to build from source are given here. I also wrote an example UDF that shows how to extract a compound member and expose it as if it were a regular dataset. For convenience, this is how it looks:

HDF5 "example-compound.h5" {
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE  H5T_COMPOUND {
         H5T_STD_I64LE "Serial number";
         H5T_IEEE_F64LE "Temperature (F)";
         H5T_IEEE_F64LE "Pressure (inHg)";
      }
      DATASPACE  SIMPLE { ( 1000 ) / ( 1000 ) }
   }
   DATASET "Temperature" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1000 ) / ( 1000 ) }
   }
}
}

Dataset1 is the compound taken as input to the UDF, and Temperature is dynamically generated by the sample source code I just mentioned.

I hope this is useful to you somehow.


#5

Thanks so much. This is really promising, since I was looking for a way to do this for quite some time. I’m in the middle of a hundred things, so won’t be able to test it this week, but hopefully I can have a chance to get your code and try it with one of our datasets next week. I’ll get back to you.

Cheers


#6

Sounds good. I have just added support for compounds to the Python backend, too, so it’s gonna be easier for you to try it out once you have a chance:

def dynamic_dataset():
    compound = lib.getData("Dataset1")
    udf_data = lib.getData("Temperature")
    udf_dims = lib.getDims("Temperature")

    for i in range(udf_dims[0]):
        udf_data[i] = compound[i].temperature
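
If you'd like to dry-run that loop before installing anything, a stand-in for the lib object is enough. Everything below other than dynamic_dataset itself is hypothetical: FakeLib, the Record tuple, and the sample values are made up for this sketch and are not part of HDF5-UDF.

```python
from collections import namedtuple

# Hypothetical record type standing in for one compound entry.
Record = namedtuple("Record", ["serial_number", "temperature", "pressure"])

class FakeLib:
    """Hypothetical stand-in for the `lib` object that HDF5-UDF provides."""
    def __init__(self, datasets):
        self._datasets = datasets

    def getData(self, name):
        return self._datasets[name]

    def getDims(self, name):
        # One-dimensional datasets only, which is all this sketch needs.
        return [len(self._datasets[name])]

lib = FakeLib({
    "Dataset1": [Record(1, 53.2, 24.6), Record(2, 55.1, 22.9), Record(3, 49.9, 24.4)],
    "Temperature": [0.0, 0.0, 0.0],  # output buffer filled by the UDF
})

def dynamic_dataset():
    compound = lib.getData("Dataset1")
    udf_data = lib.getData("Temperature")
    udf_dims = lib.getDims("Temperature")

    for i in range(udf_dims[0]):
        udf_data[i] = compound[i].temperature

dynamic_dataset()
print(lib.getData("Temperature"))
```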