HDF5-UDF 1.0 released

Greetings!

I’m happy to announce the first stable release of HDF5-UDF. HDF5-UDF is a tool that enables the procedural generation of datasets using user-defined functions (UDFs). It is possible to access the network and connect to sensors, web services, and more – and expose that data as HDF5 datasets that regular tools can readily consume.

This first release provides support for UDFs in Python, C/C++, and Lua – all of which execute in a sandboxed environment that restricts access to system resources. Moreover, HDF5-UDF is shipped as a filter, which means that no modifications are needed to applications that read from HDF5 files.

The project comes with several examples that should help you get started. It’s even possible to embed the classic Doom game as an HDF5 dataset, for instance!
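To give a flavor of what a UDF looks like, the sketch below follows the Python calling convention shown later in this thread: a `dynamic_dataset()` entry point that reads and writes through a `lib` object injected by HDF5-UDF. The `_StubLib` class is a stand-in of mine so the snippet can run on its own, outside HDF5-UDF; it is not part of the project's API.

```python
# Sketch of the Python UDF calling convention. The real `lib` object is
# injected by HDF5-UDF at runtime; the stub below is a stand-in so the
# snippet can run on its own.

class _StubLib:
    """Minimal stand-in for the `lib` object HDF5-UDF injects."""
    def __init__(self):
        self._dims = {"VirtualDataset": [4]}
        self._data = {"VirtualDataset": [0] * 4}

    def getData(self, name):
        return self._data[name]

    def getDims(self, name):
        return self._dims[name]

lib = _StubLib()

def dynamic_dataset():
    # HDF5-UDF calls this entry point to fill in the dataset's values.
    udf_data = lib.getData("VirtualDataset")
    udf_dims = lib.getDims("VirtualDataset")
    for i in range(udf_dims[0]):
        udf_data[i] = i * 2

dynamic_dataset()
print(lib.getData("VirtualDataset"))  # [0, 2, 4, 6]
```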

Please refer to the installation page for details on how to install it from the binary package or to build it from the source code.

I hope you enjoy it!
Lucas



This sounds interesting. Is it possible to access data stored in a compound type either directly via lib.getdata or by indexing the variables representing the datasets?

Thanks! I have a local branch with work in progress that adds support for compound datasets. Under the hood, it looks up the compound member names/types and converts them into a C-like structure that can be used to iterate over and access specific entries. For instance, the following compound:

HDF5 "example-compound.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE  H5T_COMPOUND {
         H5T_STD_I64LE "Serial number";
         H5T_IEEE_F64LE "Temperature (F)";
         H5T_IEEE_F64LE "Pressure (inHg)";
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
   }
}
}

is automatically converted into the following structure (note that the member names are changed so that they become valid C identifiers):

struct compound_ds1 {
  int64_t serial_number; 
  double temperature; 
  double pressure; 
};
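The name mangling above ("Serial number" becoming serial_number, "Temperature (F)" becoming temperature) could be sketched like this. Note this is a hypothetical helper of mine, not HDF5-UDF's actual implementation: it drops parenthesized units, lowercases, and collapses non-alphanumeric runs into underscores.

```python
import re

def to_c_identifier(member_name: str) -> str:
    """Turn an HDF5 compound member name into a valid C identifier.
    Hypothetical sketch of the mangling described above, not
    HDF5-UDF's actual code."""
    name = re.sub(r"\(.*?\)", "", member_name)        # "Temperature (F)" -> "Temperature "
    name = re.sub(r"[^0-9A-Za-z]+", "_", name.strip().lower())
    return name.strip("_")

print(to_c_identifier("Serial number"))    # serial_number
print(to_c_identifier("Temperature (F)")) # temperature
print(to_c_identifier("Pressure (inHg)")) # pressure
```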

It’s possible, then, to create a shiny new UDF that uses that structure. In the example below the UDF exports the Temperature member of that compound as a dynamic dataset:

extern "C" void dynamic_dataset()
{
    auto compound = lib.getData<compound_ds1>("DS1");
    auto udf_data = lib.getData<double>("Temperature");
    auto udf_dims = lib.getDims("Temperature");

    for (size_t i=0; i<udf_dims[0]; ++i)
        udf_data[i] = compound[i].temperature;
}

There are some minor details that I’m still working on. For instance, HDF5 allows the memory layout to be different from the disk layout, and for the time being the code assumes them to be the same. I’ll look into fixing that some time today. HDF5-UDF also doesn’t handle big-endian byte ordering yet. Last, there’s no support for nested compounds or for compound members declared as arrays. None of these are fundamental limitations of the code base; they’re just a matter of implementation effort.

Just as an update, I’ve finished handling the memory-layout vs. disk-layout differences when converting the compound into a C structure. This means that, when needed, proper padding is automatically introduced into the structure, as in this example:

struct compound_dataset1 {
  int64_t serial_number; 
  char _pad0[16];
  double temperature; 
  double pressure; 
};
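The padding rule above can be pictured as follows: whenever a member's on-disk offset is larger than the running size of the structure so far, a char array fills the hole. The helper below is a sketch of mine (not HDF5-UDF's actual code) using ctypes to make the arithmetic concrete; the offsets 0/24/32 are chosen to reproduce the _pad0[16] case shown above.

```python
import ctypes

def pad_fields(members):
    """Given (name, ctype, disk_offset) triples, emit ctypes fields with
    explicit padding so the in-memory layout matches the disk layout.
    Hypothetical sketch, not HDF5-UDF's actual implementation."""
    fields, cursor, n = [], 0, 0
    for name, ctype, offset in members:
        if offset > cursor:  # hole in the disk layout -> insert padding
            fields.append((f"_pad{n}", ctypes.c_char * (offset - cursor)))
            n += 1
            cursor = offset
        fields.append((name, ctype))
        cursor += ctypes.sizeof(ctype)
    return fields

# Disk layout with a 16-byte hole after the first member: the int64 sits
# at offset 0, the doubles at offsets 24 and 32.
fields = pad_fields([
    ("serial_number", ctypes.c_int64,  0),
    ("temperature",   ctypes.c_double, 24),
    ("pressure",      ctypes.c_double, 32),
])
print([f[0] for f in fields])
# ['serial_number', '_pad0', 'temperature', 'pressure']
```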

This initial prototype has support for compounds when writing UDFs in C/C++. Python and Lua will be handled in a future changeset.

If you’re interested in trying it out, please refer to this Git branch. Instructions on how to build from source are given here. I also wrote an example UDF that shows how to extract a compound member and expose it as if it were a regular dataset. For convenience, this is how it looks:

HDF5 "example-compound.h5" {
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE  H5T_COMPOUND {
         H5T_STD_I64LE "Serial number";
         H5T_IEEE_F64LE "Temperature (F)";
         H5T_IEEE_F64LE "Pressure (inHg)";
      }
      DATASPACE  SIMPLE { ( 1000 ) / ( 1000 ) }
   }
   DATASET "Temperature" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1000 ) / ( 1000 ) }
   }
}
}

Dataset1 is the compound taken as input to the UDF, and Temperature is dynamically generated by the sample source code I just mentioned.

I hope this is useful to you somehow.

Thanks so much. This is really promising, since I was looking for a way to do this for quite some time. I’m in the middle of a hundred things, so won’t be able to test it this week, but hopefully I can have a chance to get your code and try it with one of our datasets next week. I’ll get back to you.

Cheers


Sounds good. I have just added support for compounds to the Python backend, too, so it’s gonna be easier for you to try it out once you have a chance:

def dynamic_dataset():
    compound = lib.getData("Dataset1")
    udf_data = lib.getData("Temperature")
    udf_dims = lib.getDims("Temperature")

    for i in range(udf_dims[0]):
        udf_data[i] = compound[i].temperature
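Under the hood, compound[i].temperature amounts to reading one member of a C-like record. A self-contained way to picture this (a sketch of mine using ctypes, not the actual HDF5-UDF binding) is:

```python
import ctypes

class CompoundDS1(ctypes.Structure):
    # Mirrors the struct generated from the compound shown earlier.
    _fields_ = [("serial_number", ctypes.c_int64),
                ("temperature",   ctypes.c_double),
                ("pressure",      ctypes.c_double)]

# A small in-memory "dataset" of three records.
compound = (CompoundDS1 * 3)()
for i, temp in enumerate((71.5, 72.0, 73.2)):
    compound[i].serial_number = 1000 + i
    compound[i].temperature = temp

# Extract one member across all records, as the UDF loop above does.
udf_data = [rec.temperature for rec in compound]
print(udf_data)  # [71.5, 72.0, 73.2]
```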

Hi,
just had time now to try HDF5-UDF. I installed it inside a singularity container, and tried one of the examples in the code, which worked as expected:

Singularity> h5dump -d VirtualDataset example-add_datasets.h5 | head -n 10
HDF5 "example-add_datasets.h5" {
DATASET "VirtualDataset" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 100, 50 ) / ( 100, 50 ) }
   DATA {
   (0,0): 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
Singularity>

But when I try the same on the host, I get an “unable to print data” error. Any idea what I could be missing?

angelv@sieladon:~/.../Utilities/HDF5-UDF$ h5dump -d Dataset1 example-add_datasets.h5 | head -n 6
HDF5 "example-add_datasets.h5" {
DATASET "Dataset1" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 100, 50 ) / ( 100, 50 ) }
   DATA {
   (0,0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
angelv@sieladon:~/.../Utilities/HDF5-UDF$ h5dump -d VirtualDataset example-add_datasets.h5 | head -n 6
h5dump error: unable to print data
HDF5 "example-add_datasets.h5" {
DATASET "VirtualDataset" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 100, 50 ) / ( 100, 50 ) }
   DATA {
   }

Many thanks

Hi!

One thing you didn’t mention is which UDF you’re using (i.e., is it based on Python, Lua, or C/C++?). C++ requires libstdc++ on the host, Lua requires LuaJIT, and Python requires libpython3.x.so. In any case, I would have expected an error message to be shown when one of the dependencies could not be resolved. I’ll have a look to make sure I’m not missing any error reporting.

Hi,
this was the Python example.

On the host I’m using Python 3.8, installed via Anaconda, and the libpython3 library is there:

angelv@sieladon:~/.../Utilities/HDF5-UDF$ python --version
Python 3.8.6
angelv@sieladon:~/.../Utilities/HDF5-UDF$ conda env list
# conda environments:
#
                         /home/angelv/.julia/conda/3
base                  *  /home/angelv/local/prog_langs/anaconda3
chianti                  /home/angelv/local/prog_langs/anaconda3/envs/chianti
chianti8                 /home/angelv/local/prog_langs/anaconda3/envs/chianti8
python_hpc               /home/angelv/local/prog_langs/anaconda3/envs/python_hpc

angelv@sieladon:~/.../Utilities/HDF5-UDF$ find /home/angelv/local/prog_langs/anaconda3/lib -name 'libpython3*'
/home/angelv/local/prog_langs/anaconda3/lib/libpython3.so
/home/angelv/local/prog_langs/anaconda3/lib/libpython3.8.so
/home/angelv/local/prog_langs/anaconda3/lib/libpython3.8.so.1.0
angelv@sieladon:~/.../Utilities/HDF5-UDF$

Could you try to export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/angelv/local/prog_langs/anaconda3/lib/ and retry? Also, do you have HDF5_PLUGIN_PATH set to the location where libhdf5-udf.so is saved?

Out of curiosity, which version of Python are you running inside the container? Another useful test is to run the C++ example so we know if the problem is indeed related to Python or not.

Hi, the issue was the HDF5_PLUGIN_PATH, which I had not set outside the container. After setting that, this example works just fine.

I will try to do more complex tests in the coming days and I’ll let you know if I have any issues.

Many thanks,


It’s great that it’s working for you now! Please let me know how it goes. Have fun!