HDF5-UDF 1.2 released, allowing translation of CSV to HDF5



Hello there, folks!

I’m happy to announce the availability of HDF5-UDF 1.2. You probably remember it, but as a reminder: the tool lets you embed routines written in C/C++, Python, or Lua in HDF5 files, so that those routines execute each time the dataset is read.

This new release comes with some exciting new features over the previous version:

  • Support for outputting string datatypes
  • Support for outputting compound datatypes, which may include string elements and native datatypes
  • 1-based indexing in the Lua API, in line with the language's conventions

What’s really special about support for generating compounds and strings is that it makes it possible, and easy, to translate CSV to HDF5 on-the-fly, as the example below shows.

Please refer to the project page on GitHub for download instructions and more examples. Feedback is welcome as usual, and pull requests even more so!

CSV to HDF5 with HDF5-UDF

Snippet of albumlist.csv
Number,Year,Album,Artist,Genre
1,1967,Sgt. Peppers Lonely Hearts Club Band,The Beatles,Rock
2,1966,Pet Sounds,The Beach Boys,Rock
3,1966,Revolver,The Beatles,Rock
User-Defined Function (yes, it’s this simple)
def dynamic_dataset():
    udf_data = lib.getData("GreatestAlbums")
    with open("albumlist.csv") as f:
        # Skip the header
        f.readline()

        for i, line in enumerate(f):
            # Split the line using "," as separator; this plain split
            # assumes that no field contains an embedded comma
            elements = line.rstrip("\n").split(",")

            # Generate compound members on-the-fly
            udf_data[i].id = int(elements[0])
            udf_data[i].year = int(elements[1])
            lib.setString(udf_data[i].album, elements[2])
            lib.setString(udf_data[i].artist, elements[3])
            lib.setString(udf_data[i].genre, elements[4])
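
The plain comma split above works for the rows shown, but it would mis-split a field that itself contains a comma. Below is a minimal sketch of a more defensive parser built on Python's standard `csv` module. The helper name `parse_albums` is hypothetical, the second sample row is made up to show a quoted comma, and it is an assumption that the `csv` module can be imported from within an HDF5-UDF routine. The helper is written so it can be tried outside of HDF5-UDF first.

```python
import csv
import io


def parse_albums(stream):
    """Return a list of (id, year, album, artist, genre) tuples.

    `stream` is any iterable of text lines, such as an open file.
    The first row is treated as the header and skipped.
    """
    reader = csv.reader(stream)
    next(reader, None)  # Skip the header, tolerating an empty input
    rows = []
    for record in reader:
        if not record:
            continue  # Ignore blank lines
        number, year, album, artist, genre = record[:5]
        rows.append((int(number), int(year), album, artist, genre))
    return rows


# The first data row comes from albumlist.csv; the second is a
# hypothetical row, made up here to show a quoted field with a comma.
sample = io.StringIO(
    "Number,Year,Album,Artist,Genre\n"
    "1,1967,Sgt. Peppers Lonely Hearts Club Band,The Beatles,Rock\n"
    '99,2000,"Example, With Comma",Some Artist,Rock\n'
)
albums = parse_albums(sample)
```

Inside the User-Defined Function, each tuple could then be assigned to `udf_data[i]` with the same member assignments and `lib.setString` calls used above.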
Command to embed the User-Defined Function in the HDF5 file
$ hdf5-udf file.h5 dynamic_dataset.py \
  'GreatestAlbums:{id:int32,year:int16,album:string(40),artist:string,genre:string}:500'
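
The datatype specification in the command above fixes the dataset's shape ahead of time: 500 entries, a 16-bit `year`, and an `album` member declared as `string(40)`. A small pre-flight check along the following lines can flag CSV rows that would not fit before the function is embedded. This is only a sketch. The function name `check_rows` is hypothetical, the limits are copied from the command above, and it assumes that `string(40)` allows at most 40 bytes of UTF-8 text. How HDF5-UDF itself reacts to oversized values is not covered here.

```python
MAX_ROWS = 500          # Dataset dimension from the command line
MAX_ALBUM_BYTES = 40    # From album:string(40); assumed to count UTF-8 bytes
INT16_MIN, INT16_MAX = -32768, 32767  # Value range of year:int16


def check_rows(rows):
    """Return a list of human-readable problems found in `rows`.

    Each row is an (id, year, album, artist, genre) tuple.
    """
    problems = []
    if len(rows) > MAX_ROWS:
        problems.append(f"{len(rows)} rows exceed the {MAX_ROWS} declared entries")
    for i, (_, year, album, _, _) in enumerate(rows):
        if not INT16_MIN <= year <= INT16_MAX:
            problems.append(f"row {i}: year {year} does not fit in int16")
        if len(album.encode("utf-8")) > MAX_ALBUM_BYTES:
            problems.append(f"row {i}: album name exceeds {MAX_ALBUM_BYTES} bytes")
    return problems


rows = [
    (1, 1967, "Sgt. Peppers Lonely Hearts Club Band", "The Beatles", "Rock"),
    (2, 1966, "Pet Sounds", "The Beach Boys", "Rock"),
]
issues = check_rows(rows)  # An empty list means every row fits
```

The `artist` and `genre` members are left unchecked because the command declares them as plain `string`, without an explicit size.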
First few entries of the dynamically generated HDF5 dataset
$ h5dump -O -d /GreatestAlbums file.h5
   (0): {
         1,
         1967,
         "Sgt. Peppers Lonely Hearts Club Band",
         "The Beatles",
         "Rock"
      },
   (1): {
         2,
         1966,
         "Pet Sounds",
         "The Beach Boys",
         "Rock"
      },
   (2): {
         3,
         1966,
         "Revolver",
         "The Beatles",
         "Rock"
      },
    ...