Unable to create a compound datatype with variable length byte arrays


#1

I am trying to create a compound datatype with variable-length arrays. A file created with h5py, which I have tested and which works, looks like this in h5dump:

HDF5 "camera_data.h5py" {
GROUP "/" {
   GROUP "observations" {
      DATASET "0" {
         DATATYPE  H5T_COMPOUND {
            H5T_IEEE_F64LE "timestamp";
            H5T_ARRAY { [1] H5T_VLEN { H5T_STD_U8LE} } "bgr";
            H5T_ARRAY { [1] H5T_VLEN { H5T_STD_U8LE} } "d";
         }
         DATASPACE  SIMPLE { ( 39000 ) / ( 39000 ) }
         DATA {
         (0): {
               1.60865e+09,
               [ (255, 216, 255, 224, 0, 16, 74, 70, 73, 70, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 255, 219, 0, 67, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 4, 3, 2, 2, 2, 2, 5, 4, 4, 3, 4, 6, 5, 6, 6, 6, 5, 6, 6, 6, 7, 9, 8, 6, 7, 9, 7, 6, 6, 8, 11, 8, 9, 10, 10, 10, 10, 10, 6, 8, 11, 12, 11, 10, 12, 9, 10, 10, 10, 255, 219, 0, 67, 1, 2, 2, 2, 2, 2, 2, 5, 3, 3, 5, 10, 7, 6, 7, 
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 255, 192, 0, 17, 8, 1, 224, 3, 80, 3, 1, 34, 0, 2, 17, 1, 3, 17, 1, 255, 196, 0, 31, 0, 0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 255, 196,
0, 181, 16, 0, 2, 1, 3, 3, 2, 4, 3, 5, 5, 4, 4, 0, 0, 1, 125, 1, 2, 3, 0, 4, 17, 5, 18, 33, 49, 65, 6, 19, 81, 97, 7, 34, 113, 20, 50, 129, 145, 161, 8, 35, 66, 177, 193, 21, 82, 209, 240, 36, 51, 98, 114, 130, 9, 10, 22, 23, 24, 25, 26, 37, 38, 39, 40, 41, 42, 52, 53, 54, 55, 56, 57, 58, 67, 68, 69, 70, 71, 72, 73, 74, 83, 84, 85, 86, 87, 88, 89, 90, 99, 100, 101, 102, 103,
104, 105, 106, 115, 116, 117,

Can you please tell me how to create such a compound datatype using the HDF5 C++ API? I couldn't find an example to repurpose, so I am asking here:

DATATYPE H5T_COMPOUND {
   H5T_IEEE_F64LE "timestamp";
   H5T_ARRAY { [1] H5T_VLEN { H5T_STD_U8LE} } "bgr";
   H5T_ARRAY { [1] H5T_VLEN { H5T_STD_U8LE} } "d";
}


#2

Hi @manish.kochhal,

One way to create such a compound dataset is with HDFql (assuming that you are not constrained to the HDF5 C++ API), as follows:

HDFql::execute("create file camera_data.h5py");

HDFql::execute("create dataset camera_data.h5py observations/0 as compound(timestamp as double, bgr as unsigned vartinyint(1), d as unsigned vartinyint(1))(39000)");

Hope it helps!


#3

Unfortunately, I cannot use proprietary software such as HDFql. I am using the H5Cpp interface and want to know how to use it to create a compound datatype that has one double and two variable-length byte arrays.


#4

You can implement this custom functionality through the seamless C interop between H5CPP and the HDF5 C API. The project is under the MIT license and comes with an LLVM-based, compiler-assisted introspection tool.

The currently uploaded version doesn’t yet support what you are asking out of the box, but it can easily be written in C and then used from C++.

Are you trying to load/save a massive dataset this way? Is there an alternative representation, perhaps with fixed-size array fields? Fixed-size arrays are supported by the compiler tool; you don’t have to do anything.


#5

I am trying to save RGB data and depth data coming from a camera every 33 ms into an HDF5 file. The RGB is 3 bytes/pixel and depth is 2 bytes/pixel. Depending on the chosen resolution, I can probably fix the size of the array.

However, if I save PNG-compressed RGB images and ZSTD-compressed (or ZIP-compressed) depth data into the HDF5 file, the byte array sizes will differ from frame to frame, depending on the image and depth content and on the compression efficiency. Since we want to save storage space, we will try to store compressed data in the HDF5 file, but for that I need to accommodate variable-length byte arrays.
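To put rough numbers on the raw (uncompressed) case, here is a small sketch of the arithmetic. The 640x480 resolution is only an assumed example, not the actual camera setting, and the function names are made up for illustration.

```cpp
#include <cstdint>

// Raw bytes for one frame: 3 bytes/pixel for RGB plus 2 bytes/pixel for depth.
constexpr std::uint64_t raw_frame_bytes(std::uint64_t width, std::uint64_t height) {
    return width * height * (3 + 2);
}

// Raw bytes per second when one frame arrives every period_ms milliseconds.
constexpr double raw_bytes_per_second(std::uint64_t width, std::uint64_t height,
                                      double period_ms) {
    return static_cast<double>(raw_frame_bytes(width, height)) * 1000.0 / period_ms;
}
```

For the assumed 640x480 resolution, `raw_frame_bytes(640, 480)` gives 1,536,000 bytes per frame, and at one frame every 33 ms that is roughly 46.5 MB per second of raw data.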


#6

I was able to use the H5Cpp interface to generate an HDF5 format such as this:

HDF5 "camera_2020-12-22_08h08m05.h5" {
    GROUP "/" {
       GROUP "observations" {
          DATASET "0" {
             DATATYPE  H5T_COMPOUND {
                H5T_IEEE_F64LE "timestamp";
                H5T_VLEN { H5T_STD_U8LE} "bgr";
                H5T_VLEN { H5T_STD_U8LE} "d";
             }
             DATASPACE  SIMPLE { ( 100 ) / ( 100 ) }
             DATA {
             (0): {
                   257.261,
                   (41, 76, 2, 117, 168, 122, 116, 119, 96, 86, 25, 183, 242, 115, 235, 130),
                   (202, 45, 13, 43, 245, 97, 45, 30, 3, 133, 226, 162, 182, 241, 89, 194)
                },

However, you will see that the data-type in the compound datatype for the variable length array shows up as:

H5T_VLEN { H5T_STD_U8LE} "bgr";
H5T_VLEN { H5T_STD_U8LE} "d";

instead of:

        H5T_ARRAY { [1] H5T_VLEN { H5T_STD_U8LE} } "bgr";
        H5T_ARRAY { [1] H5T_VLEN { H5T_STD_U8LE} } "d";

The C++ code looks like this:

#include <cstdint>   // uint8_t
#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>


const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "0";
const std::string group_name = "observations";
const std::string file_name = "camera_2020-12-22_08h08m05.h5";

struct CompoundData {
    double timestamp;
    hvl_t  bgr_values;
    hvl_t  depth_values;
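    // start with empty variable-length descriptors instead of leaving them uninitialized
    CompoundData(double ts) : timestamp(ts), bgr_values{0, nullptr}, depth_values{0, nullptr} {}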
};


int main () {
    /*
     * Create the named file, truncating the existing one if any,
     * using default create and access property lists.
    */
    H5::H5File* file = new H5::H5File(file_name, H5F_ACC_TRUNC);

    /*
     * Create a group in the file
    */
    H5::Group* group = new H5::Group( file->createGroup(group_name));
    
    H5::DataSpace *dataspace = new H5::DataSpace(n_dims, &n_rows);

    // target dtype for the file
    H5::CompType data_type(sizeof(CompoundData));
    data_type.insertMember("timestamp", HOFFSET(CompoundData, timestamp), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("bgr", HOFFSET(CompoundData, bgr_values), H5::VarLenType(H5::PredType::NATIVE_UCHAR));
    data_type.insertMember("d", HOFFSET(CompoundData, depth_values), H5::VarLenType(H5::PredType::NATIVE_UCHAR));

    H5::DataSet* dataset = new H5::DataSet(file->createDataSet("/"+group_name+"/"+dataset_name, data_type, *dataspace));

    // one vector holding the actual data
    std::vector<std::vector<uint8_t>> bgr_values;
    bgr_values.reserve(n_rows);
    std::vector<std::vector<uint8_t>> depth_values;
    depth_values.reserve(n_rows);

    // and one holding the hdf5 description and the "simple" columns
    std::vector<CompoundData> data;
    data.reserve(n_rows);

    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 255.0);
    std::poisson_distribution<hsize_t> poisson(20);

    for (hsize_t idx = 0; idx < n_rows; idx++) {
        hsize_t size = poisson(gen);
        bgr_values.emplace_back();
        depth_values.emplace_back();
        bgr_values.at(idx).reserve(size);
        depth_values.at(idx).reserve(size);

        for (hsize_t i = 0; i < size; i++) {
            bgr_values.at(idx).push_back((int)normal(gen));
            depth_values.at(idx).push_back((int)normal(gen));
        }

        // set len and pointer for the variable length descriptor
        data.emplace_back(normal(gen));
        // data() is valid even if the vector is empty, whereas front() is not
        data.at(idx).bgr_values.len = size;
        data.at(idx).bgr_values.p = bgr_values.at(idx).data();
        data.at(idx).depth_values.len = size;
        data.at(idx).depth_values.p = depth_values.at(idx).data();
    }

    dataset->write(&data.front(), data_type);

    delete dataset;
    delete dataspace;
    delete group;
    delete file;

    return 0;
}

#7

How so?

An image sensor has a fixed size, so the simplest and fastest data model, applying the law of parsimony, is a hypercube with an N-bit element type.

An extendable dataset is organised into chunks and, as you will find once you have perused the C API (and the source code), chunked datasets support compression.
This packet table comes with a fast implementation that does exactly that.

Then again, you seem to know what you are doing, so I don’t want to interrupt.
Best wishes, Steven


#8

If I save raw images and raw depth data in HDF5, then I can fix the array sizes easily (as discussed before). However, that is not efficient in terms of storage space. So we need to compress the image and depth data, which leads to variable-sized byte arrays (i.e. the compressed output). This is where we need the ability to specify variable-length byte arrays in a compound datatype for storage in an HDF5 dataset.


#9

Thanks for the packet table example. I did not know that compression was supported.