C++: read an HDF5 compound (struct) dataset in which each field is a vector


#1

See the title.
I am having difficulty loading a compound dataset from my h5 file in which each field is a vector, using the HDF5 C++ API (not C).

Here is what my compound dataset looks like:
typedef struct {
    std::vector<double> FieldA, FieldB, FieldC;
} TestH5;
TestH5 data;
data.FieldA = {1.0, 2.0, 3.0, 4.0}; // FieldB and FieldC are similar

Can any HDF5 C++ expert point me to how to read this TestH5 into the backend using C++?
(Do I need to allocate memory for each field in C++? Do I need a for loop?)

Thanks a lot


#2

The expression below is a [templated] class datatype in C++ whose data is placed in non-contiguous memory locations, so it requires scatter-gather operators and a mechanism to disassemble and reassemble the components. Because of this complexity, AFAIK there is no automatic support for this sort of operation.

template <typename T>
struct TestH5{
    std::vector<T> FieldA, FieldB, FieldC;
};

The structure above may be modelled in HDF5 in the following ways:

  • /group/[fieldA, fieldB, fieldC] fast indexing by columns, more complex and slower indexing by rows; also easier read/write from julia/python/R/C/ etc…
  • by a vector of tuples: std::vector<std::tuple<T,T,T>> where you work with a single dataset, fast indexing by rows and slower indexing by columns
  • exotic custom solution based on direct chunk write/read: fast indexing of blocks by row and column wise at the increased complexity of the code.
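To make the first two options concrete, here is a minimal sketch of the two in-memory layouts and a conversion between them. It uses only the standard library; the names columns_t, row_t and to_rows are made up for illustration and are not part of HDF5 or H5CPP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Option 1: one vector per field (column-wise); would map to one dataset per field.
struct columns_t {
    std::vector<double> A, B, C;
};

// Option 2: one record per row; a std::vector<row_t> would map to a single
// rank 1 dataset of a compound type.
struct row_t {
    double A, B, C;
};

// Hypothetical helper: re-pack columns into rows.
// Assumes all three columns have the same length.
std::vector<row_t> to_rows(const columns_t& cols) {
    assert(cols.A.size() == cols.B.size() && cols.B.size() == cols.C.size());
    std::vector<row_t> rows;
    rows.reserve(cols.A.size());
    for (std::size_t i = 0; i < cols.A.size(); ++i)
        rows.push_back(row_t{cols.A[i], cols.B[i], cols.C[i]});
    return rows;
}
```

With the column layout, reading all of A is one contiguous access, while reading row i touches three separate vectors; with the row layout it is the other way round.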

H5CPP provides mechanisms for the first two solutions:

TestH5<int> data = {std::vector<int>{1,2,3,4}, std::vector<int>{5,6,7}, std::vector<int>{8,9,10}};

h5::fd_t fd = h5::create("example.h5",H5F_ACC_TRUNC);
// member names must match the struct definition above (FieldA, not fieldA)
h5::write(fd, "/some_path/fieldA", data.FieldA);
h5::write(fd, "/some_path/fieldB", data.FieldB);
h5::write(fd, "/some_path/fieldC", data.FieldC);

OK, the above is simple and well behaved. The second solution needs a POD struct backing, as tuples are not supported in the current H5CPP version (the upcoming one will support arbitrary STL):

struct my_t {
   int fieldA;
   int fieldB;
   int fieldC;
};

You can have any data types in arbitrary combination in the POD struct, as long as it qualifies as a POD type in C++. This approach involves H5CPP’s LLVM-based, compiler-assisted reflection (a long word, I know; sorry about that). The bottom line: you need the type descriptor, and this compiler generates it for you without you lifting a pinky.
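If you are unsure whether a struct qualifies, the standard type traits can tell you at compile time. The sketch below is only an illustration: is_pod_like and not_pod_t are hypothetical names, and the check (standard layout plus trivially copyable) is my reading of "POD" here, not something the H5CPP compiler is documented to run.

```cpp
#include <type_traits>
#include <vector>

struct my_t {              // same shape as the struct above
    int fieldA;
    int fieldB;
    int fieldC;
};

struct not_pod_t {         // hypothetical counter-example: owns heap memory
    std::vector<int> fieldA;
};

// Hypothetical helper: true when T can be copied byte-for-byte and has a
// predictable field order, which is what a compound type descriptor relies on.
template <class T>
constexpr bool is_pod_like() {
    return std::is_standard_layout<T>::value &&
           std::is_trivially_copyable<T>::value;
}

static_assert(is_pod_like<my_t>(), "my_t can back a compound dataset element");
static_assert(!is_pod_like<not_pod_t>(), "a std::vector member breaks the byte-wise mapping");
```

The second assertion is the reason a struct of std::vector fields cannot be used directly as the element type of a compound dataset.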

std::vector<my_t> data;
h5::fd_t fd = h5::create("example.h5",H5F_ACC_TRUNC);
h5::write(fd, "some_path/some_name", data);

This approach is often used in event recorders, hence the h5::append operator to help you out:

h5::ds_t ds = h5::open(...);
for(const auto& event: event_provider)
   h5::append(ds, event);

Both layouts are used to model sparse matrices: the second resembles the COO (coordinate) format, whereas the first is used for the Compressed Sparse Row|Column formats.

slides are here, the examples are here.
best wishes: steve


#3

Hi Steve
Thanks for the quick reply

I was able to save/write compound data, but I cannot load/read that compound dataset back.

Is there any way to read a vector-based field back?
Can an H5 read load a vector-based field from a compound dataset directly?


#4

H5CPP IO operators are symmetric: all reads and writes are invertible. AFAIK: there is no ‘vector based field’ in HDF5.

  • array datatype: this is how the H5CPP newgen maps std::array<T,N>
  • variable length types, this is how H5CPP newgen maps std::vector<T,Container> where Container:={std::vector, std::string, ... }
  • fixed length types: struct | elementary types

layouts:

  • compact: H5CPP newgen maps initializer lists: {1,3,4,5,5} or anything below 64K if requested, or if we know at compile time that it will not grow beyond 64K
  • contiguous:
  • chunked data of some type: this is how H5CPP currently maps std::vector | linalg | T

If you upload a small trivial dataset I will write an example for you.
best: steve


#5

h5ex_t_cmpd_03_15_22.h5 (6.5 KB)

Hi Steven,
Here is the sample dataset.
Dataset ‘DS1’ is the compound dataset; it has 5 members:
FieldA is a double-based vector
FieldB is a string vector
FieldArray is a std::vector<std::vector<double>>, where each element is a double vector of length 5
FieldC is a double vector
FieldArray2 has the same type as FieldArray

This h5 file was written from MATLAB; in MATLAB, vectors and arrays are treated the same.

Is there any way I could load this h5 into C++ code using the HDF5 C++ APIs?

Best and thanks a lot for the help

EDIT: here is what the structure looks like:

HDF5 h5ex_t_cmpd_03_15_22.h5 
Group '/' 
    Dataset 'DS1' 
        Size:  4
        MaxSize:  4
        Datatype:   H5T_COMPOUND
            Member 'FieldA':  H5T_IEEE_F64LE (double)
            Member 'FieldB':  H5T_STRING
                String Length: variable
                Padding: H5T_STR_NULLTERM
                Character Set: H5T_CSET_ASCII
                Character Type: H5T_C_S1
            Member 'FieldArray':  H5T_ARRAY
                Size: 5
                Base Type:  H5T_IEEE_F64LE (double)
            Member 'FieldC':  H5T_IEEE_F64LE (double)
            Member 'FieldArray2':  H5T_ARRAY
                Size: 5
                Base Type:  H5T_IEEE_F64LE (double)
        ChunkSize:  []
        Filters:  none

#6

Can you modify FieldB so that it is a fixed-length character string, or factor it out entirely? This can be done directly, but this descriptor has a non-POD representation in C++, meaning you will need scatter/gather.

so can you do:

template <typename T, int M, int N>
struct my_t {
   T FieldA;
   char FieldB[M] ;
   T Array_01[N];
   T Array_02[N];
};

or is it fixed, and you have to live with it?


#7

Sure thing, I could factor this out as a separate dataset or as attributes; we can skip the string part.

EDIT: here is the updated h5 with no string field (only FieldA and FieldB as double arrays, and FieldArray and FieldArray2 as arrays of arrays)
h5ex_t_cmpd_03_15_22_UpdatedNoString.h5 (2.5 KB)


#8

Will fix this later; you can download the project from this GitHub page.

#include <iostream>
#include <vector>
#include "struct.h"
#include <h5cpp/core>
	// generated file must be sandwiched between core and io 
	// to satisfy template dependencies in <h5cpp/io>  
	#include "generated.h"
#include <h5cpp/io>


int main(){
	h5::fd_t fd = h5::create("test.h5", H5F_ACC_TRUNC);
	{ // this is to create the dataset
		h5::ds_t ds = h5::create<sn::record_t>(fd, "/path/dataset", h5::max_dims{H5S_UNLIMITED} );
		// vector of strings as attribute:
		ds["attribute"] = {"first","second","...","last"};
		
		h5::pt_t pt = ds; // convert to packet table, you could go straight from vector as well
		for(int i=0; i<3; i++)
			h5::append(pt,
			// this is your pod struct 
			sn::record_t{1.0 * i, 2.0 *i ,{1,2,3,4,5},{11,12,13,14,15}});
	}

	{ // read entire dataset back
		h5::ds_t ds = h5::open(fd, "/path/dataset");

		std::vector<std::string> attribute = h5::aread<
			std::vector<std::string>>(ds, "attribute");
		std::cout << attribute <<std::endl;
		// dump data
		for( auto rec: h5::read<std::vector<sn::record_t>>(ds, "/path/dataset")) // this is your HPC loop
			std::cerr << rec.A <<" ";
		std::cerr << std::endl;
	}
}

the generated type descriptor:

/* Copyright (c) 2018 vargaconsulting, Toronto,ON Canada
 *     Author: Varga, Steven <steven@vargaconsulting.ca>
 */
#ifndef H5CPP_GUARD_NKohX
#define H5CPP_GUARD_NKohX

namespace h5{
    //template specialization of sn::record_t to create HDF5 COMPOUND type
    template<> hid_t inline register_struct<sn::record_t>(){
        hsize_t at_00_[] ={5};            hid_t at_00 = H5Tarray_create(H5T_NATIVE_DOUBLE,1,at_00_);
        hsize_t at_01_[] ={5};            hid_t at_01 = H5Tarray_create(H5T_NATIVE_DOUBLE,1,at_01_);

        hid_t ct_00 = H5Tcreate(H5T_COMPOUND, sizeof (sn::record_t));
        H5Tinsert(ct_00, "A",	HOFFSET(sn::record_t,A),H5T_NATIVE_DOUBLE);
        H5Tinsert(ct_00, "B",	HOFFSET(sn::record_t,B),H5T_NATIVE_DOUBLE);
        H5Tinsert(ct_00, "array_00",	HOFFSET(sn::record_t,array_00),at_00);
        H5Tinsert(ct_00, "array_01",	HOFFSET(sn::record_t,array_01),at_01);

        //closing all hid_t allocations to prevent resource leakage
        H5Tclose(at_00); H5Tclose(at_01); 

        //if not used with h5cpp framework, but as a standalone code generator then
        //the returned 'hid_t ct_00' must be closed: H5Tclose(ct_00);
        return ct_00;
    };
}
H5CPP_REGISTER_STRUCT(sn::record_t);

#endif

And here is the example dataset:

HDF5 "test.h5" {
GROUP "/" {
   GROUP "path" {
      DATASET "dataset" {
         DATATYPE  H5T_COMPOUND {
            H5T_IEEE_F64LE "A";
            H5T_IEEE_F64LE "B";
            H5T_ARRAY { [5] H5T_IEEE_F64LE } "array_00";
            H5T_ARRAY { [5] H5T_IEEE_F64LE } "array_01";
         }
         DATASPACE  SIMPLE { ( 3 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): {
               0,
               0,
               [ 1, 2, 3, 4, 5 ],
               [ 11, 12, 13, 14, 15 ]
            },
         (1): {
               1,
               2,
               [ 1, 2, 3, 4, 5 ],
               [ 11, 12, 13, 14, 15 ]
            },
         (2): {
               2,
               4,
               [ 1, 2, 3, 4, 5 ],
               [ 11, 12, 13, 14, 15 ]
            }
         }
         ATTRIBUTE "attribute" {
            DATATYPE  H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_UTF8;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
            DATA {
            (0): "first", "second", "...", "last"
            }
         }
      }
   }
}
}


#9

Thank you @steven for the above example
I have noticed one thing: it seems you have split the vectors FieldA and FieldB across the structs.
It was struct.FieldA = {1,2,…,n};
with the above example it has become struct(1).FieldA = 1, struct(2).FieldA = 2, …, struct(n).FieldA = n.

May I ask why you have split the vector?

I appreciate the help.


#10

Below is the dataset type dump of the uploaded h5ex_t_cmpd_03_15_22_UpdatedNoString.h5. In the dump we have a compound type in a contiguous rank 1 data space of size 4. The compound type is made of two fields of atomic type, FieldA and FieldB, and two fields of array type, each of length 5 with elements of the H5T_IEEE_F64LE atomic type.

HDF5 doesn’t recognise a vector type, a data structure of rank 1; instead it provides a mechanism to represent homogeneous datasets of rank 0 up to rank 32[^1] through data spaces. This data space is represented in your uploaded dataset as DATASPACE SIMPLE { ( 4 ) / ( 4 ) }, which gives the rank (1), the current dimension (4) and the maximum dimension (4).

C++ is a language with a set of popular data structures such as the std::vector<T, ...> template; other libraries such as armadillo, eigen3, blitz, and blaze also support rank 1 data structures, or vectors. In H5CPP, by default, rank 1 datasets are mapped to HDF5 atomic or compound types with chunked layout, mimicking the behaviour of std::vector<T, ...>. The construct h5::read<std::vector<sn::record_t>>(ds, "/path/dataset") does the inverse map from HDF5 hyperslabs to std::vector<element_t>, where element_t happens to be a POD struct. Hope it makes sense?

HDF5 "h5ex_t_cmpd_03_15_22_UpdatedNoString.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE  H5T_COMPOUND {
         H5T_IEEE_F64LE "FieldA";  // <-- this is NOT a VECTOR
         H5T_IEEE_F64LE "FieldB";
         H5T_ARRAY { [5] H5T_IEEE_F64LE } "FieldArray";
         H5T_ARRAY { [5] H5T_IEEE_F64LE } "FieldArray2";
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) } // <-- nor is this a vector, it is a rank 1 dataset with 4 elements
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 384
         OFFSET 2144
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }
}
}

[^1]: Dataspace rank is restricted to 32, the standard limit in C on the rank of an array, in the current implementation of the HDF5 Library.
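As a sanity check on the dump above, the storage size can be reproduced from a C++ mirror of one DS1 element, and a per-field "vector" can be gathered from the records after reading. This is a standard-library-only sketch: the field names come from the dump, while record_t and gather_field_a are hypothetical names, and it assumes an 8-byte double.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One element of DS1, with field names taken from the type dump.
struct record_t {
    double FieldA;
    double FieldB;
    double FieldArray[5];
    double FieldArray2[5];
};

// 2 scalars + 2 arrays of 5 doubles = 12 doubles = 96 bytes per element;
// 4 elements give 384 bytes, matching "SIZE 384" in the dump.
static_assert(sizeof(double) == 8, "sketch assumes 8-byte doubles");
static_assert(sizeof(record_t) == 96, "one DS1 element");
static_assert(4 * sizeof(record_t) == 384, "whole DS1 dataset");

// Hypothetical helper: collect FieldA of every record into one vector,
// i.e. recover the MATLAB-style struct.FieldA = {a1, ..., an} view.
std::vector<double> gather_field_a(const std::vector<record_t>& records) {
    std::vector<double> column;
    column.reserve(records.size());
    for (const record_t& r : records)
        column.push_back(r.FieldA);
    assert(column.size() == records.size());
    return column;
}
```

So the "split" is just the on-disk view: each of the 4 dataset elements carries one FieldA value, and the vector is recovered by walking the records.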


#11

Thank you @steven for the explanation; I will give your suggested solution above a try.

It sounds to me as if the FieldA and FieldB arrays have been stored non-contiguously in memory?

(PS: It would be really nice if the STL could be supported in H5 later, since vector is so powerful in modern C++.)


#12

Nope, they are stored contiguously (in-memory). G.


#13

Then shouldn’t we extract the entire array contents at once for each field, for best performance?

Or is there no way to do a whole-array operation right now?

(Sorry guys, I am just trying to understand what the limits of H5 are.)


#14

The memory layout for most C++ objects is beyond the scope of this forum; with the exception of plain old datatype or POD structs and STL objects which guarantee contiguous layout. When it comes to POD structs not only the layout matters, but alignment as well. C++ has a relaxed approach to memory layouts, just because the fields are next to each other it doesn’t mean there are no gaps in between. In fact even POD struct fields are aligned to some boundary, unless you tell the compiler otherwise. (non standard layout types don’t even guarantee the order)

Here is how it could be stored with 64-bit alignment, using std::vector, which guarantees contiguous memory layout (FieldA, FieldB := A, B):
{ABxxxxxyyyyy}{ABxxxxxyyyyy}{...}{ABxxxxxyyyyy}
where {...} denotes a single element of the std::vector<sn::record_t>, and x and y are array elements.
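The layout and the point about gaps can be checked with offsetof and sizeof. In this sketch record_t mirrors the fields of sn::record_t as implied by the generated descriptor earlier in the thread (struct.h itself is not shown, so that is an assumption), and padded_t is a made-up counter-example.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Assumed shape of sn::record_t: A, B, then two arrays of 5 doubles.
struct record_t {
    double A;
    double B;
    double array_00[5];
    double array_01[5];
};

// All members are doubles, so the fields sit back to back: {ABxxxxxyyyyy}.
static_assert(offsetof(record_t, A) == 0, "A first");
static_assert(offsetof(record_t, B) == sizeof(double), "B right after A");
static_assert(offsetof(record_t, array_00) == 2 * sizeof(double), "x after A,B");
static_assert(offsetof(record_t, array_01) == 7 * sizeof(double), "y after x");
static_assert(sizeof(record_t) == 12 * sizeof(double), "no trailing padding");

// Hypothetical counter-example: mixing member sizes makes the compiler
// insert padding, so neighbouring fields are no longer adjacent in memory.
struct padded_t {
    char   tag;
    double value;
};

// Bytes inside padded_t that belong to no field.
constexpr std::size_t padding_bytes =
    sizeof(padded_t) - (sizeof(char) + sizeof(double));
static_assert(padding_bytes > 0, "gap between tag and value");
```

This is why the type descriptor records each member with HOFFSET rather than assuming the fields are packed.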


#15

thanks @steven for this quick reply


#16

Here is a solution with scatter/gather where you break up the IO ops then delegate them to H5CPP:

template <class T>
struct record_t {   // struct, so that A, B, C are publicly accessible
    std::vector<T> A, B, C;
};
namespace ns {
   template <class T, class... args_t> void write(const h5::fd_t fd,
           const std::string& path, const record_t<T>& payload,   args_t... args){
       h5::write(fd, path + "/A", payload.A, args...);
       h5::write(fd, path + "/B", payload.B, args...);
       h5::write(fd, path + "/C", payload.C, args...);
   }

   // fills the caller-supplied payload in place, hence the void return type
   template <class T, class... args_t> void read(const h5::fd_t fd, 
           const std::string& path, record_t<T>& payload,   args_t... args){
       h5::read(fd, path + "/A", payload.A, args...);
       h5::read(fd, path + "/B", payload.B, args...);
       h5::read(fd, path + "/C", payload.C, args...);
   }
}

#17

(After I stopped thinking in terms of vectors:)
Using a pointer to load the values (similar to solution 1 that @steven proposed above),
I am finally able to load what I want as a struct array, like struct(1).FieldA, struct(2).FieldA… (my h5 is a compound dataset with each field a 1xN or m x N array).

Thanks @steven for clarifying that vector is not supported currently, and I also appreciate all the help.

In the meantime, I will continue working on solution 2 that @steven proposed above, to try to load the data vector-like.


#18

Just to clarify: std::vector<T> is supported by H5CPP, as it has been from the very beginning (sometime around 2012). Non-standard-layout types are for now not supported by the h5cpp compiler; however, a specific solution can be written by any reasonably skilled C++ developer. My previous post is a possible solution with the calling pattern below; notice that the optional arguments may still be passed in arbitrary order.

ns::write(fd, "some/path", object, 
   [,const hsize_t* offset] [,const hsize_t* stride] ,const hsize_t* count [, const h5::dxpl_t dxpl ]); 

To avoid further confusion one side of the code snippet is re-posted here:

01 namespace ns {
02  template <class T, class... args_t> void write(const h5::fd_t fd,
03           const std::string& path, const record_t<T>& payload,   args_t... args){
04       h5::write(fd, path + "/A", payload.A, args...);
05       h5::write(fd, path + "/B", payload.B, args...);
06       h5::write(fd, path + "/C", payload.C, args...);
07 }

Lines 04-06 are trivial to write by hand (and always have been); breaking the payload up with metaprogramming and generating these lines automatically is another cup of tea.


#20

Please read the thread and find the section where I say H5CPP supports std::vector<T> and std::string; alternatively, check out these slides,
or you could just take my word for it. The STL is a bigger set than {std::vector<T>, std::string}; supporting all of it requires a more complex core, and as of now H5CPP doesn’t support the full STL. (We don’t even have an agreement on how to map std::map<K,V> across platforms, nor many other objects, but that doesn’t mean you can’t have a non-standardised, specific implementation with the technique I posted.)