C++ Read Compound dataset without taking static structure for different fieldnames


#1

I have multiple H5 files and I am writing generic code that needs to read a specific field, named in the input, whether it is an array, a double, or an int. I can’t use a fixed struct such as s1_t below, because it is defined for one specific H5 file.

typedef struct s1_t {
int a;
float b;
double c;
float arr[ARRSIZE];
} s1_t;

Let’s say one file has data like this:

image

Another file has data like this:

image

I am using Visual Studio 2019 on Windows with the C++17 standard and the HDF5 1.12.2 C++ wrapper.

Any help will be appreciated.

Thank you in advance.


#2

For a compile-time option, have a look at h5cpp-compiler. For runtime options, you’ll have to use the datatype introspection functions in the H5T module, i.e., discover the field names/types at runtime and then construct the type you want to read.
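
As a rough illustration of the runtime route, the sketch below plans an in-memory record that holds only the requested fields. It makes no HDF5 calls: Member, PlannedField, Plan and selectFields are hypothetical names, and the member sizes are assumed to have been discovered beforehand (for instance with H5Tget_nmembers, H5Tget_member_name, H5Tget_member_type and H5Tget_size). The resulting record size and offsets are what one would hand to H5Tcreate and H5Tinsert; since HDF5 matches compound members by name, a memory type built from a subset of the file’s members reads just those fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One compound member as discovered at runtime (hypothetical helper type).
struct Member {
    std::string name;
    std::size_t size;  // bytes occupied by the native member type
};

// A requested member plus its byte offset inside the reduced record.
struct PlannedField {
    std::string name;
    std::size_t size;
    std::size_t offset;
};

struct Plan {
    std::vector<PlannedField> fields;
    std::size_t recordSize = 0;  // total bytes per row in memory
};

// Lay out only the wanted members back to back, in the order requested.
// Throws std::invalid_argument if a requested name is not in the file type.
Plan selectFields(const std::vector<Member>& fileMembers,
                  const std::vector<std::string>& wanted) {
    Plan plan;
    for (const std::string& name : wanted) {
        auto it = std::find_if(fileMembers.begin(), fileMembers.end(),
                               [&](const Member& m) { return m.name == name; });
        if (it == fileMembers.end()) {
            throw std::invalid_argument("no such field: " + name);
        }
        assert(it->size > 0);  // a zero-sized member would be a discovery bug
        plan.fields.push_back({it->name, it->size, plan.recordSize});
        plan.recordSize += it->size;
    }
    return plan;
}
```

With members a (4 bytes), b (4 bytes) and c (8 bytes), asking for c and a gives a 12-byte record with c at offset 0 and a at offset 8. The layout is packed on purpose, so the buffer has to be parsed by offset rather than cast to a struct.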

G.


#3

Can be done, but IMHO this is a special case where you have to bear all the consequences of this non-standard approach. This means:

  • taking @gheber’s suggestion to use the H5T type discovery interface from the C API
  • hand-rolling the code up to the level where it looks ‘standard’
  • possible loss of performance

The alternative is to ‘hammer’ the inputs into one format until all of them have the same type/shape, because then others can see what you did. To do that I would use Python, Julia, anything that is flexible; of course in the right hands one can be as productive with C++ as with Python, but often this requires some time.

If I may ask:

  1. how many datasets are you dealing with?
  2. are they changing or static (i.e. provided once)?
  3. what is the size of the total bytes transferred?

steve


#4

Hi Gheber & Steven,

I have explored the datatype introspection functions in the H5T module to read the field names and field types at runtime, but I am stuck at constructing the memory type needed to read the fields using the existing syntax. A code snippet or example of how this is done in C++ would be helpful.

As Steven said, this will cost some performance, but we plan to read only specific fields to reduce the memory footprint, and to pick out specific rows of the dataset, store them in a vector, and send them to the GUI.

To answer Steven’s questions:

  1. We have hundreds of different datasets with different types of fields, as shown in the pictures, but they do not all need to be read at the same time; we only read one dataset at a time.
  2. Even though one file has at most 4-5 datasets, we will only read one dataset at a time, and all datasets differ from each other. We are working on code that can read as many different datatypes as possible, but only one dataset will be read at a time. For example, if I have 5 datasets A, B, C, D and E, I will only have to read one dataset, say A, at a time. The one common point is that all the datasets are of compound datatype; only the content changes.
  3. The total size of the bytes to be transferred can vary from 1 megabyte to 16 gigabytes.

#5

Hi @er_akhilesh15,

If you know how many different compound data types you have across your existing HDF5 files/datasets, you could declare one struct per type in your C++ code. Afterwards, you could pick the appropriate struct (using the H5T module as suggested by @gheber) depending on the compound data type of the dataset being processed.
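
One way to pick the struct at runtime, sketched here with standard C++ only, is to key a table of handlers on a signature built from the member names and types. Registry, signatureOf, the type-name strings and the S1 struct are illustrative assumptions, not part of HDF5; in a real program the (name, type) pairs would come from the H5T introspection calls and the rows from H5Dread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (member name, type name) as discovered at runtime; the type-name strings
// are a made-up convention for this sketch.
using MemberDesc = std::pair<std::string, std::string>;

// Build a lookup key such as "a:int32;b:float32;c:float64;".
std::string signatureOf(const std::vector<MemberDesc>& members) {
    std::string sig;
    for (const MemberDesc& m : members) {
        sig += m.first + ":" + m.second + ";";
    }
    return sig;
}

// A handler receives the raw rows and knows which struct to cast them to.
using Handler = std::function<void(const void* rows, std::size_t count)>;

class Registry {
public:
    void add(const std::vector<MemberDesc>& members, Handler handler) {
        assert(handler && "handler must be callable");
        handlers_[signatureOf(members)] = std::move(handler);
    }

    // Returns false when no struct was registered for this layout.
    bool dispatch(const std::vector<MemberDesc>& members,
                  const void* rows, std::size_t count) const {
        auto it = handlers_.find(signatureOf(members));
        if (it == handlers_.end()) return false;
        it->second(rows, count);
        return true;
    }

private:
    std::map<std::string, Handler> handlers_;
};

// Example struct for one known layout.
struct S1 {
    int a;
    float b;
    double c;
};
```

A handler registered for the a/b/c layout would cast the rows to const S1* and process them; an unknown layout makes dispatch return false, which is the point where a generic fallback would take over. The signature ignores offsets and padding, so the memory type passed to H5Dread still has to describe the struct’s real layout (for example with HOFFSET).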

If you don’t know a priori how many compound structures there are or what they are, or if there are too many (making it impractical to maintain numerous C++ structs), you could use HDFql and leverage its cursor functionality to solve this. A cursor in HDFql abstracts you from the underlying data type/structure details, making it easy to retrieve data without a priori knowledge. One can use HDFql helper functions to interpret/process the retrieved data afterwards. Example:

// retrieve data from dataset 'my_compound' (this dataset can be of any data type/structure)
HDFql::execute("SELECT FROM my_compound");

// traverse the cursor and display the data according to its data type/structure
// if a structure, its members will be flattened including members of nested structures
while (HDFql::cursorNext() == HDFql::Success)
{
    if (HDFql::cursorGetDataType() == HDFql::Int)
    {
        std::cout << "This is an integer: " << *HDFql::cursorGetInt() << std::endl; 
    }
    else if (HDFql::cursorGetDataType() == HDFql::Float)
    {
        std::cout << "This is a float: " << *HDFql::cursorGetFloat() << std::endl; 
    }
    else
    {
        std::cout << "Bypassing data" << std::endl; 
    }
}

Hope it helps!


#6

We talked about the issue at yesterday’s HDF clinic (see Tips, tricks, & insights). The datatype introspection algorithm goes like this:

  1. Retrieve the field’s in-file datatype via H5Tget_member_index and H5Tget_member_class/H5Tget_member_type
  2. Retrieve the in-memory (or native) datatype via H5Tget_native_type
  3. Determine the in-memory size via H5Tget_size
  4. Construct an in-memory compound datatype via H5Tcreate/H5Tinsert
  5. Allocate a buffer of the right size
  6. H5Dread & parse the buffer
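
Steps 3 to 6 can be sketched without linking HDF5 at all. The names Kind, Field, layoutPacked and decode below are hypothetical, and only three scalar types are handled (array members such as arr[ARRSIZE] are left out). In a real program each field’s name and kind would come from H5Tget_member_name and H5Tget_native_type, the record size and offsets would be passed to H5Tcreate and H5Tinsert, and the buffer would be filled by H5Dread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for what the H5T introspection calls report.
enum class Kind { Int32, Float32, Float64 };

struct Field {
    std::string name;        // member name discovered at runtime
    Kind kind;               // derived from the native member type
    std::size_t offset = 0;  // filled in by layoutPacked()
};

std::size_t sizeOf(Kind k) {
    switch (k) {
        case Kind::Int32:   return sizeof(std::int32_t);
        case Kind::Float32: return sizeof(float);
        case Kind::Float64: return sizeof(double);
    }
    return 0;  // not reached for valid Kind values
}

// Steps 3-4: assign packed offsets and return the size of one record.
std::size_t layoutPacked(std::vector<Field>& fields) {
    std::size_t offset = 0;
    for (Field& f : fields) {
        f.offset = offset;
        offset += sizeOf(f.kind);
    }
    return offset;
}

using Value = std::variant<std::int32_t, float, double>;

// Step 6: decode one field of one row from the raw byte buffer.
// memcpy is used because packed offsets may be unaligned.
Value decode(const std::vector<unsigned char>& buffer, std::size_t recordSize,
             std::size_t row, const Field& f) {
    assert((row + 1) * recordSize <= buffer.size());
    const unsigned char* p = buffer.data() + row * recordSize + f.offset;
    switch (f.kind) {
        case Kind::Int32:   { std::int32_t v; std::memcpy(&v, p, sizeof v); return v; }
        case Kind::Float32: { float v; std::memcpy(&v, p, sizeof v); return v; }
        case Kind::Float64: { double v; std::memcpy(&v, p, sizeof v); return v; }
    }
    return std::int32_t{0};  // not reached for valid Kind values
}
```

Step 5 is then a std::vector<unsigned char> of rows * recordSize bytes. Returning a std::variant keeps the caller generic: std::visit or std::get can route each value into a container of the matching type.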

G.


#7

Thank you @contact for the reply. As you said, I wanted to read data without any prior knowledge of the underlying datatypes, but I have found that the HDFql library files (.LIB and .DLL) are not compatible with my current version of the VC++ compiler, and the source code to build those DLLs is not available in the installed files. So I am now going back to @gheber’s idea of exploring the field types and declaring a different struct for each dataset; I would have to declare hundreds of structs to create and allocate the buffers and read the dataset fields.

Thank you all for the replies. If there is a community maintaining HDFql, I would urge them to provide DLL and LIB files for the newer versions of Visual Studio and the VC++ compiler, if possible.


#8

Hi @er_akhilesh15,

Thanks for the feedback.

It is indeed possible to use HDFql shared library (i.e. .dll) in a program compiled with any VC++ compiler version (including the one shipped with Visual Studio 19). What is unfortunately not possible is to use HDFql static library (i.e. .lib) in a program compiled with a compiler version different than the one used to compile the static library (currently, we distribute HDFql static libraries compiled in Visual Studio 2010, 2013 and 2015 for Windows).

Because of this limitation and the effort required to solve it (by compiling HDFql with many different compiler versions and for all the platforms it supports, namely Windows, Linux and macOS), we are planning to remove HDFql static libraries from future releases, which should ultimately cause fewer issues and less confusion.

Hope it helps!


#9

Hundreds of datasets is interesting: would it be possible to upload 5 - 10 of them, with a small amount of data in each, say 20 rows or so?
It is possible to generate the data structures for C (or arbitrary languages).
steve


#10

As the attached images show, we are getting the output described in your comments, @contact. But here we have the field names, e.g. a, b and c, and we want to get the data from a specific field and store it in a container of the appropriate datatype. Can we write a query like "SELECT a/b/c FROM ‘ms_data’"? Or, if you could provide a code snippet that reads a specific field’s data and stores it in a container, that would be helpful to us.


#11

Hi @er_akhilesh15,

Currently, HDFql doesn’t allow a per member operation when it comes to writing or reading a compound (as announced in section ISSUES of its release notes). In other words, HDFql only supports writing or reading all members of a compound for the time being.

That said, we have plans to extend the logic of the SELECT and INSERT operations to support a per member operation. It will be available in a new release of HDFql next year. Some examples that illustrate this extension:

 SELECT FROM compound.member1 ===> reads member 'member1' belonging to compound 'compound'

 INSERT INTO dset.m1.m2 VALUES(10) ===> writes value '10' into member 'm2' belonging to nested compound 'm1' of compound 'dset'

Hope it helps!