Reading compound datasets recursively

Hi,

I’ve been trying to solve a problem which requires me to read a compound datatype without knowledge of the underlying struct. I’m aware that this question has been asked on this forum in several forms over the years, but I haven’t been able to parse out the information that would allow me to solve the problem.

For now I’m assuming that the datatypes making up the compound dataset are integers, floats, or strings, all fixed-length. Essentially, I need to be able to visit every named field and read the data from the associated buffer. I have been able to get the lists of fields and their associated types, but any help on the read side would be appreciated.

In addition, you know the number of elements and their size. You’ll have to decide if you want to 1) read the fields into separate arrays or 2) read whole records.

The scheme for 1) goes like this:

FOREACH field F IN compound
  CREATE a compound type with F as its ONLY field
  (the type of the field is the native type T of the field type you have discovered for that field)
  ALLOCATE an array of type T and size of the number of elements you want to read
  CALL H5Dread with these arguments (in-memory compound, buffer, etc.)

As a result, you will have number-of-fields arrays of the right type and you’re good to go.

The scheme for 2) is potentially more error-prone, but not super-complicated:

DETERMINE the size S of the native version T of your compound 
ALLOCATE a byte array A of size S times the number of elements you want to read
CALL H5Dread with these arguments. 

As a result, you’ll have an array that has all the bytes you need (but not the structure!).
To parse this byte array you’ll need a few pre-defined reader functions that can read a certain number of bytes from an offset and convert them to the expected field type (integer, float, fixed-length string, etc.). If you have space, you can do that to construct separate arrays as under 1), but you’d potentially need twice the space.

Does that make sense/help?
G.

Thanks for the considered response!

That makes perfect sense. I believe the first scheme should work for my use case. I’ll report back here once I’ve completed the implementation.

Thanks again

Hi Gerd,

I was able to get all the numeric values read correctly, but I’m having some issues reading string datatypes. The following is the code I’ve been using to insert into the compound type and read into the associated array (rdata):

char **rdata;
memtype = H5Tcopy(H5T_C_S1);
H5Tset_size(memtype, H5T_VARIABLE);
test_cmpd = H5Tcreate(H5T_COMPOUND, sizeof(char *));
H5Tinsert(test_cmpd, H5Tget_member_name(native_type, i), 0, memtype);
rdata = (char **)malloc(dims[0] * sizeof(char *));
H5Dread(data, test_cmpd, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata);

The following are the errors that I’m getting.

HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: H5T.c line 1795 in H5Tequal(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: H5Tcompound.c line 354 in H5Tinsert(): unable to insert member
    major: Datatype
    minor: Unable to insert object
  #001: H5Tcompound.c line 446 in H5T__insert(): member extends past end of compound type
    major: Datatype
    minor: Unable to insert object
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: H5Dio.c line 199 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 467 in H5D__read(): unable to set up type info
    major: Dataset
    minor: Unable to initialize object
  #002: H5Dio.c line 983 in H5D__typeinfo_init(): unable to convert between src and dest datatype
    major: Dataset
    minor: Feature is unsupported
  #003: H5T.c line 4546 in H5T_path_find(): can't find datatype conversion path
    major: Datatype
    minor: Can't get value
  #004: H5T.c line 4762 in H5T__path_find_real(): no appropriate function for conversion path
    major: Datatype
    minor: Unable to initialize object

Any insight you can give on how to fix this would be greatly appreciated.

(Maybe for reference you can share the output of h5dump -pBH for that dataset with us?)

For starters, the code looks odd because you are using H5Tget_member_name to get the field name, but you construct the field type by hand. Why? (BTW, you have a memory leak right here, because you’re supposed to free (with H5free_memory!) the string returned by H5Tget_member_name.)

Use H5Tget_member_type (combined with H5Tget_native_type) to retrieve and insert the field type. For HDF5 string datatypes, you must check whether you are dealing with a fixed- or a variable-length datatype. This is crucial because the read buffer you allocate is either of type char* (fixed) or char** (variable). Since you told us in the original post that you are dealing with fixed-length strings, your current rdata and compound would be wrong.

G.

Here’s the dataset that I’m testing on.

h5dump -pBH testdata/ex_table_02.h5 
HDF5 "testdata/ex_table_02.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   DATASET "table" {
      DATATYPE  H5T_COMPOUND {
         H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         } "Name";
         H5T_STD_I32LE "Latitude";
         H5T_STD_I32LE "Longitude";
         H5T_IEEE_F32LE "Pressure";
         H5T_IEEE_F64LE "Temperature";
      }
      DATASPACE  SIMPLE { ( 10 ) / ( H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 10 )
         SIZE 400
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 6;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "FIELD_0_NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "FIELD_1_NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 9;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "FIELD_2_NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 10;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "FIELD_3_NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 9;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "FIELD_4_NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 12;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "TITLE" {
         DATATYPE  H5T_STRING {
            STRSIZE 12;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "VERSION" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
   }
}
}

Constructing the type from scratch was based on some other examples I had seen, rather than retrieving it from the dataset. I’ve omitted a number of if/else statements that check the types of the datasets I’m reading. The following precedes the code shown in the previous post:

memb_cls = H5Tget_member_class(native_type, i);
test_type = H5Tget_member_type(dtype, i);
else if(memb_cls == H5T_STRING){
      if(H5Tequal(test_type, H5T_VARIABLE)){

Apologies for the confusion around the use of “fixed length”: I meant that the number of messages for each field would be the same, not that the field types themselves would be fixed-length. So the solution will need to handle both fixed-length and variable-length strings.

Thanks for highlighting the memory leak; this will be fixed.

No problem. In this particular example,

DATATYPE H5T_COMPOUND {
    H5T_STRING {
        STRSIZE 16;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1; } "Name";
    ...
}

you are looking at a fixed-length string (STRSIZE 16).

I cannot recommend using H5Tequal to determine whether you are dealing with a variable-length string,
because it might give you false negatives. (There is more than one variable-length string datatype!) The function to use is H5Tis_variable_str (after determining the datatype class).

G.

Hi Gerd,

Thanks for all your help so far! I have that working well now thanks to your guidance. I’m now working on the write side of the interface.

Similar to the read example, I’m wondering if it is possible to generalise writes to compound datasets. I can create the correctly typed compound dataset template, but I don’t have the underlying struct; rather, the data in my case is a list of lists associated with each field, whose types are unknown from the interface’s perspective. Is it possible to write to a compound dataset by field name rather than writing from the expected struct?

Any information on solving this is greatly appreciated.

Yes, you can do the same thing in reverse, i.e., you can write a compound dataset field by field. Just define an in-memory compound type with a single field, make sure the name matches the field name in the dataset in the file, and do the write. Only the selected elements and field will be updated. (In HDF5, we call this “partial I/O.”)

Best, G.