Issue while using compound datatype with an array element (and h5dump output)

Dear All,

I am using parallel HDF5 to write compound datasets from biological
simulations. I am trying to use a compound datatype with an array element
whose size is only known at runtime (I found a similar question that was
never answered:
<http://hdf-forum.184993.n3.nabble.com/C-or-C-create-a-compound-datatype-of-double-arrays-with-runtime-size-determination-td4025678.html>).
I have searched the mailing list but didn't find a solution. I am providing
all the information along with a complete example; please let me know if I
am making a mistake below:

The compound datatype example in the tutorial
<https://www.hdfgroup.org/ftp/HDF5/examples/examples-by-api/hdf5-examples/1_8/C/H5T/h5ex_t_cmpd.c>
writes the dataset below (this is a simple *serial* test example):

typedef struct {
    double temperature;
    double pressure;
    char *location;
    int serial_no;
} sensor_t;

I am slightly modifying this example to write:

typedef struct {
    double temperature;
    double pressure;
    int *location; /* note: a compile-time fixed-size array
                      (e.g. int location[38]) works fine! */
    int serial_no;
} sensor_t;

Note that the length of the location array is known at runtime and is the
*same for all instances of sensor_t* (so I assume I don't need a
variable-length datatype, which is not supported by parallel HDF5 anyway).

So I create the compound datatype as:

    /* first create the datatype for the array element */
    locdims[0] = nloc; /* nloc is the size of the location array */
    strtype = H5Tarray_create(H5T_NATIVE_INT, 1, locdims);

    /* I add the size of the location array, otherwise I get an error:
     * H5T__insert(): member extends past end of compound type */
    memtype = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t) + nloc*sizeof(int));

    status = H5Tinsert(memtype, "Temperature (F)",
                HOFFSET(sensor_t, temperature), H5T_NATIVE_DOUBLE);
    status = H5Tinsert(memtype, "Pressure (inHg)",
                HOFFSET(sensor_t, pressure), H5T_NATIVE_DOUBLE);
    status = H5Tinsert(memtype, "Location", HOFFSET(sensor_t, location),
                strtype);

    /* for the serial_no field I can't use HOFFSET(sensor_t, serial_no)
     * because location is a pointer to a 1-d array, so I manually
     * calculate the offset as:
     */
    int serial_no_offset = HOFFSET(sensor_t, location) + nloc*sizeof(int);

    status = H5Tinsert(memtype, "Serial number",
                serial_no_offset, H5T_NATIVE_INT);

Now if I write and read the above dataset through the program, I get the
correct values! But if I look at the generated HDF5 file with h5dump, it
shows invalid values:

$ ./a.out

Dynamic Allocating nlocs 4

DS1[0]:
Serial number : 1153
Location : 0
Location : 10
Location : 20
Location : 30
Temperature (F) : 53.230000
Pressure (inHg) : 24.570000

DS1[1]:
Serial number : 1184
Location : 0
Location : 10
Location : 20
Location : 30
Temperature (F) : 55.120000
Pressure (inHg) : 22.950000

$ h5dump h5ex_t_cmpd.h5

HDF5 "h5ex_t_cmpd.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE H5T_COMPOUND {
         H5T_IEEE_F64LE "Temperature (F)";
         H5T_IEEE_F64LE "Pressure (inHg)";
         H5T_ARRAY { [4] H5T_STD_I32LE } "Location";
         H5T_STD_I32LE "Serial number";
      }
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): {
            53.23,
            24.57,
            [ -515879600, 32684, 1153, 32767 ],
            687194767
         },
      (1): {
            6.93572e-310,
            5.84974e-321,
            [ 0, 0, 4, 0 ],
            2
         }
      }
   }
}
}

I have attached the test program to this email. The fixed-size array
example works fine; you can compile the attached test as:

gcc h5ex_t_cmpd.c -DSTATIC -I/include path lib_link

To reproduce the above issue (with the dynamically allocated array):

gcc h5ex_t_cmpd.c -I/include path lib_link

Any help will be appreciated!

Regards,
Pramod

h5ex_t_cmpd.c (5.58 KB)

I am currently learning HDF5, specifically the C++ API, and have run into this exact problem. I know it’s been four years since this question was posted, but if the OP hasn’t yet discovered a good solution, I’d like to throw some ideas back and forth, or at least leave some options here for others who encounter the same problem in the future.

I believe the root of this problem is that pointers and arrays in C/C++ are not the same thing, even if they behave very similarly under most circumstances. We know that the name of a fixed-length array decays to a pointer when needed, i.e.:

int arr[3] = {1, 2, 3};
int* ptr = arr;

However, the reverse is not true:

int arr[3] = {1, 2, 3};
int* ptr = arr;
arr = ptr;  /* error: an array is not assignable */

will result in a compile-time error. What is happening in our case is that
the HDF5 library expects the array data to be laid out inline in the struct,
as it is for a fixed-length array, for members with the HDF5 array type, but
we're giving it a raw pointer.
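
To make the layout difference concrete, here is a minimal sketch (the struct names and printed sizes are illustrative; exact sizes are platform-dependent, shown here for a typical 64-bit system). With a pointer member, H5Dwrite copies the 8 pointer bytes and whatever follows them in the struct, not the array contents, which is exactly the garbage h5dump showed above:

#include <stdio.h>

typedef struct { double t; double p; int *loc;   int sn; } ptr_sensor_t;
typedef struct { double t; double p; int loc[4]; int sn; } arr_sensor_t;

int main(void) {
    /* With "int *loc" only the pointer itself (8 bytes on a 64-bit
     * system) sits inside the struct; the ints live elsewhere on the
     * heap.  With "int loc[4]" the 16 bytes of array data are part of
     * the struct, which is what an H5T_ARRAY member of an H5T_COMPOUND
     * type expects to find at HOFFSET(..., loc). */
    printf("sizeof(ptr_sensor_t) = %zu\n", sizeof(ptr_sensor_t)); /* typically 32 */
    printf("sizeof(arr_sensor_t) = %zu\n", sizeof(arr_sensor_t)); /* typically 40 */
    return 0;
}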

So far I have come up with a few workarounds, each with its own drawbacks. The first option is to declare the array member of your struct with a size greater than what you ever expect to need, e.g.

typedef struct {
    double temperature;
    double pressure;
    int location[256];
    int serial_no;
} sensor_t;

if you never expect to have more than 256 ints. Then, when you define the array datatype, you specify the desired length as normal:

strtype = H5Tarray_create(H5T_NATIVE_INT, 1, locdims);

You can then use HOFFSET to compute the offsets of all members of the compound type, as in the sketch below. This is probably the simplest workaround to our problem, but it has one major downside: the HDF5 library will allocate enough disk space to fit 256 ints even if you are only writing 3. This means that any files you write using this workaround will be bigger than they need to be.
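
For concreteness, a minimal sketch of this option, reusing the names from the example above (arrtype and memtype are illustrative; nloc is the runtime length, assumed to be at most 256):

hsize_t locdims[1] = { nloc };
hid_t arrtype = H5Tarray_create(H5T_NATIVE_INT, 1, locdims);

/* the compound spans the whole struct, so HOFFSET works for every
 * member; the unused tail of location[] is simply not described */
hid_t memtype = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t));
H5Tinsert(memtype, "Temperature (F)",
          HOFFSET(sensor_t, temperature), H5T_NATIVE_DOUBLE);
H5Tinsert(memtype, "Pressure (inHg)",
          HOFFSET(sensor_t, pressure), H5T_NATIVE_DOUBLE);
H5Tinsert(memtype, "Location", HOFFSET(sensor_t, location), arrtype);
H5Tinsert(memtype, "Serial number",
          HOFFSET(sensor_t, serial_no), H5T_NATIVE_INT);

If this same type is also passed to H5Dcreate as the file type, every record occupies the full sizeof(sensor_t) on disk, including the unused tail of location[], which is where the wasted space mentioned above comes from.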

The second option I’ve found is to change the member in question to a variable-length type. So, your struct changes to

typedef struct {
    double temperature;
    double pressure;
    hvl_t location; /* an hvl_t value, not a pointer */
    int serial_no;
} sensor_t;

and your HDF5 type definition changes to

strtype = H5Tvlen_create(H5T_NATIVE_INT);

You will then have to handle the memory for the location member as you would for any other variable-length type, including using H5Dvlen_reclaim. Unfortunately, this method also increases the size of the written files, but to a lesser degree than the first method.
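
A minimal sketch of this option under the struct above (vltype, memtype, dset, space, rdata, and nloc are illustrative names):

hid_t vltype  = H5Tvlen_create(H5T_NATIVE_INT);
hid_t memtype = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t));
H5Tinsert(memtype, "Temperature (F)",
          HOFFSET(sensor_t, temperature), H5T_NATIVE_DOUBLE);
H5Tinsert(memtype, "Pressure (inHg)",
          HOFFSET(sensor_t, pressure), H5T_NATIVE_DOUBLE);
H5Tinsert(memtype, "Location", HOFFSET(sensor_t, location), vltype);
H5Tinsert(memtype, "Serial number",
          HOFFSET(sensor_t, serial_no), H5T_NATIVE_INT);

/* fill one record: the hvl_t carries a length and a heap pointer */
sensor_t s;
s.location.len = nloc;
s.location.p   = malloc(nloc * sizeof(int));
/* ... fill s.location.p, then H5Dwrite(dset, memtype, ...) ... */
/* after reading, free the buffers the library allocated: */
/* H5Dvlen_reclaim(memtype, space, H5P_DEFAULT, rdata); */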

The third solution I encountered was to write and read the members of the compound datatype individually. That is, you first create the full compound datatype with all of its members and create the corresponding dataset object. Then you redefine the compound datatype with one named member at a time, at an offset of zero, and use H5Dwrite/H5Dread to write/read that field of the dataset. This method results in the smallest file size of the three methods I've found.
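
A minimal sketch of writing one field this way (dset, N, and temperatures are illustrative; the dataset is assumed to have been created beforehand with the full compound file type):

/* the memory type contains a single member at offset 0, so the
 * in-memory buffer is a plain array of that member's type; the
 * library scatters it into the matching field on disk */
double *temperatures = malloc(N * sizeof(double)); /* one per record */
/* ... fill temperatures ... */
hid_t ttype = H5Tcreate(H5T_COMPOUND, sizeof(double));
H5Tinsert(ttype, "Temperature (F)", 0, H5T_NATIVE_DOUBLE);
H5Dwrite(dset, ttype, H5S_ALL, H5S_ALL, H5P_DEFAULT, temperatures);
H5Tclose(ttype);
/* repeat with an analogous single-member type for each other field */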

I am considering other possible solutions, such as treating the data as character strings, but these three are all I have confirmed so far. I am also considering the impact of these methods on write/read times.

I don't know if there is an easier way to do it, but the issue with the original poster's code is that the dynamically allocated array is not stored in memory contiguously with the rest of the struct, while the fixed-length array is (ignoring padding for the moment). So the H5T_COMPOUND type, which expects a contiguous structure (again, ignoring padding), will not work in this case. To make it work, one can pack the H5T_COMPOUND datatype, to really get rid of any possible padding, and then create a serialized version of the wdata array, which can then be written to disk with no problem using H5Dwrite. To read, we do the reverse: read all the data on disk into a 1-D buffer, which can then be copied into rdata once we know the structure of the underlying data. It is not pretty, and I wish there were an easier way to do it, but I wouldn't know how to do it any other way.
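
A minimal sketch of the serialization step described above (N, nloc, wdata, dset, and memtype follow the example's names; memtype is assumed to be the packed compound type, so each record is 2*sizeof(double) + nloc*sizeof(int) + sizeof(int) bytes, which for nloc = 5 gives the 40-byte records shown by h5ls below):

/* pack the records one after another into a contiguous buffer whose
 * layout matches the packed compound type (no padding, no pointers) */
size_t recsize = 2*sizeof(double) + nloc*sizeof(int) + sizeof(int);
char *buf = malloc(N * recsize);
for (int i = 0; i < N; i++) {
    char *rec = buf + (size_t)i * recsize;
    memcpy(rec, &wdata[i].temperature, sizeof(double));
    memcpy(rec + sizeof(double), &wdata[i].pressure, sizeof(double));
    memcpy(rec + 2*sizeof(double), wdata[i].location, nloc*sizeof(int));
    memcpy(rec + 2*sizeof(double) + nloc*sizeof(int),
           &wdata[i].serial_no, sizeof(int));
}
H5Dwrite(dset, memtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
free(buf);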

This is what I get when running the code, which I attach as well. (If anybody can think of a more elegant way to do this, it would be great to know how). By the way, doing it this way, there is no wasted storage, as can be seen from the 100% utilization message when using h5ls.

Cheers,
Ángel de Vicente

[angelv@comer test_h5]$ h5pcc h5ex_t_cmpd.c
[angelv@comer test_h5]$ ./a.out 5

 Dynamic Allocating array 5
DS1[0]:
Serial number   : 1153
Location        : 0
Location        : 10
Location        : 20
Location        : 30
Location        : 40
Temperature (F) : 53.230000
Pressure (inHg) : 24.570000

DS1[1]:
Serial number   : 1184
Location        : 0
Location        : 10
Location        : 20
Location        : 30
Location        : 40
Temperature (F) : 55.120000
Pressure (inHg) : 22.950000

[angelv@comer test_h5]$ h5dump h5ex_t_cmpd.h5
HDF5 "h5ex_t_cmpd.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE  H5T_COMPOUND {
         H5T_IEEE_F64LE "Temperature (F)";
         H5T_IEEE_F64LE "Pressure (inHg)";
         H5T_ARRAY { [5] H5T_STD_I32LE } "Location";
         H5T_STD_I32LE "Serial number";
      }
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): {
            53.23,
            24.57,
            [ 0, 10, 20, 30, 40 ],
            1153
         },
      (1): {
            55.12,
            22.95,
            [ 0, 10, 20, 30, 40 ],
            1184
         }
      }
   }
}
}
[angelv@pilas test_h5]$ h5ls -vlr h5ex_t_cmpd.h5
Opened "h5ex_t_cmpd.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/DS1                     Dataset {2/2}
    Location:  1:800
    Links:     1
    Storage:   80 logical bytes, 80 allocated bytes, 100.00% utilization
    Type:      struct {
                   "Temperature (F)"  +0    native double
                   "Pressure (inHg)"  +8    native double
                   "Location"         +16   [5] native int
                   "Serial number"    +36   native int
               } 40 bytes
[angelv@pilas test_h5]$

h5ex_t_cmpd.c (4.7 KB)