Extendable Compound Datasets


#1

Hi, is it possible to extend a compound dataset, say, add a row to a table? Are there any examples around?

Regards,
/P


#2

No. The datatype is part of the so-called dataset creation properties, which cannot be changed after a dataset has been created. You can mimic this behavior, though, if you choose a “columnar layout”, i.e., instead of storing a dataset of records you store a group of columns. New column? -> Just add another column dataset.
Delete a column? -> You see where this goes. G.


#3

Thanks, that’s one way to go. I was wondering if it’s possible to have an array of compound datatypes? If so what would that look like in terms of setting it up?

I’m running a sim and getting data coming in of different types. I don’t know in advance how much data will arrive, so I can’t predict array sizes, etc.

Would it be possible to join one dataset to another in a new larger dataset?

say:
int index = 1;

//old dataset from sim data
dataset1
//new dataset created from sim data
dataset2

//create new dataset from dataset1 & dataset2
index++;
s1_tid = H5Tcreate(H5T_COMPOUND, sizeof(myDataStruct) * index);

//…and so on with every new dataset created?

Not sure if that would work, how would you write the data? Add the structs to an array and write the array out?

Any ideas would be appreciated.

Thanks again
/P


#4

OK, I misunderstood/misread. You don’t wanna add a column (modify the type), but add rows. Yes, that’s rather straightforward. Just create a so-called extendible dataset and add rows as you go. See here for an example. The example is for a scalar type but works the same way for compound or any other datatype.

OR have a look at H5CPP’s h5::append operator. See the scalar pod example.

G.


#5

Thanks, I’m on the right track; however, I’m having issues with the hyperslab setup. I’m not sure what values should be passed to selectHyperslab. I’ve got some test code, and I’m writing a struct of ints successfully; however, when I try to extend the dataset and write the new struct, it overwrites the old data and adds some garbage values. The structure itself is OK, though.

I was wondering if you could give your thoughts. I know I’m messing up the offsets or the memory space.

#include <iostream>
#include "hdf5.h"
#include "H5Cpp.h"

#define FILE    "h5ex_d_unlimadd.h5"
#define DATASET "DS1"

typedef struct s1_t {
    int a;
    int b;
} s1_t;

int main(void)
{
    hid_t file, dset, filetype; /* Handles */
    herr_t status;
    s1_t s1;

    // Initialize struct
    s1.a = 19;
    s1.b = 67;

    // Create a new file using the default properties.
    file = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // Create the compound datatype
    filetype = H5Tcreate(H5T_COMPOUND, sizeof(s1_t));
    H5Tinsert(filetype, "a", HOFFSET(s1_t, a), H5T_NATIVE_INT);
    H5Tinsert(filetype, "b", HOFFSET(s1_t, b), H5T_NATIVE_INT);

    const hsize_t ndims = 1;
    const hsize_t ncols = 1;

    hsize_t dims[ndims] = { 1 };
    hsize_t max_dims[ndims] = { H5S_UNLIMITED };
    hid_t file_space = H5Screate_simple(ndims, dims, max_dims);
    std::cout << "- Dataspace created" << std::endl;

    hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(plist, H5D_CHUNKED);
    hsize_t chunk_dims[ndims] = { 1 };
    H5Pset_chunk(plist, ndims, chunk_dims);
    std::cout << "- Property list created" << std::endl;

    // Create the unlimited dataset.
    dset = H5Dcreate(file, DATASET, filetype, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);
    std::cout << "- Dataset 'dset1' created " << dset << std::endl;

    status = H5Dwrite(dset, filetype, H5S_ALL, file_space, H5P_DEFAULT, &s1);

    // Read back the data, extend the dataset,
    // and write new data to the extended portion.

    // Open the file and get the dataset
    H5::H5File* file2 = new H5::H5File(FILE, H5F_ACC_RDWR, H5P_DEFAULT);
    H5::DataSet* dataset = new H5::DataSet(file2->openDataSet(DATASET));

    // New data to add to the dataset
    s1_t s2;
    s2.a = 98;
    s2.b = 55;

    dims[0] = 2;
    hid_t mem_space = H5Screate_simple(ndims, dims, NULL);

    H5Dset_extent(dset, dims);

    file_space = H5Dget_space(dset);
    hsize_t start[2] = { 0, 0 };     // Start of hyperslab
    hsize_t count[2] = { 2, ncols }; // Block count
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);

    H5Dwrite(dset, filetype, mem_space, file_space, H5P_DEFAULT, &s2);

    return 0;
}

The original data is overwritten with the values from struct 2, and some other values (2, 0) are written, obviously garbage:

HDF5 "h5ex_d_unlimadd.h5" {
GROUP "/" {
DATASET "DS1" {
DATATYPE H5T_COMPOUND {
H5T_STD_I32LE "a";
H5T_STD_I32LE "b";
}
DATASPACE SIMPLE { ( 2 ) / ( H5S_UNLIMITED ) }
DATA {
(0): {
98,
55
},
(1): {
2,
0
}
}
}
}
}

Whereas what I would like is something like:

HDF5 "h5ex_d_unlimadd.h5" {
GROUP "/" {
DATASET "DS1" {
DATATYPE H5T_COMPOUND {
H5T_STD_I32LE "a";
H5T_STD_I32LE "b";
}
DATASPACE SIMPLE { ( 2 ) / ( H5S_UNLIMITED ) }
DATA {
(0): {
19,
67
},
(1): {
98,
55
}
}
}
}
}

Thanks for your help, I’m getting there!
Regards,
/P


#6

Can you send us a source file (attachment) that we can compile and edit? Garbled fragments don’t cut it. It’s OK that your example may not produce what you intend, but we can work with something that compiles and runs.

A few comments:

  1. Don’t mix APIs in trivial examples unless that’s your point. It only muddies the water.
  2. Never (unless you really know what you are doing) set the chunk size to 1.
  3. In HDF5, we don’t do things “by rows” and thinking along those lines is like getting up on the wrong foot.
  4. There seems to be confusion about the dataset rank. Rank is the number of dimensions and is not to be confused with the extent of a particular dimension. If you want to mimic something that looks like a table, a stream of records, your rank is 1 and the extent of that dimension is the number of records.
  5. The fields of the compound datatype have nothing to do with the rank of a dataset. In HDF5, it is wrong to think of the fields, the “columns” as a second or additional dimension.

Give us a simple piece in C and we’ll get you on your way!

G.


#7

Hi Gerd,

Thanks for your help. I’ve attached a test.cpp file. Compiles, uses the C API. I’d like to keep a row structure such as :

 DATASPACE  SIMPLE { ( 2, 1 ) / ( H5S_UNLIMITED, 1 ) }
  DATA {
  (0,0): {
        98,
        55
     },
  (1,0): {
        2,
        0
     }

Only because I’m guided by a spec that outlines the data formatting. Thanks again for your insight into hdf5.

Regards,
/P

test.cpp (2.7 KB)


#8

Paul, I believe this is what you want. Right?
I encourage you to read Section 7.4 in the HDF5 User Guide (https://support.hdfgroup.org/HDF5/doc/UG/HDF5_Users_Guide-Responsive%20HTML5/index.html#t=HDF5_Users_Guide%2FDataspaces%2FHDF5_Dataspaces_and_Partial_I_O.htm%23TOC_7_4_Dataspaces_and_Databc-6&rhtocid=7.2)

Best, G.

test.cpp (1.9 KB)


#9

Yes thanks Gerd,
I’m now trying to iterate over the dataset and extend it on each iteration.

Thanks again,
/P


#10

Hi Gerd, I managed to successfully extend my dataset. I’m just getting errors in pandas when trying to open it:

src and dest dataspaces have different number of elements selected

Would that be due to my memspace setup?
Thanks,
/P


#11

Paul, I’m not sure what the question is. Most likely you know more about pandas than I do and I can’t help you with that. Yes, matching numbers of selected elements in the source and destination is a necessary condition for a transfer (read, write) to work. You can retrieve the number of selected elements on any dataspace via H5Sget_select_npoints. If they differ (between source and destination), you have a problem. How could I tell which, if any of the two, is the right number? G.