Extend the dataspace of an existing attribute

Hello,

Suppose you have stored an attribute that represents an array of several elements, where each element contains, for example, two fixed-size strings.

Is there an easy way to extend the dataspace of this attribute? I mean something similar to H5Dset_extent().

The way we currently do it is to temporarily store all the attribute information in memory, delete the attribute, and close the associated attribute identifier and attribute dataspace identifier. Next, we create the attribute again with the extended dataspace and store the original information in the “newly” created attribute. This works, but it feels a bit cumbersome just to extend the dataspace. So I am wondering whether this could be done more easily.
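For reference, this is roughly what our workaround looks like with the HDF5 C API (a simplified sketch with a plain integer attribute and placeholder names; our real element type is a small compound of two fixed-size strings, and error checking is omitted):

#include <stdlib.h>
#include "hdf5.h"

/* Sketch: "extend" an attribute on object `obj` from n to n+1 elements
   by reading it back, deleting it, and recreating it. */
herr_t extend_int_attribute(hid_t obj, const char *name, int new_value)
{
    hid_t    attr  = H5Aopen(obj, name, H5P_DEFAULT);
    hid_t    space = H5Aget_space(attr);
    hssize_t n     = H5Sget_simple_extent_npoints(space);

    /* 1. keep the old values in memory and append the new one */
    int *buf = malloc((n + 1) * sizeof(int));
    H5Aread(attr, H5T_NATIVE_INT, buf);
    buf[n] = new_value;

    /* 2. delete the old attribute */
    H5Sclose(space);
    H5Aclose(attr);
    H5Adelete(obj, name);

    /* 3. recreate it with the extended dataspace and write everything back */
    hsize_t dims[1] = { (hsize_t)(n + 1) };
    space = H5Screate_simple(1, dims, NULL);
    attr  = H5Acreate2(obj, name, H5T_NATIVE_INT, space, H5P_DEFAULT, H5P_DEFAULT);
    herr_t status = H5Awrite(attr, H5T_NATIVE_INT, buf);

    H5Aclose(attr);
    H5Sclose(space);
    free(buf);
    return status;
}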

Best regards,
Jan-Willem

Jan-Willem, how are you? No, there is no H5Aset_extent function at this time. I think you have at least two options to mimic the behavior for which you are looking.
0. (Naming convention) Multiple attributes instead of an “attribute array.” How many are there?

  1. (VLEN) If your array is one-dimensional, you could use a variable-length sequence data type for your attribute. The only downside would be that you have to write the entire sequence, even when you are just adding one element. Also, for multidimensional attribute arrays, you’d have to linearize and keep track of the dimensions.
  2. (Indirection) Instead of storing the (array) value in an attribute, you could store it in an (extendible) dataset and instead store an object reference to that “attribute dataset” in the original attribute. It won’t look as pretty in your profile, perhaps a little opaque, but it will do the trick and give you benefits such as partial I/O or “shared attributes.” (BTW, the newer object references can refer to HDF5 objects (groups, datasets, committed datatypes) in other files.) A rough sketch of this indirection follows below.
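To make option 2 a little more concrete, here is a rough sketch using the C API and the older-style object references (all names, sizes, and the integer element type are placeholders; error checking omitted):

#include "hdf5.h"

/* Sketch of option 2: keep the (growable) values in a chunked dataset
   and store only an object reference to it in an attribute. */
void attach_reference_attribute(hid_t file, hid_t target_obj)
{
    /* an extendible 1-D dataset that holds the "attribute" values */
    hsize_t dims[1]    = { 7 };
    hsize_t maxdims[1] = { H5S_UNLIMITED };
    hsize_t chunk[1]   = { 64 };
    hid_t   space = H5Screate_simple(1, dims, maxdims);
    hid_t   dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t   dset  = H5Dcreate2(file, "/attr_values", H5T_NATIVE_INT, space,
                               H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* store a reference to that dataset in an attribute on the target object */
    hobj_ref_t ref;
    H5Rcreate(&ref, file, "/attr_values", H5R_OBJECT, -1);
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr   = H5Acreate2(target_obj, "values_ref", H5T_STD_REF_OBJ,
                              aspace, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_STD_REF_OBJ, &ref);

    H5Aclose(attr); H5Sclose(aspace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
}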

How many elements do these attribute arrays typically have and how big are the strings?

Best, G.


Hello Gerd,

I am doing well. I hope you are too.

It is unfortunate that an H5Aset_extent() function does not exist, because it looks like it could do exactly what I want. Understandably, you have some questions about what I want to do. Let me try to put things in context.

Currently, we have a multi-dimensional dataset which stores the data like this: (#quantities, #dim1, …, #dimN), where #quantities is the number of physical quantities and #dim1 is the number of elements for the first dimension, etc. As an example, if you would like to store the velocity, density and temperature of a cube of 101 x 71 x 51 elements, the dataset would have an extent of (3, 101, 71, 51). Furthermore, for each quantity we store its name and its unit system. This is done using an HDF5 attribute for which we create a compound type that holds 2 strings of 64 characters each; the dataspace of the attribute corresponds to the number of quantities we want to store.
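In code, setting up that attribute looks roughly like this (a simplified sketch with placeholder names; error checking omitted):

#include <string.h>
#include "hdf5.h"

/* Compound of two fixed-size 64-character strings, one record per quantity. */
typedef struct {
    char name[64];
    char unit[64];
} quantity_t;

static hid_t make_quantity_type(void)
{
    hid_t str64 = H5Tcopy(H5T_C_S1);
    H5Tset_size(str64, 64);

    hid_t tid = H5Tcreate(H5T_COMPOUND, sizeof(quantity_t));
    H5Tinsert(tid, "name", HOFFSET(quantity_t, name), str64);
    H5Tinsert(tid, "unit", HOFFSET(quantity_t, unit), str64);
    H5Tclose(str64);
    return tid;
}

/* attach the attribute to the dataset, one record per quantity */
static void write_quantity_attribute(hid_t dset, const quantity_t *q, hsize_t nquantities)
{
    hid_t   tid     = make_quantity_type();
    hsize_t dims[1] = { nquantities };
    hid_t   space   = H5Screate_simple(1, dims, NULL);
    hid_t   attr    = H5Acreate2(dset, "quantities", tid, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, tid, q);
    H5Aclose(attr); H5Sclose(space); H5Tclose(tid);
}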

In our code, it is possible to add another physical quantity to an existing dataset. We use H5Dset_extent to adjust the extent of the dataset. Next, we need to update the attribute which holds the physical quantities to include the newly added quantity. Internally, we make sure that there is a one-to-one correspondence between the quantity index and the index into the attribute array.
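The dataset part of the update is the easy bit; assuming the dataset was created chunked with an unlimited first dimension (which H5Dset_extent requires), it is roughly:

#include "hdf5.h"

/* Sketch: grow the leading (quantity) dimension of a 4-D dataset by one. */
void add_quantity_slot(hid_t dset)
{
    hsize_t dims[4];
    hid_t space = H5Dget_space(dset);
    H5Sget_simple_extent_dims(space, dims, NULL);
    H5Sclose(space);

    dims[0] += 1;             /* e.g. (3, 101, 71, 51) -> (4, 101, 71, 51) */
    H5Dset_extent(dset, dims);

    /* ...then rewrite the "quantities" attribute with one extra record,
       as in the delete-and-recreate sketch above. */
}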

Hopefully, this gives a bit of context. From what I have seen so far, the number of physical quantities can vary from 1 up to 50, and 50 is rather an edge case.

Best regards,
Jan-Willem

Hi Jan-Willem,
Attributes don’t come with the same flexibility as datasets, as @gheber pointed out; on the other hand, they tend to be small in size. In fact, attributes are limited to 64KB. Did you consider overwriting the existing attribute with the updated information?

best wishes: steven

@jan-willem.blokland, you are constrained by having diverse physical quantities inside a single dataset. I suggest a different strategy with a separate dataset for each physical quantity.

I dare say this is “the easy way” that you asked for. Each dataset will have its own scalar attributes for name, units, and other useful metadata. Attributes with dimensions and related problems are avoided. Updates are simplified. It is easy to iterate over datasets when reading, and to add new datasets. There are other benefits such as per-dataset data type, dimensionality, precision, and compression strategy.

HDF5 is well suited to this strategy. This is also approximately the netCDF scientific data model, which is widely used in some sectors.

If your current application requires multiple “composite” datasets inside a single file, then use HDF5 groups to consolidate the related datasets.
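For illustration, a rough sketch of that layout in C (group name, quantity names, element type, and extents are all placeholders; error checking omitted):

#include "hdf5.h"

/* One dataset per quantity inside a group, each with scalar "name" and
   "units" string attributes. */
static void write_scalar_string_attr(hid_t obj, const char *attr_name, const char *value)
{
    hid_t str_t = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_t, H5T_VARIABLE);          /* variable-length C string */
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t attr  = H5Acreate2(obj, attr_name, str_t, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, str_t, &value);
    H5Aclose(attr); H5Sclose(space); H5Tclose(str_t);
}

static void add_quantity(hid_t group, const char *qname, const char *units)
{
    hsize_t dims[3] = { 101, 71, 51 };
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(group, qname, H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    write_scalar_string_attr(dset, "name",  qname);
    write_scalar_string_attr(dset, "units", units);
    H5Dclose(dset); H5Sclose(space);
}

int main(void)
{
    hid_t file  = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/fields", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    add_quantity(group, "velocity",    "m/s");
    add_quantity(group, "density",     "kg/m^3");
    add_quantity(group, "temperature", "K");
    H5Gclose(group); H5Fclose(file);
    return 0;
}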

Perhaps you are too deeply invested in the shared dataset strategy. If so, consider this for the next application.

I thank and agree with my fellow responders @steven and @dave.allured.

The path of least resistance might be a “dimensions” attribute (NetCDF :wink:) whose type is a variable-length sequence of the string-pair compound (which presumably holds the quantity name or symbol and the unit designation). The length and order of the elements of the sequence would be expected to stay in sync with the leading dimension of your extendible dataset. The only quirk would be that appending a new dimension means you must H5Aread the whole sequence, extend it (in memory), and then H5Awrite the whole attribute again. Not pretty, but, unless you are constantly changing dimensions, not really a performance bottleneck for a few string-pair records.
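A sketch of that read-extend-rewrite cycle (placeholder names; error checking omitted; the read buffer is reclaimed with H5Treclaim, which requires HDF5 1.12 or later, older versions would use H5Dvlen_reclaim):

#include <stdlib.h>
#include <string.h>
#include "hdf5.h"

/* Scalar attribute whose type is a variable-length sequence of {name, unit}
   string pairs. Appending a record means: read the whole sequence, extend it
   in memory, write it back. */
typedef struct { char name[64]; char unit[64]; } pair_t;

void append_pair(hid_t obj, const char *attr_name, const pair_t *new_rec)
{
    /* build the {name, unit} compound and its vlen wrapper */
    hid_t str64 = H5Tcopy(H5T_C_S1);
    H5Tset_size(str64, 64);
    hid_t ctid  = H5Tcreate(H5T_COMPOUND, sizeof(pair_t));
    H5Tinsert(ctid, "name", HOFFSET(pair_t, name), str64);
    H5Tinsert(ctid, "unit", HOFFSET(pair_t, unit), str64);
    hid_t vtid  = H5Tvlen_create(ctid);

    /* 1. read the current sequence */
    hid_t attr  = H5Aopen(obj, attr_name, H5P_DEFAULT);
    hid_t space = H5Aget_space(attr);
    hvl_t seq;
    H5Aread(attr, vtid, &seq);

    /* 2. extend it in memory */
    pair_t *buf = malloc((seq.len + 1) * sizeof(pair_t));
    memcpy(buf, seq.p, seq.len * sizeof(pair_t));
    buf[seq.len] = *new_rec;
    hvl_t out = { seq.len + 1, buf };

    /* 3. write the whole sequence back */
    H5Awrite(attr, vtid, &out);

    H5Treclaim(vtid, space, H5P_DEFAULT, &seq);
    free(buf);
    H5Aclose(attr); H5Sclose(space);
    H5Tclose(vtid); H5Tclose(ctid); H5Tclose(str64);
}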

Best, G.

Hello @gheber, @steven and @dave.allured,

Thanks for all your comments and suggestions. It is always interesting to talk to fellow software engineers and developers. Often you get so many good suggestions that it becomes difficult to choose the most appropriate one for the problem you want to solve.

Currently, we have several applications which read and/or write data in their own (application-dependent) formats. As a result, for the average user it is not so straightforward to take data written by one application and use it as input for another application. To solve this, we have written a library built on top of HDF5 to streamline it all. So, in principle, we could change how we store the data in HDF5 without the need to change all the applications which make use of the library.

Furthermore, these application-specific formats are typically not nearly as flexible as HDF5. In that sense, switching to HDF5 is a huge difference. Our current data storage design is some kind of middle ground: introduce some flexibility without going completely overboard with it. Ideal? Probably not. Time will tell. For the moment we stick with our current solution, which is more or less what @gheber describes in his last comment. Nevertheless, it is good to know there are other options too. Thanks for mentioning that there is a size limit of 64KB for an attribute.

About the other options: I understand both @dave.allured’s and @steven’s suggestions, but I have no idea how to implement @steven’s option. Is there a simple example in C of this option?

Best regards,


This limitation is in effect only for the pre-2.0 versions of the file format specification. Dense attribute storage was introduced in version 2.0 of the file format specification and implemented in HDF5 1.8.0. After that, attribute values can be of arbitrary size. I believe the API for dense attribute storage (H5Pset_attr_phase_change) was also introduced in HDF5 1.8.0, and any later version will support it.
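For completeness, dense attribute storage can be requested through the object creation property list; a minimal sketch (placeholder names, error checking omitted):

#include "hdf5.h"

/* Sketch: force dense attribute storage on a new dataset so its attributes
   are not subject to the 64 KB compact-storage limit. */
hid_t create_dataset_with_dense_attrs(hid_t file)
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    /* max_compact = 0, min_dense = 0: store every attribute densely */
    H5Pset_attr_phase_change(dcpl, 0, 0);

    hsize_t dims[1] = { 100 };
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Sclose(space);
    H5Pclose(dcpl);
    return dset;
}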

Best, G.


Thanks @gheber for the information. Good to know that there is no limitation regarding the attribute size in the latest HDF5 1.12.1 version. It looks like one of those knobs you have no clue what it does until you need it.

Example to rewrite attributes

While it is not possible to append to or extend attributes in HDF5, attributes usually represent sideband information of relatively small size. In fact, in previous HDF5 versions the attribute size was limited to 64 KB; however, as Gerd Heber points out, this limitation has since been lifted.

Having said that, it is a good strategy to break the append operation up into a read of the old values and a write of a new attribute. The implementation is straightforward and, when used properly, it is also performant.

#include <vector>
#include <armadillo>
#include <h5cpp/all>

int main(void) {
    h5::fd_t fd = h5::create("h5cpp.h5", H5F_ACC_TRUNC);
    arma::mat data(10, 5, arma::fill::zeros);

    { // create a dataset from the armadillo matrix, then add a vector of integers as an attribute
        h5::ds_t ds = h5::write(fd, "some_dataset", data);       // write the dataset and obtain its descriptor
        h5::awrite(ds, "attribute_name", {1, 2, 3, 4, 5, 6, 7}); // attach the attribute
    }
}

will give you the following layout

h5dump -a /some_dataset/attribute_name  h5cpp.h5
HDF5 "h5cpp.h5" {
ATTRIBUTE "attribute_name" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 7 ) / ( 7 ) }
   DATA {
   (0): 1, 2, 3, 4, 5, 6, 7
   }
}
}

To update the attribute you need to remove it first, since H5CPP doesn’t yet do this automatically; in fact, there is no h5::adelete either! However, by design you can interchange HDF5 C API calls with H5CPP templates, so here is the update with H5Adelete and h5::awrite:

std::vector<int> values = {1, 2, 3, 4, 5, 6, 7, 20, 21, 22, 23, 24, 25, 26};
H5Adelete(ds, "attribute_name");           // remove the old attribute
h5::awrite(ds, "attribute_name", values);  // recreate it with the extended contents

And now you can see the original values followed by the appended values 20 through 26:

h5dump -a /some_dataset/attribute_name  h5cpp.h5
HDF5 "h5cpp.h5" {
ATTRIBUTE "attribute_name" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 14 ) / ( 14 ) }
   DATA {
   (0): 1, 2, 3, 4, 5, 6, 7, 20, 21, 22, 23, 24, 25, 26
   }
}
}

best wishes: steve
