Best practice for sizing string attribute

hoehn_michael · January 7, 2020, 5:02pm

https://support.hdfgroup.org/HDF5/doc/RM/RM_H5T.html#Datatype-SetSize states that for herr_t H5Tset_size( hid_t dtype_id, size_t size ): "String or character datatypes: The size set for a string datatype should include space for the null-terminator character, otherwise it will not be stored on (or retrieved from) disk."
However, HDFView allows you to store strings without space for the null terminator.
Is it best practice to make space for the null terminator, or to make the string the size of the data? (I.e., for the string “test”, should the size be 4 or 5?)

Thanks,
Michael

gheber · January 8, 2020, 1:56am

Michael, how are you? Generally, please refer to the newer documentation at https://portal.hdfgroup.org/display/HDF5/Core+Library . The answer to your questions has two parts:

1. The formulation, which is there also in the newer version of the documentation, is misleading (our fault!), and I’ll explain why.
2. Without more context, the answer to what’s the best practice is, “It depends.”

Ad 1.) The explicit construction of a string datatype has three steps. It begins with choosing the “right size” for your string datatype. If we ignore the case of variable-length strings, picking a suitable fixed size requires a certain look-ahead: In steps two and three, we will have to choose a character encoding (ASCII or UTF-8) and decide on a string padding. Both may have an influence on what’s a suitable size. If we go with UTF-8 encoding, it’d be prudent to make the size four times the number of prospective Unicode code points, because the UTF-8 encoding of an individual code point requires between one and four bytes. Assuming we’ve settled on a suitable size, there is still the issue of string padding. See https://portal.hdfgroup.org/display/HDF5/H5T_SET_STRPAD . If we go with H5T_STR_NULLTERM, we have to bump our current size candidate by one to make room for the terminating null. The writer of the formulation in https://portal.hdfgroup.org/display/HDF5/H5T_SET_SIZE made an unstated assumption about the string padding, and that’s why the formulation is misleading. (I will file a ticket to get that fixed.)
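The sizing rule and the three steps above can be sketched in C roughly as follows. This is a minimal sketch, not a definitive implementation: `fixed_string_size` and `make_fixed_string_type` are hypothetical helper names (not part of HDF5), and the `HAVE_HDF5` macro is an assumption used only so that the arithmetic compiles on its own without the library. The HDF5 calls themselves (`H5Tcopy`, `H5Tset_size`, `H5Tset_cset`, `H5Tset_strpad`) are the standard C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of HDF5): worst-case size in bytes of a
 * fixed-length string datatype that must hold n_chars characters.
 *   utf8            != 0: reserve 4 bytes per code point, else 1 byte
 *   null_terminated != 0: add 1 byte for the terminating null
 *                         (the H5T_STR_NULLTERM case)                  */
size_t fixed_string_size(size_t n_chars, int utf8, int null_terminated)
{
    size_t bytes_per_char = utf8 ? 4 : 1;

    /* Guard against overflow in the multiplication below. */
    assert(n_chars <= (SIZE_MAX - 1) / bytes_per_char);

    return n_chars * bytes_per_char + (null_terminated ? 1 : 0);
}

#ifdef HAVE_HDF5  /* assumption: defined by your build when linking HDF5 */
#include <hdf5.h>

/* Hypothetical helper: size, character set, and padding, in that order.
 * Returns a datatype id the caller must release with H5Tclose(),
 * or a negative value on failure.                                      */
hid_t make_fixed_string_type(size_t n_chars, H5T_cset_t cset, H5T_str_t strpad)
{
    size_t size = fixed_string_size(n_chars,
                                    cset == H5T_CSET_UTF8,
                                    strpad == H5T_STR_NULLTERM);

    hid_t type = H5Tcopy(H5T_C_S1);   /* start from the C string base type */
    if (type < 0)
        return type;

    if (H5Tset_size(type, size) < 0 ||      /* step 1: size     */
        H5Tset_cset(type, cset) < 0 ||      /* step 2: encoding */
        H5Tset_strpad(type, strpad) < 0) {  /* step 3: padding  */
        H5Tclose(type);
        return (hid_t)-1;
    }
    return type;
}
#endif /* HAVE_HDF5 */
```

For the ASCII string “test”, `fixed_string_size(4, 0, 1)` gives 5 for H5T_STR_NULLTERM and `fixed_string_size(4, 0, 0)` gives 4 for null- or space-padded strings, which is why both sizes can be valid depending on the padding.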

Ad 2.) The previous discussion, I hope, makes clear that there are several choices, and there is no unconditional best practice. The good news is that the HDF5 library will convert between language-specific string representations. For example, strings written by a FORTRAN program can be read or updated correctly from a C program. You can even write FORTRAN-stored strings from C and vice versa, but unless you have a specific purpose, it’ll be confusing at best. What’s your (language) environment and who are the consumers of your data (and what does their environment look like)? That should help you to settle the issue.

Best, G.

