UTF-8 encoding

Alexey · December 29, 2016, 4:54am

Hello HDF people,

I am relatively new (~5 months) to the HDF concept but I am becoming a true enthusiast.

I am using the Java HDF5 Interface (JHI5) <https://support.hdfgroup.org/products/java/JNI3/jhi5/index.html> in Java programs. I am interested in simple ways to use HDF5 as a binder for a collection of diverse datatypes, with annotations at the level of the file as well as at the level of individual datasets. Clearly, the HDF5 Attribute API provides a rich framework for annotation. However, I only need String annotations, so I am interested in String attributes.

I want to use UTF-8 encoded Strings of arbitrary length. I was looking for some explanations regarding UTF-8 encoding and I found this

https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/

In particular this document has the following snippet:
For example, the following commands could be used to create an 8-character, UTF-8 encoded, string datatype for use in either an attribute or dataset:
    datatype_id = H5Tcopy(H5T_C_S1) ;
    error = H5Tset_cset(datatype_id, H5T_CSET_UTF8) ;
    error = H5Tset_size(datatype_id, 8) ;
This is puzzling because “set_size” can work properly only if the size required by the UTF-8 String is known in advance. In general this is not the case, because a UTF-8 character may take from 1 to 4 bytes. I understand that most HDF5 users use ASCII all the time, and in that case this will work. Still, in the general case it seems to be plain wrong.

In other words, the only proper way to create a UTF-8 encoded string datatype is to provide a function which computes the size in bytes from the string object itself.
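
To illustrate what I mean, here is a minimal sketch in plain Java with no HDF5 calls at all. The class name Utf8Size and the method utf8ByteLength are hypothetical names made up for this example; they are not part of JHI5. The idea is only that the value passed to “set_size” would have to come from something like this rather than from the character count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of JHI5 or any HDF5 library.
final class Utf8Size {

    // Number of bytes the string occupies once it is encoded as UTF-8.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "datasets",                              // 8 ASCII letters
            "\u041f\u0440\u0438\u0432\u0435\u0442",  // 6 Cyrillic letters
            "\u20ac",                                // the euro sign
            "\ud83d\ude00"                           // one emoji (U+1F600), a surrogate pair in Java
        };
        for (String s : samples) {
            // Count Unicode code points, not Java chars, so the emoji counts as one character.
            int codePoints = s.codePointCount(0, s.length());
            System.out.println(codePoints + " character(s) -> " + utf8ByteLength(s) + " byte(s)");
        }
    }
}
```

The four samples need 8, 12, 3 and 4 bytes respectively, so a size chosen from the number of characters is correct only for the ASCII one.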

In fact, I currently store my String attributes just as 1D byte array datasets. It is very easy to convert between Strings and byte arrays in Java, and this works fine for me. My only discomfort is that the resulting HDF5 file is not a “proper” HDF5 file, in the sense that a third-party reader of my HDF5 file will not be able to interpret these attributes without additional information.
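
For reference, the conversion I rely on looks roughly like the sketch below. The class and method names (Utf8RoundTrip, encode, decode) are hypothetical and chosen for this example only, and the step that actually writes the byte array into the HDF5 file is left out.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; the HDF5 write/read step is omitted.
final class Utf8RoundTrip {

    // String -> bytes that would be stored as a 1D byte array.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes read back from the file -> String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String annotation = "Temperature, \u00b0C";  // contains one non-ASCII character
        byte[] stored = encode(annotation);
        String restored = decode(stored);
        System.out.println("round trip ok: " + restored.equals(annotation));
        System.out.println(annotation.length() + " chars stored as " + stored.length + " bytes");
    }
}
```

The round trip itself is lossless; what gets lost is the information that these bytes are UTF-8 text, since the file only records a byte array.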

To summarize, I would like to write and read UTF-8 strings as attributes using Java and still fully preserve the “self-describing” feature of the HDF5 format. Please advise.

Thanks,
Alexey
