UTF-8 encoding

Hello everyone,

I recently posted a question about UTF-8 encoding of attributes. As it turns out, it works for me using HDF5 JNI and Java.

My only problem was a confusing description in the documentation I found. Here is the code snippet I mentioned before:

···

--------------------------------------------------------------------------------------------------------------
datatype_id = H5Tcopy(H5T_C_S1) ;
    error = H5Tset_cset(datatype_id, H5T_CSET_UTF8) ;
    error = H5Tset_size(datatype_id, "8") ;
--------------------------------------------------------------------------------------------------------------
I was confused regarding the meaning of the data size parameter (“8”) - is it the number of bytes or the number of code points?

I ran a test and found that the size argument for H5Tset_size is the number of bytes in the byte buffer containing the UTF-8 encoded string, not the number of code points.
So basically you have to know the size of that byte buffer before you can write the data as a UTF-8 string.
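
To illustrate why the two counts differ, here is a small plain-Java sketch (no HDF5 calls). The class and method names (Utf8SizeDemo, utf8ByteLength, codePointLength) are made up for this example and are not part of any HDF5 API; the sample strings are written as Unicode escapes so the source stays ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8SizeDemo {

    // Number of bytes in the UTF-8 encoding of s.
    // According to the test described above, this is the value H5Tset_size expects.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in s (NOT the value H5Tset_size expects).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",                 // plain ASCII
            "\u00e9",              // e with acute accent
            "\u65e5\u672c\u8a9e",  // three CJK characters
            "\uD83D\uDE00"         // one emoji, stored as a surrogate pair
        };
        for (int i = 0; i < samples.length; i++) {
            String s = samples[i];
            System.out.printf("sample %d: length()=%d codePoints=%d utf8Bytes=%d%n",
                    i, s.length(), codePointLength(s), utf8ByteLength(s));
        }
    }
}
```

For the four samples this gives 3, 2, 9 and 4 bytes, against 3, 1, 3 and 1 code points. Note that String.length() is a third number again (UTF-16 code units), so neither it nor the code point count should be passed as the size unless the string is pure ASCII.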

So here are the essential steps to write an attribute whose name and dataset are both UTF-8 strings:

(1) Modify the standard H5T_C_S1 type by setting the char set to H5T_CSET_UTF8
(2) Obtain the length of the byte buffer which contains the UTF-8 encoded string
(3) Set the size of the modified datatype to the length of the byte buffer
(4) Create a scalar dataspace
(5) Create a modified "attribute create property list" and set the char set to UTF-8
(6) Create an attribute using the modified creation property list
(7) Write the attribute

Here is code from my Java program to add an attribute whose name and dataset are both UTF-8 strings:

//*********** adding String attribute encoded with UTF-8 char set ***************

private void add_UTF8_Attr(int loc_id, String attrname, String attr) {
        int dataspace_id = -1;
        int attribute_id = -1;
        int datatype_id = -1;
        try {
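            // Encode the value as UTF-8 ("UTF-8" is the canonical charset name).
            byte[] string2bytes = attr.getBytes("UTF-8");
            // The datatype size is the number of encoded bytes,
            // not attr.length() (the number of Java chars).
            int num_bytes = string2bytes.length;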
         
            datatype_id = H5.H5Tcopy(HDF5Constants.H5T_C_S1);
            H5.H5Tset_cset(datatype_id, HDF5Constants.H5T_CSET_UTF8);
            H5.H5Tset_size(datatype_id, num_bytes);
            
            // Create a scalar dataspace for the string attribute.
            dataspace_id = H5.H5Screate(HDF5Constants.H5S_SCALAR);
            
            //create modified "attribute create property list"
            // and set the character set to UTF-8
            int acpl_id = H5.H5Pcreate(HDF5Constants.H5P_ATTRIBUTE_CREATE);
            H5.H5Pset_char_encoding(acpl_id, HDF5Constants.H5T_CSET_UTF8) ;
            // Create an attribute with modified creation property list
            attribute_id = H5.H5Acreate(loc_id, attrname,
                    datatype_id, dataspace_id,
                    acpl_id, HDF5Constants.H5P_DEFAULT);
           
            if (attribute_id >= 0)
                H5.H5Awrite(attribute_id, datatype_id, string2bytes);
            H5.H5Sclose(dataspace_id);
            H5.H5Aclose(attribute_id);
            H5.H5Tclose(datatype_id);
            H5.H5Pclose(acpl_id);
        } catch (HDF5Exception|NullPointerException|UnsupportedEncodingException ex) {
            Logger.getLogger(SimpleH5.class.getName()).log(Level.SEVERE, null, ex);
          
        }
    }

//************************************************************************************************

Alexey