My program stores text in memory and on disk as UTF-8. Because it can import text from any source at a user’s request, it is possible for it to import text containing invalid UTF-8 sequences (byte sequences that are illegal in UTF-8).
My program detects and handles invalid UTF-8 sequences by displaying invalid bytes using the Unicode replacement character (�) but leaving the invalid bytes in memory until the user removes or replaces them.
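For context, the detection step looks roughly like the sketch below. This is not my actual code; the function names `utf8_valid_len` and `count_invalid_bytes` are made up for illustration. It assumes the usual well-formedness rules for UTF-8: no overlong encodings, no surrogates, and nothing above U+10FFFF. Each byte that does not begin a well-formed sequence is counted as one invalid byte, which is the unit I display as a replacement character.

```c
#include <stddef.h>

/* Returns the length (1 to 4) of the well-formed UTF-8 sequence that starts
   at s, or 0 if the bytes at s do not begin a well-formed sequence.
   n is the number of bytes available at s. */
static size_t utf8_valid_len(const unsigned char *s, size_t n)
{
    size_t need;
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    size_t i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;                 /* ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;                               /* C0 and C1 are overlong */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0) lo = 0xA0;            /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;            /* reject surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0) lo = 0x90;            /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;            /* reject > U+10FFFF */
    } else {
        return 0;                               /* stray continuation or F5..FF */
    }

    if (n < need) return 0;                     /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < need; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return need;
}

/* Counts the bytes in buf that are not part of any well-formed sequence.
   The buffer itself is left untouched. */
size_t count_invalid_bytes(const unsigned char *buf, size_t len)
{
    size_t i = 0, bad = 0;
    while (i < len) {
        size_t k = utf8_valid_len(buf + i, len - i);
        if (k == 0) { bad++; i++; }             /* skip one invalid byte */
        else        { i += k; }                 /* skip a whole valid sequence */
    }
    return bad;
}
```

The point is that the invalid bytes are only flagged, never rewritten, so whatever I write to the HDF5 file has to round-trip them exactly.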
My question is: What HDF5 datatype should I use when writing UTF-8 text that may contain invalid sequences to an HDF5 file?
My first inclination was to use H5T_C_S1 with H5T_CSET_UTF8, like this:
hid_t datatype_id = H5Tcopy(H5T_C_S1);
H5Tset_cset(datatype_id, H5T_CSET_UTF8);
But I am concerned that the HDF5 library will be unhappy if I store invalid UTF-8 text this way. I have been unable to find documentation that explains the implications of using H5T_CSET_UTF8 (that is, whether and how the HDF5 library’s behavior changes when the character set is H5T_CSET_UTF8).
So now I am thinking of storing it as opaque data, like this:
datatype_id = H5Tcreate(H5T_OPAQUE, length);
Any thoughts? Thanks.