What datatype to use for UTF-8 text


#1

My program stores text in memory and on disk as UTF-8. Because it can import text from any source at a user’s request, it is possible for it to import text containing invalid UTF-8 sequences (byte sequences that are illegal in UTF-8).

My program detects and handles invalid UTF-8 sequences by displaying invalid bytes using the Unicode replacement character (�) but leaving the invalid bytes in memory until the user removes or replaces them.
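
For illustration, the display behaviour described above could be sketched roughly as follows. This is a minimal sketch, not the actual program: the names `utf8_sequence_length` and `render_for_display` are hypothetical, and it assumes that each invalid byte is shown as one replacement character while the stored bytes are left untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not valid UTF-8.
static std::size_t utf8_sequence_length(const std::string &s, std::size_t i)
{
	const unsigned char b0 = static_cast<unsigned char>(s[i]);
	std::size_t len = 0;
	unsigned char lo = 0x80, hi = 0xBF;	// allowed range for the second byte

	if (b0 <= 0x7F) {
		return 1;			// ASCII
	} else if (b0 >= 0xC2 && b0 <= 0xDF) {
		len = 2;			// 0xC0 and 0xC1 would be overlong
	} else if (b0 >= 0xE0 && b0 <= 0xEF) {
		len = 3;
		if (b0 == 0xE0) lo = 0xA0;	// reject overlong encodings
		if (b0 == 0xED) hi = 0x9F;	// reject UTF-16 surrogates
	} else if (b0 >= 0xF0 && b0 <= 0xF4) {
		len = 4;
		if (b0 == 0xF0) lo = 0x90;	// reject overlong encodings
		if (b0 == 0xF4) hi = 0x8F;	// reject values above U+10FFFF
	} else {
		return 0;			// stray continuation or invalid lead byte
	}

	if (i + len > s.size()) return 0;	// sequence is truncated
	const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
	if (b1 < lo || b1 > hi) return 0;
	for (std::size_t k = 2; k < len; ++k) {
		const unsigned char b = static_cast<unsigned char>(s[i + k]);
		if (b < 0x80 || b > 0xBF) return 0;
	}
	return len;
}

// Hypothetical helper: builds the text to display. The stored bytes are not
// modified; each invalid byte is rendered as U+FFFD in the returned copy.
std::string render_for_display(const std::string &stored)
{
	std::string shown;
	std::size_t i = 0;
	while (i < stored.size()) {
		const std::size_t len = utf8_sequence_length(stored, i);
		if (len == 0) {
			shown += "\xEF\xBF\xBD";	// U+FFFD encoded as UTF-8
			++i;				// skip only the offending byte
		} else {
			shown.append(stored, i, len);
			i += len;
		}
	}
	return shown;
}
```

With this sketch, well-formed input is returned unchanged, while a Latin-1 string such as "h\xe4\xdflich" is displayed with two replacement characters in place of the bytes 0xE4 and 0xDF.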

My question is: What HDF5 datatype should I use when writing UTF-8 text that may contain invalid sequences to an HDF5 file?

My first inclination was to use H5T_C_S1 with H5T_CSET_UTF8, like this:

hid_t datatype_id = H5Tcopy(H5T_C_S1);
H5Tset_cset(datatype_id, H5T_CSET_UTF8);

But I am concerned that the HDF5 library will be unhappy if I store invalid UTF-8 text this way. I have been unable to find documentation that explains the implications of using H5T_CSET_UTF8 (i.e., if/how the HDF5 library’s behavior changes if you use H5T_CSET_UTF8).

So now I am thinking of storing it as opaque data, like this:

datatype_id = H5Tcreate(H5T_OPAQUE, length);

Any thoughts? Thanks.


#2

The following works without error, so it behaves the way you want. HDFView displays the replacement character for /valid-latin1 and the correct characters for /valid-utf8. h5py cannot read either dataset.

#include <string>
#include <H5Cpp.h>

int main(int argc, char **argv)
{
	using namespace H5;
	// "häßlich" encoded as well-formed UTF-8.
	static std::string valid_utf8("h\xc3\xa4\xc3\x9flich");
	// The same word encoded as Latin-1; these bytes are not well-formed UTF-8.
	static std::string valid_latin1("h\xe4\xdflich");
	H5File file("test_utf8.h5", H5F_ACC_TRUNC);

	// Fixed-length string tagged as UTF-8, holding well-formed UTF-8.
	DataSpace space1(H5S_SCALAR);
	StrType type1(PredType::C_S1, valid_utf8.length());
	type1.setCset(H5T_CSET_UTF8);
	DataSet ds1 = file.createDataSet("valid-utf8", type1, space1);
	ds1.write(valid_utf8.data(), type1);

	// Same datatype, but the bytes written are Latin-1 (invalid as UTF-8).
	DataSpace space2(H5S_SCALAR);
	StrType type2(PredType::C_S1, valid_latin1.length());
	type2.setCset(H5T_CSET_UTF8);
	DataSet ds2 = file.createDataSet("valid-latin1", type2, space2);
	ds2.write(valid_latin1.data(), type2);

	return 0;
}