What datatype to use for UTF-8 text


#1

My program stores text in memory and on disk as UTF-8. Because it can import text from any source at a user’s request, it is possible for it to import text containing invalid UTF-8 sequences (byte sequences that are illegal in UTF-8).

My program detects and handles invalid UTF-8 sequences by displaying invalid bytes using the Unicode replacement character (�) but leaving the invalid bytes in memory until the user removes or replaces them.
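For reference, that replacement-on-display step can be sketched roughly like this (a minimal, hypothetical helper, not my program's actual code): scan the buffer, copy well-formed sequences through unchanged, and substitute U+FFFD for each offending byte, leaving the source buffer untouched.

```cpp
#include <cstdint>
#include <string>

// Build a display copy of a byte buffer in which every invalid UTF-8
// sequence is shown as U+FFFD (the replacement character). The original
// bytes are not modified. Hypothetical sketch, not library code.
std::string to_display_utf8(const std::string& in)
{
    static const std::string kReplacement = "\xEF\xBF\xBD"; // U+FFFD
    std::string out;
    size_t i = 0;
    while (i < in.size()) {
        unsigned char b = in[i];
        size_t len = 0;       // total bytes in the sequence
        uint32_t min_cp = 0;  // smallest code point this length may encode
        if (b < 0x80)                { len = 1; }
        else if ((b & 0xE0) == 0xC0) { len = 2; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; min_cp = 0x10000; }
        if (len == 0 || i + len > in.size()) {   // bad lead byte or truncated
            out += kReplacement;
            ++i;
            continue;
        }
        // Decode the sequence while checking each continuation byte.
        uint32_t cp = (len == 1) ? b : (b & (0x7F >> len));
        bool ok = true;
        for (size_t j = 1; j < len; ++j) {
            unsigned char c = in[i + j];
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
        if (ok && len > 1 && (cp < min_cp || cp > 0x10FFFF ||
                              (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;
        if (ok) { out.append(in, i, len); i += len; }
        else    { out += kReplacement;    ++i;      }
    }
    return out;
}
```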

My question is: what HDF5 datatype should I use when writing UTF-8 text that may contain invalid sequences to an HDF5 file?

My first inclination was to use H5T_C_S1 with H5T_CSET_UTF8, like this:

hid_t datatype_id = H5Tcopy(H5T_C_S1);
H5Tset_cset(datatype_id, H5T_CSET_UTF8);

But I am concerned that the HDF5 library will be unhappy if I store invalid UTF-8 text this way. I have been unable to find documentation that explains the implications of using H5T_CSET_UTF8 (i.e., whether and how the library's behavior changes when you use it).

So now I am thinking of storing it as opaque data, like this:

datatype_id = H5Tcreate(H5T_OPAQUE, length);

Any thoughts? Thanks.


#2

The following works without error, so it behaves the way you want. HDFView displays the replacement character for /valid-latin1 and the correct characters for /valid-utf8. h5py can read neither.

#include <string>
#include <H5Cpp.h>

int main(int argc, char **argv)
{
	using namespace H5;
	// "häßlich" encoded as UTF-8
	static std::string valid_utf8("h\xc3\xa4\xc3\x9flich");
	// the same word encoded as Latin-1, which is not valid UTF-8
	static std::string valid_latin1("h\xe4\xdflich");
	H5File file("test_utf8.h5", H5F_ACC_TRUNC);

	DataSpace space1(H5S_SCALAR);
	StrType type1(PredType::C_S1, valid_utf8.length());
	type1.setCset(H5T_CSET_UTF8);
	DataSet ds1 = file.createDataSet("valid-utf8", type1, space1);
	ds1.write(valid_utf8.data(), type1);

	DataSpace space2(H5S_SCALAR);
	StrType type2(PredType::C_S1, valid_latin1.length());
	type2.setCset(H5T_CSET_UTF8);
	DataSet ds2 = file.createDataSet("valid-latin1", type2, space2);
	ds2.write(valid_latin1.data(), type2);

	return 0;
}

#3

Thanks for your investigation.

I replicated your tests using the C API. I tested using scalar fixed-length and variable-length H5T_CSET_UTF8 string datasets containing valid ASCII, valid UTF-8, and invalid UTF-8.

I found that HDFView 2.13 loads all of my string datasets without error, including those that contain invalid UTF-8 sequences.

I got errors in HDFView 3.0 even when loading valid UTF-8. I reported this as a possible bug to The HDF Group and received a confirmation from them.

I searched the HDF5 1.10.2 library source code for H5T_CSET_UTF8. I see nothing that attempts to validate the text in a UTF-8 string dataset, so, as it stands now, I am confident that I can write and read UTF-8 strings even if they contain invalid UTF-8 sequences.

I'm somewhat concerned that The HDF Group could add such validation in a future version of the library, and that code using that version might then return an error for datasets that work with the current one. Hopefully The HDF Group's programmers would realize that such validation would break compatibility with existing datasets.


#4

Attached is the test file that I created.
UTF8Strings.h5 (10 KB)


#5

Hi Howard, all!

On 24.09.2018 20:09, Howard Rodstein wrote:

[…]

I have checked, and indeed h5py fails to read e.g.
Latin1VariableStringClaimingToBeUTF8 out of the box. However, this could
be fixed relatively easily by passing "replace" instead of NULL as the
error handler in h5py's PyUnicode_DecodeUTF8 calls.

But storing Latin-1-encoded text in a UTF-8 slot and explicitly marking
it as UTF-8 is a really, REALLY terrible idea. It is bad even compared
to H5T_OPAQUE, which is itself not the most user-friendly option.

I believe by raising an exception, h5py currently does the right thing.

Best wishes,
Andrey Paramonov


#6

What would you recommend for this situation:

  1. You need to save text read from any user-chosen file as a dataset in an HDF5 file.

  2. The user can choose any text file. In the vast majority of cases, it will be valid UTF-8, but occasionally it will be invalid.

  3. If the user opens a text file that is not valid as UTF-8, you do not want to cleanse the file at that point because the user may choose to reinterpret it as some other text encoding at a later time.

Thanks.


#7

Hi Howard!

On 25.09.2018 17:46, Howard Rodstein wrote:

[…]

Determine the file encoding at selection time; often you can guess it
(https://chardet.readthedocs.io), but asking the user for confirmation is
more reliable. Decode the text using the specified encoding, re-encode it
as UTF-8, and store it in the HDF5 file as UTF-8 text.
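For the common case where the detected encoding turns out to be Latin-1 (ISO-8859-1), the decode-and-re-encode step is trivial, since each Latin-1 byte equals its code point. A sketch (real conversions should go through a library such as iconv or ICU):

```cpp
#include <string>

// Re-encode a Latin-1 byte buffer as UTF-8. Bytes below 0x80 are ASCII
// and pass through; bytes 0x80-0xFF become two-byte UTF-8 sequences.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);                 // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // two-byte lead
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation
        }
    }
    return out;
}
```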

Best wishes,
Andrey Paramonov


#8

This does not address the situation I described. "Often you can guess" is not reliable, nor is asking the user, who often does not know what a text encoding is.

If I cleansed files (i.e., replaced invalid bytes with the Unicode replacement character when opened), this would not be a problem. But I am reluctant to do that because it precludes the user from reinterpreting the file using a different text encoding at a later time.

I am leaning toward representing valid UTF-8 as string datasets with H5T_CSET_UTF8 encoding and representing invalid UTF-8 as H5T_OPAQUE.
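For what it's worth, that decision reduces to a plain validity check (a hypothetical helper, independent of HDF5): if it returns true, write the buffer as an H5T_C_S1 string with H5T_CSET_UTF8; otherwise fall back to an H5T_OPAQUE type of the same length.

```cpp
#include <cstdint>
#include <string>

// Return true if the buffer is well-formed UTF-8: correct lead and
// continuation bytes, no truncated or overlong sequences, no UTF-16
// surrogates, nothing past U+10FFFF. Sketch only, not library code.
bool is_valid_utf8(const std::string& s)
{
    size_t i = 0;
    while (i < s.size()) {
        unsigned char b = s[i];
        size_t extra;          // number of continuation bytes expected
        uint32_t cp, min_cp;   // decoded value and its minimum for length
        if (b < 0x80)                { ++i; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min_cp = 0x10000; }
        else return false;                       // stray continuation or bad lead
        if (i + extra >= s.size()) return false; // truncated sequence
        for (size_t j = 1; j <= extra; ++j) {
            unsigned char c = s[i + j];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                        // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```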


#9

Hi Howard!

On 25.09.2018 19:04, Howard Rodstein wrote:

[…]

If the user doesn’t know what the encoding of her text files is, that
means either

  1. We are in the bright future of http://utf8everywhere.org,

or

  2. The user’s file is just a pile of bytes.
     H5T_OPAQUE is a wise choice in this case.

Best wishes,
Andrey Paramonov