UTF-8 string support

Hello,

I am still working on testing my library (built on top of HDF5) in combination with the HSDS functionality (see Fill value support - HSDS - HDF Forum (hdfgroup.org)). In at least one of my tests I make use of a fixed-size UTF-8 string. Writing this string as an attribute works fine, and as far as I can tell the string has been stored correctly in the JSON file. This is done using the VOL-REST API. Unfortunately, I am unable to read this string back using the VOL-REST API:

HDF5 REST VOL-DIAG: Error detected in HDF5 REST VOL (1.0.0) thread 1: #000: /home/user/projects/vol-rest/src/src/rest_vol_attr.c line 957 in RV_attr_read(): 400 - Malformed/Bad request for resource major: Attribute minor: Read failed

Making a local copy of the file using hsget also fails. After a small modification of the jsontoArray() function in the file base.py of the Python package h5pyd, I got it working. The contents of this local copy are what I expected. I also tried to store a UTF-8 string using h5pyd directly, but unfortunately this resulted in an error.

So my question is:
Is UTF-8 currently not supported in h5pyd, HSDS and/or VOL-REST? If so, are there any plans to properly support it?

Note: If you want, I can upstream my modified version of the jsontoArray() function. My modification is a bit clumsy because I do not know the exact details of the internal data structures. Anyway, it gives a starting point.

Best regards,
Jan-Willem

– Update –
Regarding storing a UTF-8 string using h5pyd directly, I got that working without any code change. I was not aware there is a difference between np.string_ and np.str_. When using np.str_ it works.
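For anyone else who trips over the same thing, here is a minimal sketch of the NumPy-level difference. It only uses NumPy and does not show how h5pyd maps each type to an HDF5 datatype (the thread does not spell that out). The sample text is a made-up value, and np.bytes_ is the byte-string type that older NumPy releases also exposed as np.string_.

```python
import numpy as np

text = "Grüße, 世界"  # hypothetical sample value with non-ASCII characters

# np.bytes_ holds raw bytes: the caller has to encode explicitly, and the
# resulting value carries no record of which encoding was used.
as_bytes = np.bytes_(text.encode("utf-8"))

# np.str_ holds Unicode code points: no encoding choice is made at this point.
as_str = np.str_(text)

bytes_dtype = np.array(as_bytes).dtype  # byte-string dtype, kind 'S'
str_dtype = np.array(as_str).dtype      # Unicode dtype, kind 'U'

print(bytes_dtype.kind, str_dtype.kind)
print(len(as_bytes), "bytes versus", len(as_str), "characters")
```

The byte count and the character count differ as soon as the text leaves ASCII, which is why the two types cannot be treated as interchangeable.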

This is likely an issue on the REST VOL’s side - all string datatypes are treated as having an ASCII character set. Thanks for bringing this to my attention, I’ll make an issue for it on GitHub.

Thanks @mlarson.

Hopefully you or someone else could also implement UTF-8 support in the hsget tool. The other tools may have a similar limitation.

The other day, I noticed a few PRs about supporting UTF-8 in HSDS and VOL-REST. Please let me know if I need to test things or when it is ready to be used.

HSDS now supports fixed-width UTF-8 encoded strings, though there hasn’t been a new release since the change was merged. The REST VOL’s main branch can read and write fixed-length UTF-8 to attributes and datasets, but the character set of the datatype will be considered ASCII. There’s a PR to the REST VOL which is still in review to set the character set of the datatype correctly. Hope this answers your question.
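One thing worth keeping in mind when testing fixed-width UTF-8: the width of an HDF5 fixed-length string is counted in bytes, not characters, so a multi-byte character can straddle the limit. The sketch below is not taken from HSDS or the REST VOL; fit_utf8 is a hypothetical helper, and null padding is just one possible padding choice.

```python
def fit_utf8(text: str, width: int) -> bytes:
    """Encode text as UTF-8 and fit it into exactly `width` bytes.

    Hypothetical helper for illustration only. Over-long input is cut at a
    character boundary, and short input is padded with null bytes.
    """
    encoded = text.encode("utf-8")
    if len(encoded) > width:
        # Cut at the byte limit, then drop any incomplete trailing character
        # so the stored bytes are still valid UTF-8.
        encoded = encoded[:width].decode("utf-8", errors="ignore").encode("utf-8")
    return encoded.ljust(width, b"\x00")


stored = fit_utf8("世界", 4)  # each character needs 3 bytes, so only one fits
print(stored.rstrip(b"\x00").decode("utf-8"))
```

A naive byte slice at width 4 would leave a dangling lead byte, and the value could no longer be decoded as UTF-8 on read.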

The UTF-8 support is in the master branch of HSDS. The corresponding REST VOL PR (Support UTF-8 string datatype encoding by mattjala · Pull Request #87 · HDFGroup/vol-rest · GitHub) is not merged yet, but you are welcome to try it out. There are a lot of edge cases to deal with, so some real world testing would be great.

The hsget support will be a while yet. Am planning to finish up this round of HSDS updates and then create a new h5pyd release that supports the new features.

Hello @mlarson and @jreadey,

Thanks for the update. I also need to update my own code, since we use UTF-8 strings but forgot to indicate this properly in the HDF5 file. So far this works with the native VOL because it does not check the character set in use. Looking at your recent changes, such a check does exist in REST VOL and HSDS.
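To illustrate why our files slipped through with the native VOL but would not pass a character set check, here is a rough sketch. It is not the actual check in HSDS or the REST VOL; matches_declared_charset is a hypothetical helper, and only the names H5T_CSET_ASCII and H5T_CSET_UTF8 come from HDF5.

```python
def matches_declared_charset(raw: bytes, declared: str) -> bool:
    """Return True if `raw` is decodable under the declared HDF5 character set.

    Hypothetical helper for illustration only.
    """
    codec = {"H5T_CSET_ASCII": "ascii", "H5T_CSET_UTF8": "utf-8"}[declared]
    try:
        raw.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


raw = "Zoë".encode("utf-8")  # hypothetical value written by our library

print(matches_declared_charset(raw, "H5T_CSET_ASCII"))
print(matches_declared_charset(raw, "H5T_CSET_UTF8"))
```

Pure ASCII text passes under both declarations, which is probably why a missing UTF-8 flag can go unnoticed for a long time.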

Hello @mlarson and @jreadey,

Finally, I have had some time to test the changes related to UTF-8 support in both HSDS and REST VOL. After a small bug fix in REST VOL, I got it working. For our library we have several tests targeting a POSIX file system, which I converted to REST VOL, and they all pass. Currently, the functions H5Ocopy() and H5Lmove() are not yet supported, but fortunately these functions are not critical in our workflows.
Regarding the bug fix, I created a PR for it.

Thank you all for implementing UTF-8 support.


Glad to hear everything is (more or less) working!

Support for H5Ocopy and H5Lmove is on our radar, but it might be a couple of months before it’s ready.
