Still working on testing my library (built on top of HDF5) in combination with HSDS (see Fill value support - HSDS - HDF Forum (hdfgroup.org)). In at least one of my tests I use a fixed-size UTF-8 string. Writing this string as an attribute works fine, and as far as I can tell the string has been stored correctly in the JSON file. This is done using the VOL-REST API. Unfortunately, I am unable to read the string back using the VOL-REST API:
HDF5 REST VOL-DIAG: Error detected in HDF5 REST VOL (1.0.0) thread 1: #000: /home/user/projects/vol-rest/src/src/rest_vol_attr.c line 957 in RV_attr_read(): 400 - Malformed/Bad request for resource major: Attribute minor: Read failed
Making a local copy of the file using hsget also fails. After a small modification of the jsontoArray() function in the file base.py of the Python package h5pyd, I got it working. The contents of this local copy are what I expected. I also tried to store a UTF-8 string using h5pyd directly, but unfortunately this resulted in an error.
So my question is:
Is UTF-8 currently not supported in h5pyd, HSDS and/or VOL-REST? If so, are there any plans to properly support it?
Note: If you want, I can upstream my modified version of the jsontoArray() function. My modification is a bit clumsy because I do not know the exact details of the internal data structures. Anyway, it gives a starting point.
– Update –
Regarding storing a UTF-8 string using h5pyd directly, I got that working without any code change. I was not aware that np.str_ behaves differently from the other NumPy string types. When using np.str_ it works.
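For anyone who runs into the same thing, here is a minimal sketch of the distinction in plain NumPy. It assumes the other type involved was a NumPy byte-string type such as np.bytes_, which the update above does not actually state, and it does not touch h5pyd at all. The sample text is made up for illustration.

```python
import numpy as np

text = "Grüße"  # illustrative sample: 5 characters, 7 bytes once encoded as UTF-8

unicode_value = np.str_(text)                  # NumPy Unicode string scalar
bytes_value = np.bytes_(text.encode("utf-8"))  # NumPy byte-string scalar (raw bytes)

unicode_dtype = np.array(unicode_value).dtype
bytes_dtype = np.array(bytes_value).dtype

# Kind "U" is a Unicode string, kind "S" is a byte string with no encoding attached.
print(unicode_dtype.kind, bytes_dtype.kind)

# The Unicode scalar counts characters, the byte scalar counts encoded bytes.
print(len(unicode_value), len(bytes_value))
```

The byte-string value carries no information about its encoding, so a library receiving it has to guess the character set, whereas the np.str_ value is unambiguously Unicode.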
This is likely an issue on the REST VOL’s side - all string datatypes are treated as having an ASCII character set. Thanks for bringing this to my attention, I’ll make an issue for it on Github.
Hopefully you or someone else could also implement UTF-8 support in the hsget tool. Maybe the other tools have a similar limitation of not supporting UTF-8.
The other day, I noticed a few PRs about supporting UTF-8 in HSDS and VOL-REST. Please let me know if I need to test things or when it is ready to be used.
HSDS now supports fixed width UTF-8 encoded strings, though there hasn’t been a new release since the change was merged. The REST VOL’s main branch can read and write fixed-length UTF-8 to attributes and datasets, but the character set of the datatype will be considered ASCII. There’s a PR to the REST VOL which is still in review to set the character set of the datatype correctly. Hope this answers your question.
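To illustrate why the character set of the datatype matters for fixed-length strings, here is a rough sketch in plain Python. It is not how HSDS or the REST VOL implement this; the helper names encode_fixed_utf8 and decode_fixed, the null-padding convention, and the sample text are all assumptions made for the example.

```python
def encode_fixed_utf8(text: str, width: int) -> bytes:
    """Encode text as UTF-8 and null-pad it to exactly `width` bytes (hypothetical helper)."""
    raw = text.encode("utf-8")
    if len(raw) > width:
        # The width of a fixed-length field is counted in bytes, not characters.
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field holds {width}")
    return raw.ljust(width, b"\x00")


def decode_fixed(field: bytes, charset: str = "utf-8") -> str:
    """Strip the null padding and decode with the given character set (hypothetical helper)."""
    return field.rstrip(b"\x00").decode(charset)


field = encode_fixed_utf8("Grüße", 8)  # 5 characters, 7 UTF-8 bytes, 1 padding byte
print(len(field))
print(decode_fixed(field))

# Treating the same bytes as ASCII fails as soon as a non-ASCII byte appears.
try:
    decode_fixed(field, "ascii")
except UnicodeDecodeError as exc:
    print("ASCII decode failed:", exc.reason)
```

The same bytes round-trip fine when the reader knows they are UTF-8, and fail when the reader assumes ASCII, which is why a datatype that is labeled ASCII but holds UTF-8 data only works as long as nothing checks the character set.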
The UTF-8 support is in the master branch of HSDS. The corresponding REST VOL PR (Support UTF-8 string datatype encoding by mattjala · Pull Request #87 · HDFGroup/vol-rest · GitHub) is not merged yet, but you are welcome to try it out. There are a lot of edge cases to deal with, so some real world testing would be great.
The hsget support will be a while yet. Am planning to finish up this round of HSDS updates and then create a new h5pyd release that supports the new features.
Hello @mlarson and @jreadey,
Thanks for the update. I also need to update my own code since we use UTF-8 strings but we forgot to indicate this properly in the HDF5 file. So far this works for the native VOL since there is no check on the used character set. Looking at your recent changes such a check does exist for REST VOL and HSDS.