Unicode and string conversion in HDF5

Hi,

I'm curious as to how HDF5 treats character-set information during
type conversion. There are functions H5Tset_cset and H5Tget_cset in
the API. What happens if I try to read data defined as H5T_CSET_ASCII
into a buffer defined as H5T_CSET_UTF8, and vice-versa? What if the
buffer defined as ASCII contains characters > 127, but isn't UTF-8
compliant? I ask because in practice I've noticed that H5T_CSET_ASCII
seems to be used to indicate data of an unknown encoding.

Thanks,
Andrew

Hi Andrew,

test_utf8.c (2.14 KB)

On Apr 26, 2011, at 8:07 PM, Andrew Collette wrote:

Hi,

I'm curious as to how HDF5 treats character-set information during
type conversion. There are functions H5Tset_cset and H5Tget_cset in
the API. What happens if I try to read data defined as H5T_CSET_ASCII
into a buffer defined as H5T_CSET_UTF8, and vice-versa? What if the
buffer defined as ASCII contains characters > 127, but isn't UTF-8
compliant? I ask because in practice I've noticed that H5T_CSET_ASCII
seems to be used to indicate data of an unknown encoding.

  Sorry for the delay in replying; I wanted to verify the library's behavior, and it took a little while to find a gap in my schedule.

  Currently, the library neither converts the data nor fails the read/write operation when the two character sets differ: it treats the UTF-8 and ASCII string datatypes as identical (see the attached small test program, test_utf8.c).
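
  Since the library neither converts nor rejects the data in this case, an application that cares about the distinction has to check the bytes itself. Here is a minimal sketch in C of such a check. The helper names is_ascii and is_valid_utf8 are hypothetical; they are not part of the HDF5 API, and this is not the attached test_utf8.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an HDF5 function):
 * returns 1 if every byte in s[0..n) is 7-bit ASCII, else 0. */
int is_ascii(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (s[i] > 0x7F)
            return 0;
    return 1;
}

/* Hypothetical helper (not an HDF5 function):
 * returns 1 if s[0..n) is well-formed UTF-8, else 0.
 * Rejects truncated sequences, overlong encodings, UTF-16
 * surrogates, and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;
        unsigned long cp, min;

        if (c <= 0x7F) { i++; continue; }          /* single ASCII byte */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation or invalid lead byte */

        if (n - i < len)
            return 0;           /* sequence runs past the end of the buffer */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* expected a continuation byte (10xxxxxx) */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;           /* overlong, out of range, or surrogate */
        i += len;
    }
    return 1;
}
```

  With these helpers, the bytes 'c' 'a' 'f' 0xC3 0xA9 (UTF-8) pass is_valid_utf8 but not is_ascii, while 'c' 'a' 'f' 0xE9 (ISO-8859-1) fails both. The second case is the "bytes > 127 but not UTF-8" situation from the original question.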

  However, I'm inclined to change that behavior and have the conversion fail, so that application developers and users aren't surprised. Then, once we determine the correct behavior and can implement a bridge between the two character sets (at least from ASCII to UTF-8), we can enable the proper behavior.
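
  As a sketch only, here is one possible shape for the ASCII-to-UTF-8 direction of such a bridge. Every 7-bit ASCII byte is already its own UTF-8 encoding, so a strict conversion can copy the bytes unchanged and fail on any byte above 127. The function name and signature are hypothetical; this is not how HDF5 implements, or plans to implement, the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an HDF5 function).
 * Copies n bytes from src to dst if, and only if, every byte is 7-bit
 * ASCII. Returns 0 on success. On failure returns -1, leaves dst
 * untouched, and (if bad_index is non-NULL) stores the offset of the
 * first byte above 0x7F in *bad_index. */
int ascii_to_utf8_strict(const unsigned char *src, size_t n,
                         unsigned char *dst, size_t *bad_index)
{
    assert(src != NULL && dst != NULL);

    /* First pass: refuse anything that is not genuine ASCII, because
     * the encoding of such bytes is unknown and cannot be mapped. */
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0x7F) {
            if (bad_index != NULL)
                *bad_index = i;
            return -1;
        }
    }

    /* Second pass: ASCII bytes are valid UTF-8 as-is, so copy verbatim. */
    memcpy(dst, src, n);
    return 0;
}
```

  The reverse direction (UTF-8 to ASCII) could use the same check: it can only succeed when the UTF-8 data contains no multi-byte sequences, that is, no byte above 127.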

  How's that sound to people?

  Quincey

Hi All,
  I haven't heard any further feedback about the character set conversion issue, so I'm going to file an issue in our tracker to disable the conversion that's currently allowed, and we'll revisit it later.

  Quincey

On May 3, 2011, at 7:16 AM, Quincey Koziol wrote:
