string data with encoding problems

Hello,

I read and write strings in my HDF files to copy data between Matlab and my C++ code. I have some problems with ASCII codes greater than 127 in my files.

The dump of an HDF5 file (written by Matlab) shows:
GROUP "/" {
DATASET "data" {
   DATATYPE H5T_STRING {
         STRSIZE 2;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
   DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
   DATA {
   (0): "\37777777744", "\37777777766", "\37777777774", "\37777777737"
   }
}
}

The characters are "ä", "ö", "ü", "ß". My code writes the same characters from a string via string.c_str(), and the dump shows:
GROUP "/" {
DATASET "test" {
   DATATYPE H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
   DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
   DATA {
   (0): "\37777777703\37777777644", "\37777777703\37777777666",
   (2): "\37777777703\37777777674", "\37777777703\37777777637"
   }
}
}

It seems that my code creates 2 bytes for each of ä, ö, ü, ß, while Matlab creates 1 byte. Can I switch the encoding in the HDF5 file, or can I use Unicode or something else?

Thanks

Phil

Hello,

I read and write strings in my HDF files to copy data between Matlab and my C++ code. I have some problems with ASCII codes greater than 127 in my files.

Then they aren't ASCII codes, since ASCII only defines those
between 0 and 127, and I guess that's where the problems start...

The dump of an HDF5 file (written by Matlab) shows:
GROUP "/" {
DATASET "data" {
   DATATYPE H5T_STRING {
         STRSIZE 2;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
   DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
   DATA {
   (0): "\37777777744", "\37777777766", "\37777777774", "\37777777737"

These are basically 0xE4, 0xF6, 0xFC and 0xDF, which are
'ä', 'ö', 'ü' and 'ß' in, e.g., the ISO-8859-1 encoding. That
holds apart from the width of the values: they are printed as
32-bit (sign-extended) octal numbers, while a char would
typically have only 8 bits.

   }
}
}
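To check such a dump by hand, the escapes can be reduced to their low 8 bits. The following is only a minimal sketch under the assumption that each escape is a backslash followed by octal digits, as in the dump above; decode_dump_escapes is a hypothetical helper name, not part of HDF5 or h5dump.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn dump text such as "\37777777744" into bytes.
// Each backslash followed by octal digits is read as one number and only
// its low 8 bits are kept; all other characters are copied unchanged.
// Assumption: an escape is never directly followed by a literal digit.
std::vector<std::uint8_t> decode_dump_escapes(const std::string& text) {
    std::vector<std::uint8_t> bytes;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '\\') {
            std::uint64_t value = 0;
            std::size_t j = i + 1;
            while (j < text.size() && text[j] >= '0' && text[j] <= '7') {
                value = value * 8 + static_cast<std::uint64_t>(text[j] - '0');
                ++j;
            }
            if (j > i + 1) {  // at least one octal digit was consumed
                bytes.push_back(static_cast<std::uint8_t>(value & 0xFF));
                i = j;
                continue;
            }
        }
        bytes.push_back(static_cast<std::uint8_t>(text[i]));
        ++i;
    }
    return bytes;
}
```

Applied to the first Matlab value, this yields the single byte 0xE4.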

The characters are "ä", "ö", "ü", "ß". My code writes the same characters from a string via string.c_str(), and the dump shows:
GROUP "/" {
DATASET "test" {
   DATATYPE H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
   DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
   DATA {
   (0): "\37777777703\37777777644", "\37777777703\37777777666",
   (2): "\37777777703\37777777674", "\37777777703\37777777637"
   }

The first string consists of 0xC3 and 0xA4, which is the UTF-8
representation of 'ä' (I didn't check the rest, but the 0xC3 at the
start of each of them strongly suggests that they are all UTF-8).

}
}
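The relation between the two dumps can be reproduced with a small conversion. This is a sketch only, assuming the single-byte input really is ISO-8859-1; latin1_to_utf8 is a hypothetical helper name. Every ISO-8859-1 byte at or above 0x80 becomes two UTF-8 bytes.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
// Bytes below 0x80 are plain ASCII and are copied unchanged; bytes from
// 0x80 to 0xFF map to the code points U+0080..U+00FF, which UTF-8 encodes
// as the two bytes 110000xx 10xxxxxx.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8.push_back(static_cast<char>(c));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
        }
    }
    return utf8;
}
```

For 0xE4 ('ä') this gives 0xC3 0xA4, the byte pair seen in the second dump.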

It seems that my code creates 2 bytes for each of ä, ö, ü, ß, while Matlab
creates 1 byte. Can I switch the encoding in the HDF5 file, or can I use
Unicode or something else?

I don't know whether you can get Matlab to use UTF-8. Concerning the
HDF5 file, the question is how it is written. The program that
writes it seems to use UTF-8. Can you change that? If this is a
test program with the 'äöüß' hard-coded into it, you may just
have to get your editor to use ISO-8859-1. Writing to the HDF5
file isn't the point, it just contains what you told it to; the
problem is passing in the values you want it to contain.
                              Regards, Jens

···

On Tue, Apr 26, 2011 at 12:24:40PM +0200, Kraus Philipp wrote:
--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

I've tested the code with ISO-8859-1 (I changed the encoding of my source code, because the chars are hardcoded), and then the encoding is correct (C++ and Matlab). But my problem is that I read XML data with UTF-8 encoding (some Cyrillic chars, etc.). I would like to write them to the HDF file and read them into Matlab. Do you have any idea how to get a correct encoding?

If I read / write the char data in my code, I can set the correct encoding representation, but is there an option for saving the encoding into the file? The char code alone isn't unique: I can set UTF-8, ASCII, etc.

Thx

Phil
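Why the ISO-8859-1 route stops working for such data can be sketched as follows. This is an illustration only, with a hypothetical helper name utf8_to_latin1: UTF-8 input converts to ISO-8859-1 only while every character is at most U+00FF, so Cyrillic text is rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-8 to ISO-8859-1 where possible.
// Returns true and fills `latin1` if every character is <= U+00FF.
// Returns false for characters outside ISO-8859-1 (e.g. Cyrillic) and for
// malformed input; `latin1` then holds only a partial result.
bool utf8_to_latin1(const std::string& utf8, std::string& latin1) {
    latin1.clear();
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            latin1.push_back(static_cast<char>(lead));  // plain ASCII
        } else if (lead == 0xC2 || lead == 0xC3) {
            // Two-byte sequences for U+0080..U+00FF start with 0xC2 or 0xC3.
            if (i + 1 >= utf8.size()) return false;
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
            latin1.push_back(static_cast<char>(((lead & 0x03) << 6) | (next & 0x3F)));
            ++i;
        } else {
            return false;  // above U+00FF or malformed
        }
    }
    return true;
}
```

With this sketch, "ä" (0xC3 0xA4) converts to 0xE4, while a Cyrillic letter such as U+0416 (0xD0 0x96) is reported as not representable.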

···

On 26.04.2011 at 12:52, Jens Thoms Toerring wrote:

Hi all,

···

On Apr 26, 2011, at 6:12 AM, Kraus Philipp wrote:

If I read / write the char data in my code, I can set the correct encoding representation, but is there an option for saving the encoding into the file? The char code alone isn't unique: I can set UTF-8, ASCII, etc.

  Would the H5Tset_cset() routine (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5T.html#Datatype-SetCset) help out here?

    Quincey

Thanks, but if I use this call:

H5::StrType l_str(0, p_strlen+1);
l_str.setCset( H5T_CSET_UTF8 );

the dump of a file shows:
       DATATYPE H5T_STRING {
             STRSIZE 29;
             STRPAD H5T_STR_NULLTERM;
             CSET H5T_CSET_ASCII;
             CTYPE H5T_C_S1;
          }

No exception is thrown, nor is any other error reported. It seems that the string object does not set H5T_CSET_UTF8.

Phil
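A side note on the size argument: the HDF5 string size counts bytes, not characters, which matches the first dumps (STRSIZE 2 for the one-byte values, STRSIZE 3 for the two-byte ones). The sketch below contrasts the two counts for UTF-8 data; both helper names are hypothetical, and it is only an assumption that p_strlen above might be a character count rather than a byte count.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string by
// skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes well-formed input; no validation is done.
std::size_t utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: bytes needed for one fixed-length, NUL-terminated
// string element, i.e. the byte length plus one for the terminator.
std::size_t nullterm_size(const std::string& bytes) {
    return bytes.size() + 1;
}
```

For "ä" in UTF-8 the code point count is 1, but the element size is 3 bytes.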

···

On 26.04.2011 at 15:34, Quincey Koziol wrote:

Hi Phil,

···

On Apr 26, 2011, at 2:58 PM, Kraus Philipp wrote:

No exception is thrown, nor is any other error reported. It seems that the string object does not set H5T_CSET_UTF8.

  Hmm, does the same thing happen with a simple program in C? (I'm looking to determine if the error is in the C++ wrapper, or possibly in the h5dump tool)

  Thanks,
    Quincey