How to read Variable String with UTF8 encoding


#1

I am attempting to read a data set with the following data type:

String,
Length = variable
Padding = H5T_STR_NULLTERM
Cset = H5T_CSET_UTF8

… according to HDFView version 3

The actual data as displayed by HDFView is “Aluminum”, but when I attempt to read it I get 5 bytes' worth of junk.

Here is the code that I used to read the data:

inline herr_t readStringDataset(hid_t locationID, const std::string& datasetName, char* data)
{
  H5SUPPORT_MUTEX_LOCK()

  hid_t datasetID; // dataset id
  hid_t typeID;    // type id
  herr_t error = 0;
  herr_t returnError = 0;

  datasetID = H5Dopen(locationID, datasetName.c_str(), H5P_DEFAULT);
  if(datasetID < 0)
  {
    std::cout << "H5Lite.cpp::readStringDataset(" << __LINE__ << ") Error opening Dataset at locationID (" << locationID << ") with object name (" << datasetName << ")" << std::endl;
    return -1;
  }
  typeID = H5Dget_type(datasetID);
  if(typeID >= 0)
  {
    error = H5Dread(datasetID, typeID, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    if(error < 0)
    {
      std::cout << "Error Reading string dataset." << std::endl;
      returnError = error;
    }
    CloseH5T(typeID, error, returnError);
  }
  CloseH5D(datasetID, error, returnError, datasetName);
  return returnError;
}

Any help would be appreciated.


Mike Jackson


#2

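Mike, a dataset of variable-length (vlen) strings is represented in memory as an array of pointers (char**). I believe your data argument has the wrong type: H5Dread fills in one char* per element rather than copying characters into your buffer, so make sure to size that pointer array to the number of elements in the dataset. G.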


#3

The issue turned out to be that we were not querying the character set from the dataset that was going to be read. Once we queried that value and set it on the datatype used for the read (the character set is a property of a datatype, not of the dataset ID), reading the string data started working correctly.


#4

Interesting. Since ASCII is a proper subset of UTF-8, for something like “Aluminum” I would not have expected any difference between the byte representations of the ASCII- and UTF-8-encoded variants. Off the top of my head, I don’t know how H5Dread behaves if you try to read a dataset of UTF-8-encoded strings into a memory buffer typed as ASCII-encoded strings. For vlen strings, the library may just give you the bytes and you are free to misinterpret them. For fixed-length strings, there is a potential truncation issue for multi-byte code points. I don’t know if the library just copies what fits, or if that would be treated as a datatype conversion error. G.
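To make the two byte-level points above concrete, here is a small standalone sketch with no HDF5 calls in it. It says nothing about what the library actually does on a fixed-length read; it only illustrates why pure-ASCII text has the same bytes under both encodings, and how a caller could pick a cut point that does not split a multi-byte code point. The function names isPureAscii and utf8SafePrefixLength are made up for this example, and the second one assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if every byte is below 0x80. Such text is pure ASCII, so its ASCII and
// UTF-8 encodings are byte-for-byte identical.
inline bool isPureAscii(const std::string& bytes)
{
  for(unsigned char c : bytes)
  {
    if(c >= 0x80)
    {
      return false;
    }
  }
  return true;
}

// Returns the largest prefix length <= maxBytes that does not end in the
// middle of a multi-byte UTF-8 sequence. Assumes `bytes` is valid UTF-8.
inline std::size_t utf8SafePrefixLength(const std::string& bytes, std::size_t maxBytes)
{
  if(bytes.size() <= maxBytes)
  {
    return bytes.size(); // everything fits, no truncation needed
  }
  std::size_t n = maxBytes;
  // UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first byte
  // that would be cut off is a continuation byte, the cut splits a code point,
  // so back up until the cut lands on the start of a code point.
  while(n > 0 && (static_cast<unsigned char>(bytes[n]) & 0xC0) == 0x80)
  {
    --n;
  }
  assert(n <= maxBytes); // the cut never moves forward
  return n;
}
```

For example, "Aluminum" passes isPureAscii, while a string ending in a two-byte character such as "é" (bytes 0xC3 0xA9) does not, and cutting it one byte short would leave a dangling lead byte unless the cut is moved back as above.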