Unicode string attributes using HDF.PInvoke in C# and h5py


#1

Hi everyone,

I am writing string attributes to an HDF5 file in C# using HDF.PInvoke, and I then need to read them in an h5py application.

First I used something along these lines:

string attributeValueString = "sometext";
hid_t stringId = H5T.copy(H5T.C_S1);
int stringLength = attributeValueString.Length + 1;
H5T.set_size(stringId, new IntPtr(stringLength));
H5T.set_strpad(stringId, H5T.str_t.NULLTERM);
                
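// Encode with an explicit terminator so the buffer is exactly stringLength bytes.
byte[] textAscii = Encoding.ASCII.GetBytes(attributeValueString + "\0");
attributeId = H5A.create(parent.Id, attributeName, stringId, H5S.create(H5S.class_t.SCALAR));
// Memory type: one byte per character (sizeof(char) is 2 in C#, which overstated the buffer size).
type = H5T.create(H5T.class_t.STRING, new IntPtr(stringLength));
attributeClass = H5T.class_t.STRING;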

try
{
    handle = GCHandle.Alloc(textAscii, GCHandleType.Pinned);
    status = H5A.write(attributeId, type, handle.AddrOfPinnedObject());
}
finally
{
    handle.Free();
}

This worked fine for ASCII strings (and I can also read these attributes from h5py), but it did not suit our needs, since we also want to store some non-ASCII characters, e.g. åäö. So I tried encoding the string as UTF-8, since that seems to be what HDF5 can handle. This produced weird characters, and after some trial and error I found a hack via UTF-32 that works, at least for the characters we are interested in (it will not work for, e.g., Japanese characters, though):

string attributeValueString = "sometext";

// Horribly ugly hack which seems to work:
byte[] textUnicodeTemp = Encoding.UTF32.GetBytes(attributeValueString);
byte[] textUnicode = new byte[textUnicodeTemp.Length / 4];
for (int i = 0; i < textUnicodeTemp.Length/4; i++)
{
	textUnicode[i] = textUnicodeTemp[4 * i];
}

type = H5T.create(H5T.class_t.STRING, new IntPtr(textUnicode.Length));
int result = H5T.set_cset(type, H5T.cset_t.UTF8);
H5T.set_strpad(type, H5T.str_t.NULLTERM);

attributeId = H5A.create(parent.Id, attributeName, type, H5S.create(H5S.class_t.SCALAR));
attributeClass = H5T.class_t.STRING;

try
{
	handle = GCHandle.Alloc(textUnicode, GCHandleType.Pinned);
	status = H5A.write(attributeId, type, handle.AddrOfPinnedObject());
}
finally
{
	handle.Free();
}
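
As a side note on what the loop above does to the bytes, here is a minimal Python sketch (the function name low_bytes_of_utf32 is made up for illustration). It assumes that Encoding.UTF32.GetBytes yields little-endian code units without a byte order mark. Keeping only the first byte of each 4-byte unit gives the Latin-1 encoding for code points below 256, which is not the same as UTF-8, even though the type is tagged with H5T.cset_t.UTF8:

```python
def low_bytes_of_utf32(text: str) -> bytes:
    """Mimic the C# loop: keep the first byte of every 4-byte UTF-32LE unit."""
    utf32 = text.encode("utf-32-le")  # little-endian, no byte order mark
    return bytes(utf32[4 * i] for i in range(len(utf32) // 4))


sample = "åäö"
hacked = low_bytes_of_utf32(sample)

# For code points below 256 the result equals the Latin-1 encoding.
print(hacked == sample.encode("latin-1"))  # True

# Real UTF-8 needs two bytes for each of these characters.
print(hacked.hex())                  # e5e4f6
print(sample.encode("utf-8").hex())  # c3a5c3a4c3b6

# Higher code points are truncated: U+65E5 keeps only its low byte 0xE5,
# the same byte that "å" produces.
print(low_bytes_of_utf32("日") == low_bytes_of_utf32("å"))  # True
```

This would also be consistent with the hack breaking down for Japanese text: any code point above 255 loses its upper bytes.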

When opening the HDF5 file in HDFView, the attribute values are correct, including the Swedish characters åäö. However, when trying to read the file with h5py, e.g. with:

    import h5py

    def read_attribute(filename):
        with h5py.File(filename, "r") as f:
            return f["/path/to/dataset"].attrs["attribute-name"]

I get a

"Unable to read attribute (no appropriate function for conversion path)"

I have done some googling and found out that this is because h5py does not accept fixed-length Unicode strings. Is there a way to write UTF-8 or other Unicode text with C#/HDF.PInvoke that will be readable from h5py?


#2

The library does not check the character set at all (see H5Pset_char_encoding). So you could use H5T_CSET_ASCII; h5py then returns the value as a byte array, which you could try to decode as UTF-8.
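
On the h5py side, that suggestion could look roughly like the sketch below. It assumes the attribute was written as a fixed-length string tagged H5T_CSET_ASCII whose bytes are actually UTF-8, and that h5py hands it back as a bytes-like value (numpy.bytes_). The helper names decode_text_attr and read_text_attr are made up for illustration:

```python
def decode_text_attr(raw, encoding="utf-8"):
    """Turn a raw attribute value into str; bytes are decoded, str passes through."""
    if isinstance(raw, bytes):  # numpy.bytes_ is a subclass of bytes
        # Drop any NUL padding before decoding.
        return raw.rstrip(b"\x00").decode(encoding)
    return raw


def read_text_attr(filename, object_path, attr_name):
    """Hypothetical helper: read one attribute with h5py and decode it."""
    import h5py  # imported here so decode_text_attr is usable without h5py

    with h5py.File(filename, "r") as f:
        return decode_text_attr(f[object_path].attrs[attr_name])
```

For example, read_text_attr(filename, "/path/to/dataset", "attribute-name") would return a Python str, provided the stored bytes really are valid UTF-8; otherwise the decode step raises UnicodeDecodeError.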