String Structure Setup


#1

Hi,
I’m having trouble reading my data in pandas, so I’m trying to replicate an H5 structure that pandas can read.

At the moment pandas reads my strings like this: b"test\x00\x00\x00\x00’\x0bv\xdbR\xa5… and I don’t want the padding, I just want to see ‘test’

My current setup with the C lib is this:

    hid_t atype = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, 36);
    H5Tset_strpad(atype, H5T_STR_NULLTERM);
    H5Tset_cset(atype, H5T_CSET_ASCII);  /* H5Pset_char_encoding expects a property list, not a datatype */

with the H5 structure looking like this:

    DATASET "testdata" {
        DATATYPE  H5T_COMPOUND {
            H5T_STRING {
                STRSIZE 36;
                STRPAD H5T_STR_NULLTERM;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
            } "name";
        }
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        DATA {
        (0): "test", "one", "two", "three"
        }
    }

However, I want the H5 structure to look like this, which pandas can interpret and display without padding:

    DATASET "testdata" {
        DATATYPE  H5T_STRING {
            STRSIZE 36;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
        }
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        DATA {
        (0): "test", "one", "two", "three"
        }
    }

Does anyone know how I can set up my H5 structure to mimic this? It would appear that the string type is its own datatype. How would I do that with the library?

All the best

/P


#2

Paul, I’m not sure if this is a pandas question or an HDF5 question. A quick recap:
By using a fixed-length string datatype you are instructing HDF5 to allocate (in memory and in the file) 36 bytes per ASCII string. For strings that have fewer “useful” characters, some kind of excess or background will be inevitable. You have three options to control that (H5T_STR_NULLTERM, H5T_STR_NULLPAD, H5T_STR_SPACEPAD). How pandas handles this background I don’t know, but I’m sure these guys have some documentation for that. If you don’t want any kind of background, variable-length strings are the other option. And that’s really all there is to it.
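To make the "background" point concrete, here is a minimal sketch in plain C (no HDF5 calls). It rests on an assumption: the stray bytes after the NULs in the dump above look like leftover memory from a 36-byte buffer that was not cleared before writing, though the writing code is not shown, so this is only a guess. The helper names fill_fixed_field and field_text_length are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_WIDTH 36  /* same width as the STRSIZE in the original post */

/* Copy src into a fixed-width field and zero every byte after the text,
 * so the "background" is all NULs rather than leftover memory.
 * Text longer than FIELD_WIDTH - 1 characters is truncated. */
void fill_fixed_field(char dst[FIELD_WIDTH], const char *src)
{
    memset(dst, 0, FIELD_WIDTH);
    strncpy(dst, src, FIELD_WIDTH - 1);
}

/* Length of the useful text in a field: everything before the first NUL,
 * or the full width if the field contains no NUL at all. */
size_t field_text_length(const char field[FIELD_WIDTH])
{
    const char *nul = memchr(field, '\0', FIELD_WIDTH);
    return nul ? (size_t)(nul - field) : (size_t)FIELD_WIDTH;
}
```

If the buffers handed to the write call are filled this way, whatever sits behind the terminator is at least predictable. A reader can then cut each value at the first NUL, which is what field_text_length does.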

BTW, I don’t understand your example. Your dataset contains a single element, but your data dump shows four elements. Be that as it may, it appears that you want to get rid of the single “name”-field compound datatype. That’s fine. Just make the string datatype the dataset’s datatype and off you go. OK?

G.