HDF5: how to set H5T_STRING to use H5T_STR_SPACEPAD using h5py?

I use a commercial tool, written in Fortran, that creates an HDF5 results file.
This results file can then be visualized in a related tool written in C++ and Java,
possibly linked to the same Fortran library.

I have reverse-engineered most of the HDF5 format so that I can visualize my own data.
Unfortunately the visualization tool doesn’t display my own data, but gives no error.

According to h5dump, the only systematic difference between the two files is the
STRPAD option of strings used for both attribute and dataset definitions.

The original file, produced by someone else’s Fortran code, has entries like:

    DATATYPE  H5T_STRING {
        STRSIZE 100;
        STRPAD H5T_STR_SPACEPAD;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
    }

and my own file, produced using Python 3.6.8 and h5py 2.9.0, has:

    DATATYPE  H5T_STRING {
        STRSIZE 100;
        STRPAD H5T_STR_NULLPAD;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
    }

The h5py Low-Level API documentation (h5py 3.12.1) appears
to say that I could use class h5py.h5t.TypeStringID / set_strpad() but I have
so far failed to work out how.

I experimented with variations of things in the “Strings in HDF5” section of
the online docs above, and also in the “Python and HDF5” book by A. Collette,
but without success, and temporarily went back to the incorrect but simpler:

    grp.attrs['Description'] = numpy.bytes_("%-100s" % "Description")

Can someone provide an example of the correct way to set the Description
attribute so that it uses H5T_STR_SPACEPAD either via ‘standard’ h5py or
using the low level API?

UPDATE:

I’ve solved most of it, with one remaining problem…

The question above has been “on hold” in Draft mode after the Forum system
suggested that the topic is similar to the following article which I had to check:

As a result, I can replace my old High-Level API code such as the following:

    hdf = h5py.File("test.h5", "w")
    hdf.attrs['Title'] = numpy.array(numpy.bytes_("%-24s" % "Introduction"), ndmin=1)

with the Low-Level API code at the start of the following and get the STR_SPACEPAD
for the “Title” attribute, but I still get STR_NULLPAD for the fields of the Compound Type:

    hdf = h5py.File("testing.h5", "w")

    ascii24 = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
    ascii24.set_size(24)
    ascii24.set_strpad(h5py.h5t.STR_SPACEPAD)
    dataspace = h5py.h5s.create_simple((1, ), (1, ))

    attribute = h5py.h5a.create(hdf.id, "Title".encode("ascii"), ascii24, dataspace)
    attribute.write(numpy.array(numpy.bytes_("%-24s" % "Introduction")))

    person_type = h5py.h5t.create(h5py.h5t.COMPOUND, 48)
    person_type.insert("firstName".encode("ascii"), 0, ascii24)
    person_type.insert("lastName".encode("ascii"), 24, ascii24)

    people = hdf.create_dataset("people", shape=(1,), maxshape=(1,), dtype=person_type)
    people[0, "firstName"] = numpy.array(numpy.bytes_("%-24s" % "Abraham"))
    people[0, "lastName"] = numpy.array(numpy.bytes_("%-24s" % "Lincoln"))

which results in the following h5dump testing.h5 listing:

    HDF5 "testing.h5" {
    GROUP "/" {
       ATTRIBUTE "Title" {
          DATATYPE  H5T_STRING {
             STRSIZE 24;
             STRPAD H5T_STR_SPACEPAD;               // This is what I want
             CSET H5T_CSET_ASCII;
             CTYPE H5T_C_S1;
          }
          DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
          DATA {
          (0): "Introduction            "
          }
       }
       DATASET "people" {
          DATATYPE  H5T_COMPOUND {
             H5T_STRING {
                STRSIZE 24;
                STRPAD H5T_STR_NULLPAD;                // This should be H5T_STR_SPACEPAD
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             } "firstName";
             H5T_STRING {
                STRSIZE 24;
                STRPAD H5T_STR_NULLPAD;                // This should be H5T_STR_SPACEPAD
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             } "lastName";
          }
          DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
          DATA {
          (0): {
                "Abraham                 ",
                "Lincoln                 "
             }
          }
       }
    }
    }

Where am I still going wrong when creating or assigning the Compound Type fields?

h5py seems to always save the padding of a string as NULLPAD, regardless of what the application sets. Here’s a minimal example that uses a string datatype directly instead of as part of a compound:

    import h5py

    with h5py.File("testing.h5", "w") as f:
        ascii24 = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
        ascii24.set_size(24)
        ascii24.set_strpad(h5py.h5t.STR_SPACEPAD)

        print("Strpad: ", ascii24.get_strpad()) # 2 = SPACEPAD

        dset = f.create_dataset("dset", shape=(1,),
                                maxshape=(1,), dtype=ascii24)
        # Read back the datatype that was actually stored in the file
        type_id = dset.id.get_type()
        print("Strpad: ", type_id.get_strpad()) # 1 = NULLPAD

Output:

    Strpad:  2
    Strpad:  1
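
A possible explanation, offered as an assumption rather than confirmed behaviour: the high-level create_dataset() appears to translate the dtype argument into a numpy dtype before building the HDF5 type, and a numpy 'S24' dtype has no notion of padding. The sketch below only checks that a round trip through numpy loses the setting; the variable names are mine, and h5py.h5t.py_create is the low-level helper that turns a numpy dtype back into an HDF5 type.

```python
import h5py

# Same type as in the example above: 24-byte ASCII string, space padded.
ascii24 = h5py.h5t.C_S1.copy()
ascii24.set_size(24)
ascii24.set_strpad(h5py.h5t.STR_SPACEPAD)

# Convert to a numpy dtype and back to an HDF5 type.
as_numpy = ascii24.dtype
round_tripped = h5py.h5t.py_create(as_numpy)

strpad_before = ascii24.get_strpad()
strpad_after = round_tripped.get_strpad()
print(as_numpy, strpad_before, strpad_after)
```

If the two STRPAD values differ, the padding is being dropped in the numpy conversion rather than by the HDF5 library itself.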

Using the HDF5 C API directly to do the same thing, the proper string padding can be retrieved:

    #include <stdio.h>
    #include "hdf5.h"

    int main() {
        hid_t file_id = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t string_type = H5Tcopy(H5T_C_S1);
        H5Tset_size(string_type, 6);
        printf("H5Tget_strpad: %d\n", H5Tget_strpad(string_type)); // 0 = NULLTERM
        H5Tset_strpad(string_type, H5T_STR_SPACEPAD);
        printf("H5Tget_strpad: %d\n", H5Tget_strpad(string_type)); // 2 = SPACEPAD
        hid_t dataspace_id = H5Screate(H5S_SCALAR);
        hid_t dataset_id = H5Dcreate(file_id, "string", string_type, dataspace_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        /* Query the datatype as stored with the dataset */
        hid_t ret_dtype_id = H5Dget_type(dataset_id);
        printf("H5Tget_strpad: %d\n", H5Tget_strpad(ret_dtype_id)); // 2 = SPACEPAD

        return 0;
    }

Output:

    H5Tget_strpad: 0
    H5Tget_strpad: 2
    H5Tget_strpad: 2

If it’s possible for your use case, it looks like using the C library directly would resolve the differences between files.
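
For anyone who needs to stay in Python, one thing that may be worth trying is to create the dataset through the low-level h5py.h5d.create(), which takes the HDF5 type object directly, in the same way that h5py.h5a.create() did for the “Title” attribute in the question. This is a sketch under that assumption, not a confirmed fix: the function names write_people and read_field_strpads are made up for illustration, the record is written in one go from a numpy structured array, and I have not checked the result against the visualization tool.

```python
import os
import tempfile

import h5py
import numpy


def write_people(path):
    """Write a one-row 'people' dataset with space-padded string fields."""
    # Fixed-length 24-byte ASCII string type with SPACEPAD, as in the thread.
    ascii24 = h5py.h5t.C_S1.copy()
    ascii24.set_size(24)
    ascii24.set_strpad(h5py.h5t.STR_SPACEPAD)

    # Compound type built from HDF5 type objects, not from a numpy dtype.
    person_type = h5py.h5t.create(h5py.h5t.COMPOUND, 48)
    person_type.insert(b"firstName", 0, ascii24)
    person_type.insert(b"lastName", 24, ascii24)

    # In-memory record; numpy 'S24' fields are null padded.
    record = numpy.array([(b"Abraham", b"Lincoln")],
                         dtype=[("firstName", "S24"), ("lastName", "S24")])

    with h5py.File(path, "w") as hdf:
        space = h5py.h5s.create_simple((1,), (1,))
        # Low-level create: the type object is handed to HDF5 as it is.
        dset = h5py.h5d.create(hdf.id, b"people", person_type, space)
        # HDF5 converts the in-memory fields to the file type on write.
        dset.write(h5py.h5s.ALL, h5py.h5s.ALL, record)


def read_field_strpads(path):
    """Return the STRPAD code of each member of the 'people' compound type."""
    with h5py.File(path, "r") as hdf:
        file_type = hdf["people"].id.get_type()
        return [file_type.get_member_type(i).get_strpad()
                for i in range(file_type.get_nmembers())]


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "testing.h5")
    write_people(target)
    print(read_field_strpads(target))  # compare with h5py.h5t.STR_SPACEPAD
```

Running h5dump on the resulting file is the real check, since that is what was used to compare the two files in the first place.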

Thanks for your analysis [is it a bug or a feature?] and the C code example which is really helpful.

I had a sneaking suspicion that I wouldn’t be able to do what I wanted in h5py, High or Low Level API, and that I would be obliged to get closer to the bare metal by using C. I use C++ and I’m a bit rusty in C, and obviously haven’t checked out the C API yet, but I think I’ll get there.

Thanks again.

I suspect it’s a bug in h5py that results from how numpy stores strings internally, but I can’t say for sure. Glad this helped.

Just to close the circle and provide the last bit of the puzzle in the
hope that it might help some other confused person in the future…

After a bit of going backwards and forwards in the documentation,
where H5T_COMPOUND is introduced but not used in the section
dealing with hyperslab selection, and finally having the light bulb
moment of writing a complete Person struct in one go rather than
trying to update the firstName and lastName fields separately,
I came up with this, which seems to do the final part of the job.

There’s no error checking, nothing is ‘closed’ after use, and
ensuring that the Person fields are space padded and don’t
contain the undesired terminating null byte(s) is a bit clunky [*]:

    #include <stdio.h>
    #include <string.h>
    #include "hdf5.h"

    int main(int argc, char* argv[])
    {
        herr_t status;
        hid_t  file_id  = H5I_INVALID_HID;

        char buffer32[32];  /* bigger than ascii24 ! */
        hid_t ascii24 = H5Tcopy(H5T_C_S1);
        H5Tset_size(ascii24, 24);
        H5Tset_strpad(ascii24, H5T_STR_SPACEPAD);

        file_id = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        hid_t title_ds = H5Screate(H5S_SCALAR);
        hid_t title_attr = H5Acreate(file_id, "Title", ascii24, title_ds,
                                     H5P_DEFAULT, H5P_DEFAULT);

        snprintf(buffer32, sizeof(buffer32), "%-24s", "Introduction");
        status = H5Awrite(title_attr, ascii24, (const void *)buffer32);

        typedef struct {
            char firstName[24]; /* ascii24 */
            char lastName[24];  /* ascii24 */
        } Person_t;

        hid_t person_id = H5Tcreate(H5T_COMPOUND, sizeof(Person_t));
        H5Tinsert(person_id, "firstName", HOFFSET(Person_t, firstName), ascii24);
        H5Tinsert(person_id, "lastName", HOFFSET(Person_t, lastName), ascii24);

        Person_t people[1];
        hsize_t people_dims[2];
        people_dims[0] = 1;
        people_dims[1] = 1;
        hid_t people_ds = H5Screate_simple(2, people_dims, NULL);

        hid_t dataset = H5Dcreate(
                file_id, "people", person_id, people_ds,
                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* avoid copying terminating null byte into fields */
        snprintf(buffer32, sizeof(buffer32), "%-24s", "Benjamin");
        strncpy(people[0].firstName, buffer32, sizeof(people[0].firstName));
        snprintf(buffer32, sizeof(buffer32), "%-24s", "Franklin");
        strncpy(people[0].lastName, buffer32, sizeof(people[0].lastName));

        status = H5Dwrite(dataset, person_id, people_ds, H5S_ALL, H5P_DEFAULT, people);

        return 0;
    }

[*] EDIT: In my slightly more complicated real world code, I found that I needed to use
memcpy rather than strncpy to have even more precise control over what is copied.
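
The same concern (each field exactly 24 bytes, space padded, no terminating null) can be illustrated in Python with the standard library alone. This is only a sketch of the byte layout, not part of the C program above; space_pad and pack_person are hypothetical helper names.

```python
import struct


def space_pad(text, width):
    """Encode text as ASCII and right-pad with spaces to exactly width bytes."""
    raw = text.encode("ascii")
    if len(raw) > width:
        raise ValueError(f"{text!r} does not fit in {width} bytes")
    return raw.ljust(width, b" ")


def pack_person(first_name, last_name, width=24):
    """Concatenate two space-padded fields into one fixed-size record."""
    # Each field is already exactly `width` bytes long, so the 's' format
    # copies it verbatim and adds no null padding.
    return struct.pack(f"{width}s{width}s",
                       space_pad(first_name, width),
                       space_pad(last_name, width))


record = pack_person("Benjamin", "Franklin")
print(len(record))        # 48
print(b"\x00" in record)  # False
```

Rejecting over-long input up front, rather than silently truncating it, is a design choice of this sketch; the C code relies on the size argument of strncpy or memcpy instead.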

And of course, here’s the h5dump of the resulting test.h5 file showing that
the firstName and lastName fields of the people H5T_COMPOUND dataset also
have STRPAD set to H5T_STR_SPACEPAD, which I couldn’t achieve using h5py.

    HDF5 "test.h5" {
    GROUP "/" {
       ATTRIBUTE "Title" {
          DATATYPE  H5T_STRING {
             STRSIZE 24;
             STRPAD H5T_STR_SPACEPAD;
             CSET H5T_CSET_ASCII;
             CTYPE H5T_C_S1;
          }
          DATASPACE  SCALAR
          DATA {
          (0): "Introduction            "
          }
       }
       DATASET "people" {
          DATATYPE  H5T_COMPOUND {
             H5T_STRING {
                STRSIZE 24;
                STRPAD H5T_STR_SPACEPAD;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             } "firstName";
             H5T_STRING {
                STRSIZE 24;
                STRPAD H5T_STR_SPACEPAD;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             } "lastName";
          }
          DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
          DATA {
          (0,0): {
                "Benjamin                ",
                "Franklin                "
             }
          }
       }
    }
    }