NULLPAD & NULLTERM strings


#1

Hello all,

This is a re-ask of a question with the same title from a couple of years ago which went unanswered.

I need to write fixed-length strings with a padding of H5T_STR_NULLTERM. I have tried both the high- and low-level APIs, but I’m unable to get the desired output. I suspect I’m missing something obvious; it’s just not obvious to me :man_shrugging:.

What I want to get:

HDF5 "desired.hdf" {
GROUP "/" {
   GROUP "grp1" {
      ATTRIBUTE "A1" {
         DATATYPE  H5T_STRING {
            STRSIZE 1;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "5", "1", "2"
         }
      }
   }
}
}

With the high-level API, the fixed-length strings get a STRPAD value of H5T_STR_NULLPAD, and I cannot find a way to change that.

With the low-level API, the padding is correct, but the strings aren’t written correctly:

HDF5 "low_level.hdf" {
GROUP "/" {
   GROUP "grp1" {
      ATTRIBUTE "A1" {
         DATATYPE  H5T_STRING {
            STRSIZE 1;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "", "", ""
         }
      }
   }
}
}

Setup: OSX Catalina 10.15.7, Python 3.7.0, h5py 3.2.1

Code to reproduce:

import numpy as np
import h5py

x = 512
x_arr = np.frombuffer(str(x).encode("UTF-8"), dtype="|S1")
attribute_name = 'A1'

type_id = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
type_id.set_size(1)
type_id.set_strpad(h5py.h5t.STR_NULLTERM)
space = h5py.h5s.create_simple((len(x_arr),))

file_name = 'low_level.hdf'
with h5py.File(file_name, "a") as f:
    grp = f.create_group('grp1')
    aid = h5py.h5a.create(grp.id, attribute_name.encode('utf-8'), type_id, space)
    aid.write(x_arr)

file_name = 'high_level.hdf'
with h5py.File(file_name, "a") as f:
    grp = f.create_group('grp1')
    grp.attrs[attribute_name] = x_arr

Thanks


#2

When you use the low-level API, I think HDF5 ends up converting the null-padded strings in your NumPy array to null-terminated strings in the file. That means it adds a trailing null to terminate each string. But you’ve only given it one byte per string, so they’re all truncated to empty strings.

One possible fix is to leave space for the null terminator (1 extra byte per string). So if each string is 1 character:

type_id.set_size(2)

The docs for H5Tset_size say that:

The size set for a string datatype should include space for the null-terminator character, otherwise it will not be stored on (or retrieved from) disk.
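With that change, your low-level snippet can be adapted like this (a sketch; the file name `fixed.hdf` is arbitrary):

```python
import numpy as np
import h5py

x_arr = np.frombuffer(b"512", dtype="|S1")  # three 1-byte strings

# 2-byte null-terminated fixed-length string type: 1 character + terminator
type_id = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
type_id.set_size(2)
type_id.set_strpad(h5py.h5t.STR_NULLTERM)
space = h5py.h5s.create_simple((len(x_arr),))

with h5py.File("fixed.hdf", "w") as f:
    grp = f.create_group("grp1")
    aid = h5py.h5a.create(grp.id, b"A1", type_id, space)
    # memory type is inferred from x_arr (|S1); HDF5 converts to the
    # 2-byte file type and appends the null terminator itself
    aid.write(x_arr)
```

Reading the attribute back through the high-level API should then give b'5', b'1', b'2'.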

Your ‘what I want’ example has ‘null terminated’ strings without any actual termination. That’s probably not great - this kind of thing can trigger buffer overflow bugs. But if the file is going to be read by something you can’t change, and it absolutely insists on ‘null terminated’ strings without nulls, you don’t have much choice. The trick is to tell HDF5 that the data in memory is already in the format you want, so it doesn’t perform any conversion:

aid.write(x_arr, mtype=aid.get_type())

mtype is short for ‘memory datatype’, i.e. you’re claiming that the raw data in the NumPy array matches the given HDF5 datatype. Be careful when using this: h5py doesn’t check the claim against the array, so you can get garbled data or even segfaults if you claim something that’s not true.
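If you do need the terminator-less layout, the complete low-level write looks like this (a sketch; `low_level_fixed.hdf` is an arbitrary name):

```python
import numpy as np
import h5py

x_arr = np.frombuffer(b"512", dtype="|S1")

# size-1 null-terminated string type, as in the original question
type_id = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
type_id.set_size(1)
type_id.set_strpad(h5py.h5t.STR_NULLTERM)
space = h5py.h5s.create_simple((len(x_arr),))

with h5py.File("low_level_fixed.hdf", "w") as f:
    grp = f.create_group("grp1")
    aid = h5py.h5a.create(grp.id, b"A1", type_id, space)
    # claim the in-memory bytes already match the file datatype,
    # so HDF5 copies them verbatim instead of converting (and truncating)
    aid.write(x_arr, mtype=aid.get_type())
```

h5dump on the result should show STRPAD H5T_STR_NULLTERM with DATA "5", "1", "2", matching your desired output.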


#3

Thanks for the quick response.

Looks like the second, unrecommended option is what I’m looking for. I understand it is dangerous usage. Setting the size to 2 makes sense and would be the solution I’d adopt; unfortunately, the fixed string length needs to be exactly 1, a constraint imposed by the file format (Imaris microscopy), so I don’t have much of a choice.