strings in a VDS

Hello all,

I’m having difficulty incorporating an array of strings within a Virtual Dataset. I’m not sure whether I should be creating them differently in the original .h5 file or preprocessing them differently when building the VDS.

My actual case is more complicated, but the root of the problem lies in this simple example.

f1 = h5py.File('teststrings.h5', 'w', libver='latest')
strings = ['hello', 'worlds']
f1.create_dataset('text', data=strings, shape=(2,))
# this creates a dataset of h5py special type '|O'
# the same thing happens if I explicitly set dtype=h5py.string_dtype(encoding='utf-8')
f1.close()

f3 = h5py.File('VDStest.h5', 'w', libver='latest')
layout = h5py.VirtualLayout(shape=(2,), dtype='str')
vsource = h5py.VirtualSource('teststrings.h5', 'text', shape=(2,), dtype='str')
layout[:2] = vsource
f3.create_virtual_dataset('text', layout, fillvalue=-1)

This results in a (truncated) error:

File "h5py/h5t.pyx", line 1754, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U')

If I use dtype='|O' in the layout, I get

File "h5py/h5t.pyx", line 1748, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

If I use dtype='bytes', I get

  tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1539, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)

If I attempt to use .asstr() to decode before writing, creating the virtual source fails with:

/compat.py", line 19, in filename_encode
    filename = fspath(filename)
               ^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not AsStrWrapper

I’ve tried a number of permutations on the above while reading the forums, documentation, and Stack Exchange, but something isn’t clicking for me. I’m not sure if I’m just getting caught up in the complicated world of string representations or if this is a bug in VDS creation. Any input welcome!

If it matters, I am working with a recently constructed conda environment:

>>> print(h5py.version.info)
Summary of the h5py configuration
---------------------------------

h5py    3.9.0
HDF5    1.12.1
Python  3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:10:28) [Clang 15.0.7 ]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.24.4
cython (built with) 0.29.36
numpy (built against) 1.23.5
HDF5 (built against) 1.12.1

Hi @markclaire1,

Did you try the h5py.string_dtype(encoding='utf-8') datatype in the above code? Also, it does not make sense to use a fill value of -1 (an integer) for a string dataset.

h5py can be thought of as a bridge between NumPy objects and HDF5 data, so it is much better to specify NumPy dtypes rather than basic Python ones, like str or bytes.

Aleksandar

Hi @markclaire1,

The code below works for me:

import h5py


f1 = h5py.File('teststrings.h5', 'w', libver='latest')
strings = ['hello', 'worlds']
dt = h5py.string_dtype(encoding='utf-8')
f1.create_dataset('text', data=strings, shape=(2,), dtype=dt)
f1.close()

f3 = h5py.File('VDStest.h5', 'w', libver='latest')
layout = h5py.VirtualLayout(shape=(2,), dtype=dt)
vsource = h5py.VirtualSource('teststrings.h5', 'text', shape=(2,), dtype=dt)
layout[:2] = vsource
f3.create_virtual_dataset('text', layout)
f3.close()

Setting a fill value for the virtual dataset text gives me this error:

RuntimeError: Unable to get dataset creation properties (address of object past end of allocation)

Without setting a fill value, the virtual dataset reports b'' as its fill value.

Aleksandar

Aleksandar - thanks so much for helping me engage with this.
(Amazingly, I had written code identical to your suggestion after learning that fillvalue was part of the problem, and was just about to report back - glad I saw your second post first!)

The only remaining issue I see is that the identical code with fillvalue='xxx' yields a complaint:

>>> f3.create_virtual_dataset('text', layout, fillvalue='xxx')
 File "/Users/mclaire/anaconda3/envs/photochem/lib/python3.11/site-packages/h5py/_hl/vds.py", line 233, in make_dataset
    dcpl.set_fill_value(np.array([fillvalue]))
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5p.pyx", line 526, in h5py.h5p.PropDCID.set_fill_value
  File "h5py/h5t.pyx", line 1688, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1754, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U3')

The documentation implies that fillvalue is required for create_virtual_dataset, but I’m happy to learn that it is optional, as all of these strings will be present in the datasets I wish to concatenate. My original problem is solved.

Is it worth reporting an issue about setting a specific fillvalue for string datasets? I ask because you might know a simple workaround that someone else could benefit from.

Thanks again,

Mark

The error I reported is when using something like b'xxx' for the fill value. Below is the full libhdf5 error trace:

HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 0:
  #000: h5py/hdf5/src/H5D.c line 775 in H5Dget_create_plist(): unable to get dataset creation properties
    major: Dataset
    minor: Can't get value
  #001: h5py/hdf5/src/H5VLcallback.c line 2458 in H5VL_dataset_get(): dataset get failed
    major: Virtual Object Layer
    minor: Can't get value
  #002: h5py/hdf5/src/H5VLcallback.c line 2427 in H5VL__dataset_get(): dataset get failed
    major: Virtual Object Layer
    minor: Can't get value
  #003: h5py/hdf5/src/H5VLnative_dataset.c line 469 in H5VL__native_dataset_get(): can't get creation property list for dataset
    major: Dataset
    minor: Can't get value
  #004: h5py/hdf5/src/H5Dint.c line 3665 in H5D_get_create_plist(): datatype conversion failed
    major: Dataset
    minor: Can't convert datatypes
  #005: h5py/hdf5/src/H5T.c line 5308 in H5T_convert(): datatype conversion failed
    major: Datatype
    minor: Can't convert datatypes
  #006: h5py/hdf5/src/H5Tconv.c line 3326 in H5T__conv_vlen(): can't read VL data
    major: Datatype
    minor: Read failed
  #007: h5py/hdf5/src/H5Tvlen.c line 840 in H5T__vlen_disk_read(): unable to get blob
    major: Datatype
    minor: Can't get value
  #008: h5py/hdf5/src/H5VLcallback.c line 7396 in H5VL_blob_get(): blob get failed
    major: Virtual Object Layer
    minor: Can't get value
  #009: h5py/hdf5/src/H5VLcallback.c line 7367 in H5VL__blob_get(): blob get callback failed
    major: Virtual Object Layer
    minor: Can't get value
  #010: h5py/hdf5/src/H5VLnative_blob.c line 119 in H5VL__native_blob_get(): unable to read VL information
    major: Virtual Object Layer
    minor: Read failed
  #011: h5py/hdf5/src/H5HG.c line 560 in H5HG_read(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #012: h5py/hdf5/src/H5HG.c line 235 in H5HG__protect(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #013: h5py/hdf5/src/H5AC.c line 1277 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #014: h5py/hdf5/src/H5Centry.c line 3126 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #015: h5py/hdf5/src/H5Centry.c line 1025 in H5C__load_entry(): invalid len with respect to EOA
    major: Object cache
    minor: Bad value
  #016: h5py/hdf5/src/H5Centry.c line 945 in H5C__verify_len_eoa(): address of object past end of allocation
    major: Object cache
    minor: Bad value

Your error comes from NumPy’s conversion of Python strings to its own fixed-width Unicode datatype, which is not compatible with HDF5 string datatypes, so h5py raises an exception.
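
A quick way to see the mismatch (illustrative snippet):

```python
import numpy as np
import h5py

# NumPy wraps a Python string in its fixed-width Unicode dtype,
# which has no HDF5 equivalent -- hence "No conversion path for dtype('<U3')"
print(np.array(['xxx']).dtype)  # dtype('<U3')

# h5py's string dtype is an object dtype carrying HDF5 string metadata
dt = h5py.string_dtype(encoding='utf-8')
print(h5py.check_string_dtype(dt))  # string_info(encoding='utf-8', length=None)
```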

Is it worth reporting an issue about setting a specific fillvalue for string datasets? I ask because you might know a simple workaround that someone else could benefit from.

Yes. I have not used virtual datasets, so I don’t know if there is a workaround. Also, I am not sure how common it is to work with string virtual datasets in h5py.

Aleksandar