Variable length list of strings


#1

Hi all,

I have a dataset of chemical data that requires ragged arrays. For the numerical values, I handled the ragged arrays by creating a variable-length datatype:

dt = h5py.vlen_dtype(np.dtype('float64'))

However, when processing the atomic symbols, I get a TypeError:

TypeError: Can't implicitly convert non-string objects to strings

It happens when I attempt to create a dataset with values like the following (three arrays shown as an example):

 array(['O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O',
        'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H',
        'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H',
        'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H'], dtype='<U1')
 array(['O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O',
        'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H',
        'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H',
        'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O',
        'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H',
        'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H',
        'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O',
        'H', 'H', 'O', 'H', 'H'], dtype='<U1')
 array(['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'H', 'H', 'H', 'H', 'H',
        'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'], dtype='<U1')

I have a custom datatype string_dt = h5py.special_dtype(vlen=str), but I still get the TypeError no matter what. My use case requires ragged arrays, so any help would be appreciated!

Ray


#2

Hello @rayschireman,

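What you are trying to store is an array of one-character strings, which is different from a single variable-length string. Each row therefore needs a variable-length sequence whose base type is a one-character string, not a variable-length string type. Below is sample code to achieve what you want: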

import string

import h5py
import numpy as np

# Length of each row in the ragged array
vlength = [3, 8, 6, 4]
# Variable-length sequence whose elements are one-byte strings
dt = h5py.vlen_dtype(np.dtype('S1'))
with h5py.File('char-ragged-array.h5', 'w') as f:
    dset = f.create_dataset('ragged_char_array', shape=(len(vlength),), dtype=dt)
    for i, n in enumerate(vlength):
        # Random lowercase letters, converted from '<U1' to 'S1' before writing
        dset[i] = np.random.choice(list(string.ascii_lowercase), size=n).astype('S1')

The output of the h5dump command for the created file:

HDF5 "char-ragged-array.h5" {
GROUP "/" {
   DATASET "ragged_char_array" {
      DATATYPE  H5T_VLEN { H5T_STRING {
         STRSIZE 1;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }}
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): ("z", "j", "w"), ("m", "n", "h", "h", "v", "a", "b", "b"),
      (2): ("s", "u", "g", "q", "r", "s"), ("j", "u", "f", "m")
      }
   }
}
}

Note that when you read these ragged arrays of one-character strings back, their elements will be bytes objects, not str.
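To connect this back to the atomic symbols in the question, here is a minimal sketch of a write and read round trip that converts the '<U1' arrays to 'S1' on the way in and back to str on the way out. The file name, the dataset name, and the two short example rows are made up for illustration, and it assumes every symbol is a single ASCII character; two-letter symbols such as 'He' would be truncated by 'S1' and would presumably need a wider base type such as 'S2'.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative rows only (not the real dataset): two molecules of different size
symbols = [
    np.array(['O', 'H', 'H'], dtype='<U1'),
    np.array(['O', 'O', 'H', 'H', 'H', 'H'], dtype='<U1'),
]

# Variable-length sequence of one-byte strings, as in the sample code above
dt = h5py.vlen_dtype(np.dtype('S1'))

# Hypothetical file location in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'symbols.h5')

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('symbols', shape=(len(symbols),), dtype=dt)
    for i, row in enumerate(symbols):
        # Convert unicode ('<U1') to fixed-length bytes ('S1') before writing
        dset[i] = row.astype('S1')

with h5py.File(path, 'r') as f:
    # Each row comes back as an array of bytes; cast to 'U1' to get str again
    decoded = [row.astype('U1') for row in f['symbols'][:]]

as_lists = [row.tolist() for row in decoded]
print(as_lists)
```

The cast with astype('U1') decodes the bytes as ASCII, which is enough for element symbols; data with non-ASCII characters would need an explicit decode step instead.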

Take care,

Aleksandar