Compound data type with zero-sized dimension


#1

I found that h5py allows creating a dataset with a zero-sized dimension.
For example,

import h5py
import numpy as np

h5f = h5py.File("dummy.h5", 'w')
buf = np.empty(shape=(2,0))
ds = h5f.create_dataset('/var1', data=buf)

But h5py does not allow creating a compound data type with a zero-sized dimension. Is this feature not supported yet? I am using h5py 3.1.0.

The code below fails at h5f.create_dataset().

dt = np.dtype([('x', 'float', (2,0))])
ds = h5f.create_dataset('/var2', dtype=dt, data=(buf))

The error messages are:

Traceback (most recent call last):
  File "./create_hdf5.py", line 14, in main
    ds = h5f.create_dataset('/var2', dtype=dt, data=(buf))
  File "/python3.6/site-packages/h5py/_hl/group.py", line 148, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/python3.6/site-packages/h5py/_hl/dataset.py", line 89, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1629, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1653, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1680, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1586, in h5py.h5t._c_compound
  File "h5py/h5t.pyx", line 1653, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1685, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1477, in h5py.h5t._c_array
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5t.pyx", line 330, in h5py.h5t.array_create
ValueError: Zero-sized dimension specified (zero-sized dimension specified)


#2

When the text of an h5py exception comes as a two-part message like this one, it typically means that the error was raised by the HDF5 library. I get the following for your example with h5py built from the master branch:

HDF5-DIAG: Error detected in HDF5 (1.13.1) thread 0:
  #000: /Users/ajelenak/Documents/h5py/hdf5/src/H5Tarray.c line 102 in H5Tarray_create2(): zero-sized dimension specified
    major: Invalid arguments to routine
    minor: Bad value
Traceback (most recent call last):
  File "/Users/ajelenak/Documents/h5py/trt.py", line 9, in <module>
    ds = f.create_dataset('/var2', dtype=dt, data=buf)
  File "/Users/ajelenak/Documents/h5py/h5py/h5py/_hl/group.py", line 161, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/Users/ajelenak/Documents/h5py/h5py/h5py/_hl/dataset.py", line 88, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1663, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1687, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1714, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1620, in h5py.h5t._c_compound
  File "h5py/h5t.pyx", line 1687, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1719, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1511, in h5py.h5t._c_array
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5t.pyx", line 347, in h5py.h5t.array_create
ValueError: Zero-sized dimension specified (zero-sized dimension specified)

-Aleksandar


#3

Will HDF5 support this feature in the future?


#4

Hi @wkliao,

It is in fact possible to create a compound dataset with a zero-sized dimension. It could be that your code has an issue or, less likely, that h5py has a bug.

Here is a Python script that demonstrates such a compound dataset using HDFql (I do not know much about h5py, sorry):

# import HDFql module (make sure it can be found by the Python interpreter)
import HDFql

# create an HDF5 file named 'test.h5' and use (i.e. open) it
HDFql.execute("CREATE AND USE FILE test.h5")

# create a compound dataset with zero-sized dimension
HDFql.execute("CREATE DATASET dset AS COMPOUND(m1 AS INT, m2 AS FLOAT)(0)")

After running this script, you should have a file named test.h5 containing a compound dataset named dset. When running h5dump on it, the output is as follows:

HDF5 "test.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_COMPOUND {
         H5T_STD_I32LE "m1";
         H5T_IEEE_F32LE "m2";
      }
      DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }
      DATA {
      }
   }
}
}

#5

I was actually hoping for the following h5dump output. In this example, the compound data type contains two 2D arrays, and the second has a zero-sized dimension. I am not familiar with HDFql. Could you try to create such an example?

HDF5 "test.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_COMPOUND {
         H5T_ARRAY { [2][2] H5T_IEEE_F32LE } "arr1";
         H5T_ARRAY { [2][0] H5T_IEEE_F32LE } "arr2";
      }
      DATASPACE  SCALAR
   }
}
}

#6

Hi @wkliao,

I just tried with HDFql and, unfortunately, it does not seem possible to create a member (of a compound dataset) with a zero-sized dimension. The following error message is returned when trying to do so:

HDF5-DIAG: Error detected in HDF5 (1.8.22) thread 0:
  #000: H5Tarray.c line 126 in H5Tarray_create2(): zero-sized dimension specified
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.22) thread 0:
  #000: H5Tcompound.c line 350 in H5Tinsert(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type

There is probably a reason why a dataset can have a zero-sized dimension while a member of a compound dataset cannot, but I am not sure what it is.


#7

As far as I recall, the H5T_ARRAY datatype generator never supported zero-sized (“degenerate”) extents. Do you have a use case?

G.


#8

In my case, a simulation program creates a file containing many 2D datasets organized into groups. The size of the 2nd dimension of a dataset can be zero or larger. The simulation output file is then used by a downstream ML application, which reads all datasets into NumPy arrays or torch tensors. I would like to change the storage layout to use compound data types, i.e. one compound data type for each group.


#9

Without knowing what kind of data those datasets store and what the physical meaning of their dimensions is, it is difficult to offer any suggestions.

The HDF5 array datatype cannot allow zero-sized dimensions because the number of bytes each dataset element occupies is critical information, and an array datatype cannot be resized after it is created.
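
To make the element-size point concrete, here is a minimal sketch of the arithmetic only. The helper array_member_nbytes is hypothetical and is not part of h5py or HDF5, and this is not a description of how the library computes sizes internally. It assumes an array member occupies the product of its extents times the size of its base type, using the [2][2] and [2][0] float32 members from the h5dump output requested earlier in the thread:

```python
from math import prod


def array_member_nbytes(dims, base_nbytes):
    """Bytes taken by one fixed-size array member (hypothetical helper).

    Assumes the member occupies prod(dims) * base_nbytes bytes.
    """
    return prod(dims) * base_nbytes


# [2][2] of 32-bit floats, as in member "arr1"
arr1_nbytes = array_member_nbytes((2, 2), 4)

# [2][0] of 32-bit floats, as in member "arr2": any zero extent
# makes the whole member occupy zero bytes
arr2_nbytes = array_member_nbytes((2, 0), 4)

print(arr1_nbytes, arr2_nbytes)
```

Under that assumption, a member with any zero extent contributes no bytes to the element, and because the extents are fixed when the datatype is created, it could not grow later.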

-Aleksandar


#10

Here is an example output of a compound data type. In this case, no array has a zero-sized dimension. In other cases, some arrays may have a zero-sized dimension; for example, edge_index_3d_u can have dimensions [2][0]. Whether an array has a zero-sized dimension is determined by the simulation. The structure of the compound type remains the same for all datasets, except for the 2nd dimension size of each array.

HDF5 "/scratch/train.h5" {
GROUP "/" {
   DATASET "0" {
      DATATYPE  H5T_COMPOUND {
         H5T_ARRAY { [2][2939] H5T_STD_I64LE } "edge_index_3d_u";
         H5T_ARRAY { [2][2879] H5T_STD_I64LE } "edge_index_3d_v";
         H5T_ARRAY { [2][708] H5T_STD_I64LE } "edge_index_3d_y";
         H5T_ARRAY { [2][3386] H5T_STD_I64LE } "edge_index_u";
         H5T_ARRAY { [2][2960] H5T_STD_I64LE } "edge_index_v";
         H5T_ARRAY { [2][1936] H5T_STD_I64LE } "edge_index_y";
         H5T_STD_I64LE "n_sp";
         H5T_ARRAY { [568][9] H5T_IEEE_F32LE } "x_u";
         H5T_ARRAY { [497][9] H5T_IEEE_F32LE } "x_v";
         H5T_ARRAY { [327][9] H5T_IEEE_F32LE } "x_y";
         H5T_ARRAY { [568] H5T_STD_I64LE } "y_i_u";
         H5T_ARRAY { [497] H5T_STD_I64LE } "y_i_v";
         H5T_ARRAY { [327] H5T_STD_I64LE } "y_i_y";
         H5T_ARRAY { [568] H5T_STD_I64LE } "y_s_u";
         H5T_ARRAY { [497] H5T_STD_I64LE } "y_s_v";
         H5T_ARRAY { [327] H5T_STD_I64LE } "y_s_y";
      }
      DATASPACE  SCALAR
   }

#11

Hi @wkliao,

Thanks for sharing your output’s format. It is quite unconventional for me. I’d turn each of the H5T_ARRAY fields into a separate HDF5 dataset – /0/edge_index_3d_u, /0/edge_index_3d_v, etc. – because this enables you to resize the datasets based on the simulation results. And h5py gives you a NumPy array for each of the datasets. Seems like you have an ephemeral need for the data in HDF5 between simulations and ML training.
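
A minimal sketch of that layout, under stated assumptions: the file name sketch.h5, the three member names picked from the dump, and all shapes are illustrative only, and the file uses the in-memory core driver so nothing is written to disk. One array is given a zero-sized second dimension to show that a plain dataset accepts it, as in the first post:

```python
import h5py
import numpy as np

# Hypothetical per-sample arrays; names follow the train.h5 dump,
# shapes are made up for illustration
sample = {
    "edge_index_3d_u": np.empty((2, 0), dtype=np.int64),  # zero-sized 2nd dimension
    "edge_index_3d_v": np.zeros((2, 5), dtype=np.int64),
    "x_u": np.zeros((3, 9), dtype=np.float32),
}

# In-memory file (core driver without a backing store)
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("0")
    for name, arr in sample.items():
        # One dataset per array: /0/edge_index_3d_u, /0/edge_index_3d_v, ...
        grp.create_dataset(name, data=arr)

    # Each dataset keeps its own shape, including the zero-sized one
    shapes = {name: grp[name].shape for name in sample}

print(shapes)
```

Each dataset can then be read back on its own, so the zero-sized case needs no special handling in the datatype.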

-Aleksandar


#12

Storing the data with zero-sized dimensions as separate datasets would require adding lots of checks and would increase the total number of datasets in a file. I prefer not to add such overhead. Maybe I should submit a feature request in the HDF5 GitHub repo?


#13

What would you request? Zero-dimension H5T_ARRAY datatype?


#14

Yes. I assume it is currently not supported.


#15

@wkliao, from your example train.h5, it looks like your simulation generates arrays of many different sizes. I suggest the natural way to organize this in HDF5 is to use HDF5 groups containing datasets, not compound data types containing arrays. Groups and datasets will support the zero-size arrays that you need.


#16

My program already uses the group/dataset approach. Switching to compound data types was suggested by @steven in this discussion, as it can improve I/O performance for me.