HDF5 file data.shape inconsistent with dataset shape

First post, so please excuse my ignorance. I can’t seem to find an answer to my problem through searching. Can someone steer me in the right direction?

I have a simple script that runs in parallel through mpi4py. On each rank I create an array of random length (2 to 10 elements) filled with random integers in the range (1, 100). Since this runs in parallel (launched from the terminal with mpirun -np ...), each rank holds its own random-length array. I then create an HDF5 file with a group for each rank, and a dataset in each group. I set every dataset's shape to the maximum of all the ranks' array lengths. Each dataset then gets data assigned to indices 0 through length-1, with the remainder left as zeros. Finally, I resize the datasets so that the padding zeros are no longer kept.
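For reference, here is a serial sketch of those same steps (plain h5py, no MPI, since the attached demo_h5.py is not inlined here; the file path and the loop over simulated ranks are my own stand-ins, not the actual script):

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng()

# Serial stand-in for the per-rank arrays: random length 2-10,
# random integers in the range (1, 100).
n_ranks = 4
arrays = [rng.integers(1, 100, size=int(rng.integers(2, 11)))
          for _ in range(n_ranks)]
max_length = max(len(a) for a in arrays)

path = os.path.join(tempfile.gettempdir(), "parallel_test_demo.hdf5")

with h5py.File(path, "w") as f:
    for rank, data in enumerate(arrays):
        grp = f.create_group(f"rank{rank}")
        # Dataset sized to the global maximum, resizable along axis 0.
        dset = grp.create_dataset(
            "data", shape=(max_length,), maxshape=(None,), dtype="i4"
        )
        dset[: len(data)] = data      # fill the first len(data) entries
        dset.resize((len(data),))     # trim the zero padding

# Read back: in this serial version the resize persists on disk.
with h5py.File(path, "r") as f:
    for rank, data in enumerate(arrays):
        print(rank, max_length, len(data), data, f[f"rank{rank}/data"][:])
```

In serial use, the resized shape is what ends up on disk; the question is why the same sequence behaves differently under mpirun.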

Prior to closing the file, I print out the rank number, max data length, length of array on rank, the data array, and the dataset (after being resized). The output is shown below:

user123:~> mpirun -np 4 python3 demo_h5.py
rank max_length length data dataset
3 8 8 [69 87 84 17 25 17 9 70] [69 87 84 17 25 17 9 70]
2 8 8 [69 70 1 70 89 71 15 84] [69 70 1 70 89 71 15 84]
0 8 4 [55 45 50 12] [55 45 50 12]
1 8 5 [23 23 23 39 78] [23 23 23 39 78]

The Issue
I then take a look at the saved file with h5dump to see if the datasets match what was printed above. Note that rank 0 should have a shape of (4,) but does not. It's still padded with zeros. My Python script is attached at the bottom of this post for reference, along with my version info.

Help?
I must be missing something. I'm attempting to write varying-length arrays in parallel, without padding. Surely there's a straightforward way to do this.

user123:~> h5dump parallel_test.hdf5
HDF5 "parallel_test.hdf5" {
GROUP "/" {
   GROUP "rank0" {
      DATASET "data" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 8 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): 55, 45, 50, 12, 0, 0, 0, 0
         }
      }
   }
   GROUP "rank1" {
      DATASET "data" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): 23, 23, 23, 39, 78
         }
      }
   }
   GROUP "rank2" {
      DATASET "data" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 8 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): 69, 70, 1, 70, 89, 71, 15, 84
         }
      }
   }
   GROUP "rank3" {
      DATASET "data" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 8 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): 69, 87, 84, 17, 25, 17, 9, 70
         }
      }
   }
}
}

>>> print(h5py.version.info)
Summary of the h5py configuration
---------------------------------
h5py 3.10.0
HDF5 1.12.0
Python 3.9.12 (main, May 6 2022, 16:05:36)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-4)]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.26.1
cython (built with) 0.29.36
numpy (built against) 1.19.3
HDF5 (built against) 1.12.0

demo_h5.py (820 Bytes)

Hi,

First of all, thank you for attaching the sample Python script and interesting output.

I ran the same test script using GitHub Actions with the latest h5py and the HDF5 develop branch on ubuntu-latest.
Here’s my output with 2 ranks:

rank  max length   length  data	 dataset
1 	 8 	    5  	  [28 24 69 30 58] [28 24 69 30 58]
0 	 8 	    8  	  [85 10  6 52 21 27 56 88] [85 10  6 52 21 27 56 88]
HDF5 "parallel_test.hdf5" {
GROUP "/" {
   GROUP "rank0" {
      DATASET "data" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 8 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): 85, 10, 6, 52, 21, 27, 56, 88
         }
      }
   }
   GROUP "rank1" {
      DATASET "data" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 5 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): 28, 24, 69, 30, 58
         }
      }
   }
}
}

I don't see any inconsistency.
Could you try again with the latest HDF5 and h5py?
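After upgrading, it's worth double-checking which versions actually load at runtime (a quick sketch using the documented h5py.version module; nothing else is assumed):

```python
import h5py

# h5py.version exposes both the wrapper version and the version of
# the underlying HDF5 library it was built against.
print("h5py:", h5py.version.version)
print("HDF5:", h5py.version.hdf5_version)
```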

Here’s my Action script and output that can help you duplicate my result easily.

P.S. I wish I could run the test with 4 ranks, but GitHub-hosted runners have limited cores.