What do you want to see in "HDF5 2.0"?

I would like to ask for support for missing data, i.e. an array of logical size N_1 that contains only N_2 < N_1 items, with the remaining entries masked, typically because the data is missing.

This would be helpful for at least two use cases:

  • Experimental measurements where data is missing, e.g. electron backscatter diffraction (EBSD). Ping @mike.jackson
  • Modular simulation tools where output exists only on part of the domain, e.g. when a refined model is used locally.

NumPy's masked array (see "Masked arrays" in the NumPy v1.26 Manual) offers related functionality, but the masked data is actually there and can be accessed. Storing a masked array in HDF5 can therefore be done with two arrays (one with values and one with Booleans), potentially lumped together as a compound type. My proposal differs in that only the existing data is stored, to reduce the memory footprint. So the idea could also be seen as a way of storing sparse matrices (see Sparse matrix - Wikipedia).
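
To make the difference concrete, here is a rough sketch in plain NumPy. The sample data, the array names, and the `restore` helper are made up for illustration; in a hand-written solution each array would presumably become its own HDF5 dataset. This is only one possible layout (an index list plus values), not what HDF5 does or would do internally.

```python
import numpy as np

# Hypothetical 1-D measurement with logical size N_1 = 8; NaN marks missing entries.
raw = np.array([1.5, np.nan, np.nan, 4.0, np.nan, 6.25, np.nan, np.nan])

# Approach 1: masked array. Values and Boolean mask both have length N_1.
masked = np.ma.masked_invalid(raw)
dense_values = masked.filled(0.0)        # placeholder 0.0 at masked positions
dense_mask = np.ma.getmaskarray(masked)  # True where data is missing

# Approach 2: keep only the N_2 existing items plus their positions.
indices = np.flatnonzero(~dense_mask)    # positions of the existing items
values = masked.compressed()             # the existing items themselves


def restore(indices, values, size, fill=np.nan):
    """Rebuild the logical array of length `size` from the compact form."""
    out = np.full(size, fill, dtype=values.dtype)
    out[indices] = values
    return out


restored = restore(indices, values, raw.size)
```

With the first approach the stored size grows with N_1, with the second it grows with N_2, at the price of storing one index per existing item.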

The advantage over hand-written solutions would be that HDF5 could use the most appropriate data structure/layout in the backend.

Is this proposal along the lines of what you are looking for? https://docs.hdfgroup.org/hdf5/rfc/RFC_Sparse_Chunks180830.pdf
