What do you want to see in "HDF5 2.0"?

I would like to ask for support for missing data, i.e. an array of logical size N_1 that contains only N_2 < N_1 items, with the remaining entries masked, typically because the data is missing.

This would be helpful for at least two use cases:

  • Experimental measurements where data is missing, e.g. Electron backscatter diffraction (EBSD). Ping @mike.jackson
  • Modular simulation tools, where output exists only on part of the domain, e.g. when locally a refined model is used.

NumPy's masked array (Masked arrays — NumPy v1.26 Manual) offers related functionality, but the masked data is still present and can be accessed. Storing a masked array in HDF5 can therefore be done with two arrays (one with values and one with Booleans), potentially combined into a compound type. My proposal differs in that only the existing data is stored, which reduces the memory footprint. The idea could also be seen as a way of storing sparse matrices (see Sparse matrix - Wikipedia).

The advantage over hand-written solutions would be that HDF5 could use the most appropriate data structure/layout in the backend.
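
To make the difference concrete, here is a minimal NumPy sketch of the two layouts: the full-size values-plus-mask pair that can be stored today, and a hand-written compact form that keeps only the existing entries and their indices. The variable names and the `restore` helper are my own illustration, not part of NumPy or HDF5, and the example is limited to a 1-D array. Writing the arrays to an HDF5 file is left out on purpose.

```python
import numpy as np

# Logical array of size N_1 = 8 in which only N_2 = 3 entries exist.
values = np.array([1.5, 0.0, 0.0, 2.5, 0.0, 0.0, 0.0, 4.0])
missing = np.array([False, True, True, False, True, True, True, False])
masked = np.ma.masked_array(values, mask=missing)

# Layout A (possible today): two full-size arrays, values plus Boolean mask.
dense_values = masked.filled(np.nan)     # N_1 floats, NaN where data is missing
dense_mask = np.ma.getmaskarray(masked)  # N_1 Booleans

# Layout B (hand-written version of the proposal): only the N_2 existing
# values and their positions in the logical array.
present_idx = np.flatnonzero(~dense_mask)
present_val = masked.compressed()


def restore(shape, idx, val):
    """Rebuild a 1-D masked array from the compact form (hypothetical helper)."""
    out = np.ma.masked_all(shape, dtype=val.dtype)
    out[idx] = val  # assigning a value unmasks that entry
    return out


restored = restore(masked.shape, present_idx, present_val)
print(present_idx, present_val)
print(restored)
```

Layout B is what one has to maintain by hand at the moment. With built-in support, the library could choose the index structure itself instead of each application inventing its own.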


Is this proposal: https://docs.hdfgroup.org/hdf5/rfc/RFC_Sparse_Chunks180830.pdf along the lines of what you are looking for?


Revise the filter plugin framework so that it handles a dynamically loaded filter used with a statically linked HDF5 library. Currently, the error stack the HDF5 library produces when this happens does not make the cause obvious, so we get a lot of questions about what is going on. From a discussion with Jordan:

The last I looked at it, my understanding is that the fact that HDF5 plugins are linked in a shared fashion against HDF5 means that use of them with static builds causes two copies of the library to exist in memory, one static and one shared. Since the plugins use public symbols from HDF5 that have global state, such as H5I IDs, this causes issues where the plugins try to use memory from an uninitialized (shared) HDF5, while the IDs they’re trying to look up exist in the memory space of the static library.

I believe most plugin frameworks don’t have this issue because they either don’t expose symbols used by the library a plugin is linking to or expose symbols that have constant values that don’t depend on a library being initialized to look up a value.

I’ll have to prototype it, but the ID lookups and the global state are the main issues with this not working correctly. If a plugin could call back into the library to resolve an ID, most problems would be fixed.
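
For readers who have not run into this, here is a purely conceptual toy model of the problem and of the callback idea. It does not use the HDF5 API: `LibraryCopy`, `plugin_linked_to_shared` and `plugin_with_callback` are hypothetical names, and the two objects only stand in for the static and shared copies of the library's ID state.

```python
class LibraryCopy:
    """Toy stand-in for one in-memory copy of a library with global ID state."""

    def __init__(self, name):
        self.name = name
        self._ids = {}
        self._next_id = 1

    def register(self, obj):
        new_id = self._next_id
        self._next_id += 1
        self._ids[new_id] = obj
        return new_id

    def resolve(self, obj_id):
        if obj_id not in self._ids:
            raise KeyError(f"ID {obj_id} is unknown to the {self.name} copy")
        return self._ids[obj_id]


static_copy = LibraryCopy("static")  # linked into the application, holds the IDs
shared_copy = LibraryCopy("shared")  # pulled in by the plugin, never initialized


def plugin_linked_to_shared(obj_id):
    # Mimics today's situation: the plugin consults its own (shared) copy.
    return shared_copy.resolve(obj_id)


def plugin_with_callback(obj_id, resolve):
    # Mimics the idea above: the caller hands the plugin a way to resolve IDs.
    return resolve(obj_id)


dataset_id = static_copy.register("dataset properties")

try:
    plugin_linked_to_shared(dataset_id)
    direct_lookup_ok = True
except KeyError as err:
    direct_lookup_ok = False
    print("direct lookup failed:", err)

resolved = plugin_with_callback(dataset_id, static_copy.resolve)
print("callback lookup returned:", resolved)
```

The ID is valid, but only in the copy that created it. Passing the resolver in from the calling side removes the plugin's dependence on which copy it happens to be linked against.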
