I would like to ask for support for missing data, i.e. an array of logical size N_1 that contains only N_2 < N_1 items, with the remaining entries masked, typically because the data is missing.
This would be helpful for at least two use cases:
Experimental measurements where data is missing, e.g. Electron backscatter diffraction (EBSD). Ping @mike.jackson
Modular simulation tools, where output exists only on part of the domain, e.g. when locally a refined model is used.
NumPy's masked array (Masked arrays — NumPy v1.26 Manual) offers related functionality, but the masked data is actually there and can be accessed, so storing a masked array can already be done in HDF5 with two arrays (one with values and one with Booleans), potentially lumped together as a compound type. My proposal differs in that only the existing data would be stored, to reduce the memory footprint. So the idea could also be seen as a way of storing sparse matrices (see Sparse matrix - Wikipedia).
The advantage over hand-written solutions would be that HDF5 could use the most appropriate data structure/layout in the backend.
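For reference, a minimal sketch of the hand-written approach described above, assuming a made-up layout with a `values` dataset, an `indices` dataset, and a `logical_size` attribute (all names are invented); only the existing items are stored:

```c
#include "hdf5.h"

int main(void)
{
    /* Only the N_2 = 3 existing items out of a logical N_1 = 10 are stored,
     * together with their positions in the logical array. */
    hsize_t n_present  = 3;
    hsize_t n_logical  = 10;
    double  values[3]  = {1.0, 2.5, 4.2};
    hsize_t indices[3] = {0, 4, 7};

    hid_t file  = H5Fcreate("masked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &n_present, NULL);

    hid_t d_val = H5Dcreate2(file, "values", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t d_idx = H5Dcreate2(file, "indices", H5T_NATIVE_HSIZE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(d_val, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);
    H5Dwrite(d_idx, H5T_NATIVE_HSIZE,  H5S_ALL, H5S_ALL, H5P_DEFAULT, indices);

    /* Record the logical size N_1 so a reader can reconstruct the mask. */
    hid_t ascalar = H5Screate(H5S_SCALAR);
    hid_t attr    = H5Acreate2(d_val, "logical_size", H5T_NATIVE_HSIZE,
                               ascalar, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_HSIZE, &n_logical);

    H5Aclose(attr); H5Sclose(ascalar);
    H5Dclose(d_val); H5Dclose(d_idx); H5Sclose(space); H5Fclose(file);
    return 0;
}
```

The point of the request is that the library, not the user, would pick the best layout (bitmap, index list, chunk-level defaults, …) behind a single dataset.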
Revise the filter plugin framework to handle using a dynamically loaded filter with a statically linked library. Currently, the HDF5 library produces an error stack that is not obvious when this happens, so we get a lot of questions about what is going on. From a discussion with Jordan:
The last I looked at it, my understanding is that the fact that HDF5 plugins are linked in a shared fashion against HDF5 means that use of them with static builds causes two copies of the library to exist in memory, one static and one shared. Since the plugins use public symbols from HDF5 that have global state, such as H5I IDs, this causes issues where the plugins try to use memory from an uninitialized (shared) HDF5, while the IDs they’re trying to lookup exist in the memory space of the static library.
…
I believe most plugin frameworks don’t have this issue because they either don’t expose symbols used by the library a plugin is linking to or expose symbols that have constant values that don’t depend on a library being initialized to look up a value.
…
I’ll have to prototype it, but the ID lookups and the global state are the main issues with this not working correctly. If a plugin could call back into the library to resolve an ID, most problems would be fixed.
From the C-interface macros, remove the references to H5CHECK and H5OPEN and require that users explicitly open the interface, as in Fortran.
Add a mechanism so that applications can avoid closing the interface inappropriately. For example, developers writing parts of a larger application should be able to open and close the interface in their respective parts without stepping on each other.
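A rough sketch of what explicit initialization could look like on the C side, using the existing H5open()/H5close() calls; the reference-counting behaviour suggested above is not shown, since it does not exist yet:

```c
#include "hdf5.h"

int main(void)
{
    /* Explicit initialization instead of relying on the H5OPEN/H5CHECK
     * expansion hidden inside the public C macros; mirrors the mandatory
     * h5open_f/h5close_f calls in the Fortran interface. */
    if (H5open() < 0)
        return 1;

    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    /* ... application work ... */
    H5Fclose(file);

    /* Explicit shutdown; a reference-counting scheme would let independent
     * components of a larger application pair their own open/close calls
     * without shutting the library down underneath each other. */
    H5close();
    return 0;
}
```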
Would it be possible to open the library to more use cases than file writing? I have HDF5 as a serialization library in mind. Writing/reading to/from memory is possible, but the library was not really designed for that use case.
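For context, in-memory files are already reachable from the C API through the core file driver; a minimal sketch follows (the buffer increment and all names are arbitrary):

```c
#include "hdf5.h"

int main(void)
{
    /* Core VFD: the "file" lives entirely in memory; with backing_store = 0
     * nothing is ever written to disk. The 1 MiB increment is arbitrary. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, 1 << 20, 0);

    hid_t file = H5Fcreate("in_memory.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims    = 4;
    double  data[4] = {1.0, 2.0, 3.0, 4.0};
    hid_t   space   = H5Screate_simple(1, &dims, NULL);
    hid_t   dset    = H5Dcreate2(file, "payload", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset); H5Sclose(space); H5Fclose(file); H5Pclose(fapl);
    return 0;
}
```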
What about adding an HDF5 schema part to the library? This would help to separate content and schema; the latter could be transferred separately, which would be especially useful for the serialization use case.
I am thinking out loud and might be asking for HDF6…
I see serialization as the layer between the data and the transport layer but the ultimate goal is indeed to transfer data from one place to another continuously, aka streaming.
HDF5 itself does not have to implement streaming. It could provide this serialization layer (e.g. as an alternative to msgpack, CBOR…). I think the HDF5 data model is well suited for scientific data (with data compression, self-described data, etc.). Maybe not the file format in its current form though, as it cannot be written in a sequential way (e.g. it requires updating blocks).
Hoping this clarifies things a bit and is not too far-fetched…
I think HDF5 streaming is a good topic for discussion at HUG25. Do you want to organize it or know someone who does? There have been several attempts at this, e.g., SWMR(-VFD), xrootd, incremental HDF5 File Image transfers, etc. Perhaps, it’s less of a file format question and more of a file space management question, but that’d be jumping to conclusions without an actual use case. What use case(s) did you have in mind?
I doubt that there’s a lot of appetite for library-level schematization. Self-description is precisely about not separating data and meta-data, although there might be good reasons for maintaining external descriptions (such as hdf5-json) to accommodate the notion of a schema. Of course, that creates a coherence problem, but I don’t think that we want to drag that into the library.
When I think of a schema-like construct for HDF5, I’m reminded of RDF shape expressions. The “structural openness” that comes with HDF5 isn’t easy to capture in more prescriptive terms.
Another difficulty is the level of detail you want to capture in a schema. There would be a reasonable expectation that the assertion of (in-)equality of two schemata would be meaningful. The h5diff saga is the perfect example that illustrates the quagmire you get into, unless you are systematically ambiguous about the meaning of =, at all levels. And, unfortunately, datatype convertibility alone can't settle the matter.
I agree that streaming would be a nice topic for HUG25. I won't be able to host it though.
Thanks for the heads-up on the exploratory work!
Regarding a schema-like construct for HDF5, now that I think of it, an HDF5 file with empty data could be the schema; no need to introduce another description. I would think that strict comparison/checking is the way to go.
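A small sketch of that idea, assuming the "schema" file is simply created with groups, datasets, and attributes but no element data ever written (all names, shapes, and chunk sizes are invented):

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("schema.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t grp  = H5Gcreate2(file, "acquisition", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Dataset starts with zero rows but is extendible, so the "schema" file
     * fixes shape, type, and chunking without carrying any element data. */
    hsize_t dims[2]    = {0, 512};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 512};
    hid_t   space      = H5Screate_simple(2, dims, maxdims);

    hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[2] = {64, 512};
    H5Pset_chunk(dcpl, 2, chunk);                /* required for H5S_UNLIMITED */

    hid_t dset = H5Dcreate2(grp, "frames", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Attributes carry the descriptive part of the schema. */
    hid_t ascalar = H5Screate(H5S_SCALAR);
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, 7);
    hid_t attr = H5Acreate2(dset, "units", strtype, ascalar, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, strtype, "counts");

    H5Aclose(attr); H5Tclose(strtype); H5Sclose(ascalar);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Gclose(grp); H5Fclose(file);
    return 0;
}
```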
My use case would be transferring data of a running acquisition from receivers to another process that would combine/process/save the data.
Using ‘in-memory files’ (is this possible from the C API? See the sketch after this list):
Create and transfer an HDF5 file with just the data structure and attributes (as a sort of schema)
Transfer partial (incremental) datasets with additional dataspace information (somewhat similar to a VFD but the other way around).
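A sketch of how the transfer part might look with the existing C API, using H5Fget_file_image on the sender and H5LTopen_file_image on the receiver; the helper names are invented and the transport itself is left out:

```c
#include "hdf5.h"
#include "hdf5_hl.h"
#include <stdlib.h>

/* Sender side: snapshot an open (e.g. core-VFD) file into a byte buffer
 * that can be pushed over whatever transport the receivers use. */
static void *snapshot_file(hid_t file, size_t *size_out)
{
    H5Fflush(file, H5F_SCOPE_GLOBAL);                  /* make the image consistent */
    ssize_t size = H5Fget_file_image(file, NULL, 0);   /* query required size       */
    if (size < 0)
        return NULL;
    void *buf = malloc((size_t)size);
    H5Fget_file_image(file, buf, (size_t)size);        /* copy the file image       */
    *size_out = (size_t)size;
    return buf;
}

/* Receiver side: reopen the transferred bytes as a read-only HDF5 file.
 * With flags = 0 the library makes its own copy of the buffer. */
static hid_t open_image(void *buf, size_t size)
{
    return H5LTopen_file_image(buf, size, 0);
}
```

Incremental dataset transfers (the second bullet) would still need something beyond this, since a file image is a whole-file snapshot rather than a delta.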