I’ve seen some posts about memory mapping, and I’ve been thinking about how users may be using this.
Abstracting this, what would be useful is a separation between the layout and the actual reading and writing of the data. Working near instrumentation and data acquisition, I often have files with exactly the same layout, where the data is much larger than the metadata.
Thus, I would like to have a layout phase where I just describe the type and shape of the attributes and data.
At some later point in time, I would like to actually write the data without changing the layout. The layout and metadata (and perhaps compact datasets) become read-only while the data can be modified in place. I think this could greatly simplify parallel I/O. If I know that the layout is immutable, then it should be relatively safe to memory map the data portion of the file. This memory-mapped data section can then be read and written from multiple threads.
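To make the two phases concrete, here is a minimal sketch in Python using only the standard library. It is a toy file format, not HDF5: the JSON header, `HEADER_SIZE`, and every function name are hypothetical and exist only for illustration. In a real HDF5 file, the byte offset of a contiguous dataset would have to come from the library instead (the C API has `H5Dget_offset` for this). The sketch assumes each thread writes a disjoint range of rows, so no locking is needed.

```python
import json
import mmap
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical header size: one mmap allocation unit, so the data region
# starts at an offset that mmap accepts on this platform.
HEADER_SIZE = mmap.ALLOCATIONGRANULARITY


def write_layout(path, n_rows, n_cols, fmt="d"):
    """Layout phase: describe type and shape and preallocate the data region."""
    nbytes = n_rows * n_cols * struct.calcsize(fmt)
    layout = {"format": fmt, "shape": [n_rows, n_cols],
              "offset": HEADER_SIZE, "nbytes": nbytes}
    header = json.dumps(layout).encode("ascii")
    if len(header) > HEADER_SIZE:
        raise ValueError("layout description does not fit in the header")
    with open(path, "wb") as f:
        f.write(header.ljust(HEADER_SIZE, b"\0"))  # layout and metadata
        f.write(bytes(nbytes))                     # zero-filled data region
    return layout


def read_layout(path):
    """Read the layout back without touching the data region."""
    with open(path, "rb") as f:
        return json.loads(f.read(HEADER_SIZE).rstrip(b"\0").decode("ascii"))


def fill_rows(view, n_cols, start, stop):
    """Worker: write rows [start, stop) in place."""
    for r in range(start, stop):
        for c in range(n_cols):
            view[r * n_cols + c] = float(r * n_cols + c)


def write_data(path, n_threads=4):
    """Write phase: map only the data region and fill it from several threads."""
    layout = read_layout(path)
    n_rows, n_cols = layout["shape"]
    with open(path, "r+b") as f:
        # The header lies outside the mapping, so it cannot be modified here.
        mm = mmap.mmap(f.fileno(), layout["nbytes"],
                       access=mmap.ACCESS_WRITE, offset=layout["offset"])
        raw = memoryview(mm)
        view = raw.cast(layout["format"])
        try:
            step = -(-n_rows // n_threads)  # ceiling division
            with ThreadPoolExecutor(max_workers=n_threads) as pool:
                futures = [pool.submit(fill_rows, view, n_cols,
                                       s, min(s + step, n_rows))
                           for s in range(0, n_rows, step)]
                for fut in futures:
                    fut.result()  # re-raise any worker exception
        finally:
            view.release()
            raw.release()
            mm.flush()
            mm.close()


def read_data(path):
    """Read the data region back through ordinary file I/O."""
    layout = read_layout(path)
    with open(path, "rb") as f:
        f.seek(layout["offset"])
        buf = f.read(layout["nbytes"])
    count = layout["nbytes"] // struct.calcsize(layout["format"])
    return list(struct.unpack(f"{count}{layout['format']}", buf))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "acquisition.bin")
    write_layout(path, n_rows=8, n_cols=3)
    write_data(path)
    print(read_layout(path)["shape"], read_data(path)[:6])
```

Pure-Python threads will not speed this up much because of the interpreter lock; the point is the access pattern. Because the file size and header never change after the layout phase, each writer only needs the offset and shape to know where its bytes go.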
Utilities such as h5ls do help to elucidate the file structure and layout. Ultimately, I would like to see this information made more accessible via the API. One gripe I’ve observed is that the tools often end up using private APIs. It would be great if all the tools could be made to use only public APIs. If additional APIs need to be made public to support the tools, then I think this would be an overall benefit. In part, this is driving my interest in
The overall idea is to make it very clear to the user where the data lives in the file and how it is stored. While the HDF5 specification is open, it can be very hard for a normal user to interpret, forcing reliance on the API for all I/O.