Memory mapping / Paging via HDF5 & VFD?


Per the discussion in today’s HUG21 meeting, the VFD layer is currently being revised and reworked, including changes to the HDF5 API. So maybe this is an opportunity to also build in memory-mapping capabilities, such that an H5Dread() does not actually read anything, but just exposes a memory location where the file contents are paged in and out? Obviously, that would only work for uncompressed/unfiltered data, and would probably require new public HDF5 APIs.

I am thinking about a function like

herr_t H5Dread_mapped(..., void **buf);

in contrast to the current herr_t H5Dread(..., void *buf); such that the new _mapped function provides a pointer to where the data resides in memory, rather than filling a user-allocated buffer into which the HDF5 library / VFD has to copy the data. Obviously, a VFD needs to support such memory-mapping capabilities as well.

If the memory location is writable, then modifying the data in memory would supersede explicit H5Dwrite() calls. It would probably require some “touching” function to tell HDF5 that the respective dataset has been updated; at the very least, this interacts with the timestamps stored in the HDF5 file.
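To illustrate what "modify in memory, then touch" could look like at the operating-system level, here is a minimal POSIX sketch. It does not use HDF5 at all: map_shared_rw() and touch_mapped() are hypothetical names made up for this post, and the sketch assumes a plain file on a POSIX system. Stores into a MAP_SHARED mapping reach the file without any write() call, and msync() plays the role of the explicit "touch". Note that HDF5 itself would know nothing about such a change, which is exactly why a library-level notification would be needed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not an HDF5 function): map a whole file
 * read-write and shared, so that stores go back to the file.
 * Returns NULL on failure; on success *length holds the mapped size. */
static void *map_shared_rw(const char *path, size_t *length)
{
    assert(length != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return NULL;

    *length = (size_t)st.st_size;
    return base;
}

/* Hypothetical "touch" (not an HDF5 function): synchronously flush
 * modified pages of the mapping to the file. Returns 0 on success. */
static int touch_mapped(void *base, size_t length)
{
    return msync(base, length, MS_SYNC);
}
```

Usage would be along the lines of: map the file, assign to elements through the returned pointer, call touch_mapped(), then munmap() when done.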

In cases where access to uncompressed, unfiltered datasets, or at least chunks thereof, is all that is needed, such memory-mapping functionality would be quite useful and should perform well, since it avoids any memcpy() calls.

A similar idea was discussed in “HDF5 and memory mapped files on Windows”, but that one would memory-map the entire file and still require copying data at H5Dread() from the memory-mapped file to the application-provided buffer.


I’m also interested in seeing how this could happen.

I think memory mapped datasets have the potential of being quite powerful.

Interested to see how this evolves.


I’m not entirely sure that we need built-in memory mapping capabilities. What we really need is the ability to allocate space in the HDF5 file without writing. There is some limited capability to do this by setting the allocation time to early (H5Pset_alloc_time() with H5D_ALLOC_TIME_EARLY) and the fill time to never (H5Pset_fill_time() with H5D_FILL_TIME_NEVER). However, one also needs an interface to allocate or reallocate chunks.

If HDF5 provides offset and size data for datasets or chunks, then it would not be hard to turn file locking off and write or read the raw data via memory mapping.
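As a rough sketch of that last step, assuming the byte offset and size of the raw data are already known: for contiguous datasets I believe H5Dget_offset() and H5Dget_storage_size() report these values, and newer releases have H5Dget_chunk_info() for chunks, but the code below deliberately takes them as plain parameters and does not call HDF5. The raw_map_t type and the raw_map_open() / raw_map_close() functions are hypothetical names, and the code assumes a POSIX system. The main subtlety is that mmap() needs a page-aligned file offset, while dataset offsets inside an HDF5 file generally are not page-aligned.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped region (not an HDF5 type). */
typedef struct {
    void  *base;   /* page-aligned address returned by mmap() */
    size_t length; /* length passed to mmap(), needed for munmap() */
    void  *data;   /* first byte of the requested region */
} raw_map_t;

/* Map `size` bytes starting at byte `offset` of `path`, read-only.
 * Returns 0 on success and -1 on failure. */
static int raw_map_open(const char *path, off_t offset, size_t size,
                        raw_map_t *out)
{
    assert(out != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || offset < 0 || size == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Refuse regions that extend past the end of the file; touching
     * such pages would otherwise raise SIGBUS. */
    struct stat st;
    if (fstat(fd, &st) != 0 || (off_t)size > st.st_size - offset) {
        close(fd);
        return -1;
    }

    /* Round the offset down to a page boundary and remember the slack. */
    off_t  aligned = offset - (offset % page);
    size_t slack   = (size_t)(offset - aligned);

    void *base = mmap(NULL, size + slack, PROT_READ, MAP_PRIVATE, fd, aligned);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;

    out->base   = base;
    out->length = size + slack;
    out->data   = (char *)base + slack;
    return 0;
}

/* Unmap a region obtained from raw_map_open(). */
static void raw_map_close(raw_map_t *m)
{
    if (m != NULL && m->base != NULL) {
        munmap(m->base, m->length);
        m->base   = NULL;
        m->data   = NULL;
        m->length = 0;
    }
}
```

After a successful raw_map_open(), m.data can be cast to the dataset's element type and read directly, provided the data is uncompressed and unfiltered and the in-file byte order and layout match the in-memory type.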