Variable-Length Data in HDF5 Sketch RFC Status?


Some of the storage issues around variable-length data in HDF5 have recently come up in our project. From the RFC 2019-07-15 “Variable-Length Data in HDF5 Sketch Design”, it seems this is a known problem with a proposed solution.

Has there been further discussion or progress on that RFC?


There has been no recent progress on this, but it’s something we’re considering for HDF5 2.0.


In HSDS we used a different approach for storing variable-length data: when a variable-length in-memory array is persisted, it is stored as a 4-byte count for the first element, then the element itself, then the count for the next element, and so on (I first came across this scheme using Visual Basic!). When reading from storage the process is reversed.

This has worked fine for HSDS - it might be a good library solution as well but someone with more knowledge of HDF5 library internals would need to look into it.
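
To make the length-prefixed layout described above concrete, here is a minimal Python sketch. It is not the HSDS code: the function names are hypothetical, and it assumes each element is already a bytes object, that the 4-byte count is a little-endian unsigned integer, and that the count is the element's length in bytes.

```python
import struct

COUNT_FORMAT = "<I"  # assumption: 4-byte little-endian unsigned count
COUNT_SIZE = struct.calcsize(COUNT_FORMAT)


def serialize_vlen(elements):
    """Write each element as a 4-byte count followed by its bytes."""
    out = bytearray()
    for item in elements:
        out += struct.pack(COUNT_FORMAT, len(item))
        out += item
    return bytes(out)


def deserialize_vlen(buf):
    """Reverse the process: read a count, then that many bytes, and repeat."""
    elements = []
    offset = 0
    while offset < len(buf):
        (count,) = struct.unpack_from(COUNT_FORMAT, buf, offset)
        offset += COUNT_SIZE
        if offset + count > len(buf):
            raise ValueError("buffer ends in the middle of an element")
        elements.append(bytes(buf[offset:offset + count]))
        offset += count
    return elements


data = [b"a", b"", b"hello"]
packed = serialize_vlen(data)
restored = deserialize_vlen(packed)
```

An empty element costs only its 4-byte count, and no separate heap or offset table is needed, at the price of having to scan the buffer sequentially to find a given element.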


How do you handle modifications?


When a chunk is read into memory, each element becomes a reference to an object on the heap. Modifications can then be made in the normal way. After that, the chunk is re-serialized, with the counts adjusted as needed.
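
As a rough illustration of that read, modify, re-serialize cycle, here is a small Python sketch. It is only a sketch under assumptions, not the HSDS implementation: the helper names are hypothetical, a chunk is modeled as a plain bytes buffer, and the 4-byte count is assumed to be a little-endian unsigned byte length.

```python
import struct


def decode_chunk(buf):
    """Turn a length-prefixed buffer into a list of independent objects."""
    items, offset = [], 0
    while offset < len(buf):
        (count,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(bytes(buf[offset:offset + count]))
        offset += count
    return items


def encode_chunk(items):
    """Re-serialize the list; each count is recomputed from the current item."""
    return b"".join(struct.pack("<I", len(item)) + item for item in items)


def update_element(stored, index, new_value):
    """Read the chunk, replace one element in memory, and write it back."""
    items = decode_chunk(stored)   # elements are now ordinary Python objects
    items[index] = new_value       # modify "in the normal way"
    return encode_chunk(items)     # counts adjust automatically


stored = encode_chunk([b"alpha", b"be", b"gamma"])
updated = update_element(stored, 1, b"a much longer value")
```

Because the whole chunk is rewritten, an element can grow or shrink freely; the trade-off is that changing a single element means re-serializing everything in its chunk.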