What small change would make a big difference in your life w/ HDF5?
I thought it might be useful to collect a few ideas on simplifying our “HDF5 lives” and present and discuss them during HUG 2021. If you’d like to join in and make your case, please reply to this thread!
(We’ll summarize this thread before the event in a markdown document in a GitHub repo. Better ideas?)
Example: I would like to see an HDF5 boolean datatype. Some would argue that it's the simplest possible type, and it's hard to understand why it wasn't part of the early HDF5 datatype canon. More precisely, I would like to see a narrow (bool_t = { T, F }) boolean type and a wide (wbool_t = { T, F, A - missing, I - meaningless }) boolean type.
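For reference, the closest approximation available today is an enum over a small integer type. Here is a minimal sketch of both variants; the member names and numeric codes are just illustrative, not a proposed standard:

```c
#include <stdint.h>
#include "hdf5.h"

/* Sketch: emulating the proposed boolean types as enums over int8.
 * Member names and codes are illustrative, not an agreed standard. */
void make_bool_types(hid_t *narrow, hid_t *wide)
{
    int8_t v;

    /* narrow bool_t = { F, T } */
    *narrow = H5Tenum_create(H5T_NATIVE_INT8);
    v = 0; H5Tenum_insert(*narrow, "FALSE", &v);
    v = 1; H5Tenum_insert(*narrow, "TRUE", &v);

    /* wide wbool_t = { F, T, A (missing), I (meaningless) } */
    *wide = H5Tenum_create(H5T_NATIVE_INT8);
    v = 0; H5Tenum_insert(*wide, "FALSE", &v);
    v = 1; H5Tenum_insert(*wide, "TRUE", &v);
    v = 2; H5Tenum_insert(*wide, "MISSING", &v);
    v = 3; H5Tenum_insert(*wide, "MEANINGLESS", &v);
}
```

The drawback, of course, is that every producer invents its own member names and codes, which is exactly the portability problem a standard boolean type would solve.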
Chime in! I'm sure there's no shortage of pain points.
Myself, I'd be happy to see some changes to the I/O filter interface, namely: (1) access to the file handle from within the filter, (2) access to the dataspace id, (3) access to the selected hyperslab, and (4) a callback to tell the filter when the dataset is closed. (For context, a sketch of what a filter callback sees today follows the use-case list below.)
The use cases I have in mind for those changes are:
Improved performance of filters (e.g., HDF5-UDF) when the application only wants to deal with a subset of the full dataset
To allow filters to retrieve metadata and read other datasets from the file currently being handled
To allow filters to allocate resources and keep them around until the file handle is closed (think of data aggregation for bulk compression and better occupation of hardware-based accelerators, for instance)
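Here is roughly what a filter callback receives today (the filter ID and the body are placeholders). Everything available to it is visible in the signature, which is why none of the four items above is reachable from inside a filter:

```c
#include "hdf5.h"

/* Skeleton of an HDF5 I/O filter callback. It receives only the raw
 * chunk buffer and the cd_values fixed at dataset creation: no file
 * handle, no dataspace id, no selection, and no notification when
 * the dataset is closed. */
static size_t
my_filter(unsigned int flags, size_t cd_nelmts,
          const unsigned int cd_values[], size_t nbytes,
          size_t *buf_size, void **buf)
{
    if (flags & H5Z_FLAG_REVERSE) {
        /* decompress / reconstruct the chunk in *buf */
    } else {
        /* compress / transform the chunk in *buf */
    }
    return nbytes; /* number of valid bytes now in *buf */
}

const H5Z_class2_t MY_FILTER_CLASS = {
    H5Z_CLASS_T_VERS,   /* version of the H5Z_class_t struct      */
    (H5Z_filter_t)256,  /* placeholder id in the testing range    */
    1, 1,               /* encoder/decoder enabled                */
    "example filter",   /* name                                   */
    NULL, NULL,         /* can_apply and set_local callbacks      */
    my_filter           /* the actual filter function             */
};
```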
Oh, I was unaware of the H5DOread_chunk and H5DOwrite_chunk APIs, thanks for the pointer! Some of the use cases mentioned in my third bullet could indeed use those. The disadvantage is that existing HDF5 applications would have to be modified (which is why I like filters so much). I'll take those APIs into consideration next time I touch the relevant code bases.
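For other readers who also hadn't seen them: these functions live in the high-level library (newer HDF5 releases expose the same functionality as H5Dread_chunk / H5Dwrite_chunk in the core library) and move a raw chunk past the filter pipeline. A rough sketch of a direct chunk write, assuming a 2-D chunked dataset and a pre-compressed buffer:

```c
#include "hdf5.h"
#include "hdf5_hl.h"   /* H5DOwrite_chunk / H5DOread_chunk */

/* Sketch: write one pre-compressed chunk directly, bypassing the
 * filter pipeline. dset_id, comp_buf, and comp_size are assumed to
 * exist; the offset must be chunk-aligned, and a 2-D dataset is
 * assumed here. */
herr_t write_raw_chunk(hid_t dset_id, const void *comp_buf, size_t comp_size)
{
    hsize_t  offset[2]   = {0, 0};  /* logical offset of the chunk */
    uint32_t filter_mask = 0;       /* 0 = all filters "applied"   */

    return H5DOwrite_chunk(dset_id, H5P_DEFAULT, filter_mask,
                           offset, comp_size, comp_buf);
}
```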
These extensions as proposed by lucas, particularly #2, would also seem useful for scenarios such as providing dataset/datatype-specific attributes as parameters to a filter, as mentioned in Zstd filter plugin & dictionary training. Using chunked read/write appears counterproductive here: those API functions require the application to do all the chunk and data handling, whereas the filter API is supposed to be largely invisible to the application, beyond passing some parameters to the dataset creation property list to enable a specific filter. On reading in particular, the application does not need to know which filter was used at dataset creation; with chunked reading/writing, the application has to reproduce all of those operations explicitly, and existing applications require modification (which is not needed via the filter API).
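To make the limitation concrete: the only standard per-dataset parameter channel into a filter today is the cd_values array fixed at creation time, sketched below with a placeholder filter ID and values. Anything derived from the data itself, such as a trained dictionary, does not fit through it:

```c
#include "hdf5.h"

/* Sketch: cd_values is today's only per-dataset parameter channel
 * into a filter. It is a fixed list of unsigned ints set on the
 * dataset creation property list, so data-derived parameters
 * (e.g. a trained Zstd dictionary) cannot be passed this way. */
herr_t enable_filter(hid_t dcpl_id)
{
    const unsigned int cd_values[2] = {3, 0};  /* e.g. level, mode */

    return H5Pset_filter(dcpl_id, (H5Z_filter_t)256, /* placeholder id */
                         H5Z_FLAG_MANDATORY, 2, cd_values);
}
```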
One thing that would be tremendously useful for the usability of filters, and easy to implement, is support for including the runtime executable's path (and paths relative to it) in HDF5_PLUGIN_PATH.
By default it's a static path, such as /usr/local/hdf5. So all HDF5 applications use the same filters by default, which may or may not work (some filters have a dependency on the HDF5 library itself, for instance when calling its memory management functions), and not all users have admin rights to install there (the same issue under Linux as under Windows or other OSes). An application can extend HDF5_PLUGIN_PATH with the respective API functions to add runtime-dependent paths relative to the executable's path, but that does not help existing HDF5 applications such as h5ls or h5dump: they cannot benefit from such filters, and data created with filter plugins that are not available in those static paths become unreadable.
It is rather simple to extend the HDF5_PLUGIN_PATH management in the HDF5 library to allow a special symbol, e.g. "@0", to be substituted with the executable's path at runtime. That would solve all of the issues mentioned above.
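Until something like "@0" exists, each application has to do the substitution itself. A rough sketch of that workaround (the executable-path discovery is Linux-specific, and the plugin subdirectory name is made up):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* readlink */
#include <libgen.h>   /* dirname  */
#include "hdf5.h"

/* Sketch: prepend a plugin directory relative to the running
 * executable. Linux-only path discovery; other platforms need their
 * own equivalent. This is exactly the per-application code that a
 * built-in "@0" substitution would make unnecessary. */
herr_t add_exe_relative_plugin_path(void)
{
    char exe[4096];
    ssize_t n = readlink("/proc/self/exe", exe, sizeof(exe) - 1);
    if (n < 0)
        return -1;
    exe[n] = '\0';

    char dir[4096];
    snprintf(dir, sizeof(dir), "%s/hdf5_plugins", dirname(exe));
    return H5PLprepend(dir);  /* search here before the static paths */
}
```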
It would be really useful if there were a way to build the HDF5 library (probably plus the high-level library) from an amalgamated source. SQLite, zstd, and duckdb all have amalgamated versions of their sources, which makes them really easy to vendor. I don't know how easy that would be for HDF5 (my guess is: not easy), but I think a lot of folks would find it useful.
Personally, I'd really like some wrappers around the HDF5 Fortran routines to make reading/writing HDF5 files more like it is in Python, i.e. simple. I know that's pretty non-specific; given a bit of time I could flesh it out, but anyone who has tried I/O to HDF5 files in both Fortran and Python will know what I mean.
As the maintainer of an R / HDF5 interface, I'd love the wbool_t type. In R the logical datatype has values TRUE, FALSE, and NA, so the wide version would be appropriate. We currently map R logicals to H5T_STD_I8LE with an attribute to indicate that the data originated as a logical.
This works in our own ecosystem but is not particularly portable if a file has been written by any other software. It’d be great to have an HDF5 boolean datatype.
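For the curious, that convention looks roughly like this at the C level (the attribute name used here is hypothetical, not necessarily the one our package actually writes):

```c
#include <string.h>
#include <stdint.h>
#include "hdf5.h"

/* Sketch of the described convention: store logicals as 8-bit ints
 * and tag the dataset with a marker attribute. The attribute name
 * "storage.mode" is hypothetical, for illustration only. */
herr_t write_logical(hid_t file_id, const char *name,
                     const int8_t *values, hsize_t n)
{
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate2(file_id, name, H5T_STD_I8LE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT8, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);

    /* marker attribute so readers know this was an R logical */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t atype  = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, strlen("logical") + 1);
    hid_t attr = H5Acreate2(dset, "storage.mode", atype, aspace,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, atype, "logical");

    H5Aclose(attr); H5Tclose(atype); H5Sclose(aspace);
    H5Dclose(dset); H5Sclose(space);
    return 0;
}
```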
h5py records booleans as an enum datatype, based on H5T_NATIVE_INT8. These are strict booleans, i.e. there’s no option to represent missing or meaningless data.
I believe (based on a feature that was requested in h5py) that pytables uses an 8-bit bitfield type for booleans.