HDF5 Distributed discussion

@ajelenak, would this be a good place to have the discussion, or should we start a new thread?

Did we make a recording of this session?

Most of the questions I had were about chunking:

  1. Where should the filter_mask (e.g. as returned from H5Dget_chunk_info) be saved? Should that be in the JSON file, or should it be a binary blob that is part of the chunk? Alternatively, should we enforce a common filter_mask when exporting into a format like this?
  2. Should the chunk keys be indexed by a linear chunk index as in H5Dget_chunk_info, an n-dimensional chunk index as presented during the session, or a logical element offset as in H5Dget_chunk_info_by_coord?
  3. Where should attributes go? Some attributes seem like they could go directly into the JSON file, but the only way to store some values, such as floating-point numbers, with fidelity is to store them as binary. Perhaps a BSON equivalent would be useful here as well?
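To make the three candidate chunk keys in question 2 concrete, here is a minimal Python sketch of how they relate to each other. It assumes chunks are numbered in row-major order over the chunk grid, which is an assumption for illustration only: the order in which H5Dget_chunk_info enumerates chunks is not guaranteed to match. All function names and the "." key separator are hypothetical.

```python
from math import prod

def chunk_grid(shape, chunk_shape):
    """Number of chunks along each dimension (partial edge chunks count as one)."""
    return tuple(-(-s // c) for s, c in zip(shape, chunk_shape))

def linear_to_nd(linear, grid):
    """Row-major linear chunk index -> n-dimensional chunk index (assumed ordering)."""
    if not 0 <= linear < prod(grid):
        raise IndexError("linear chunk index out of range")
    idx = []
    for n in reversed(grid):
        linear, r = divmod(linear, n)
        idx.append(r)
    return tuple(reversed(idx))

def nd_to_linear(nd, grid):
    """n-dimensional chunk index -> row-major linear chunk index."""
    linear = 0
    for i, n in zip(nd, grid):
        linear = linear * n + i
    return linear

def nd_to_offset(nd, chunk_shape):
    """n-dimensional chunk index -> logical element offset of the chunk origin."""
    return tuple(i * c for i, c in zip(nd, chunk_shape))

def offset_to_nd(offset, chunk_shape):
    """Logical element offset (anywhere inside a chunk) -> n-dimensional chunk index."""
    return tuple(o // c for o, c in zip(offset, chunk_shape))

def chunk_key(nd, sep="."):
    """Hypothetical string key for a key-value store, e.g. '2.1'."""
    return sep.join(str(i) for i in nd)

# A 100 x 60 dataset with 30 x 25 chunks has a 4 x 3 chunk grid.
grid = chunk_grid((100, 60), (30, 25))
nd = linear_to_nd(7, grid)
offset = nd_to_offset(nd, (30, 25))
key = chunk_key(nd)
```

Given the dataset shape and chunk shape, all three forms can be derived from one another, so the choice mostly affects how readable the keys are and whether a reader needs the dataset metadata to interpret them.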

Great Call the Doctor session today! Here’s the recording:

Video summary:
During our Call the Doctor session on Tuesday 12/20, Aleksandar Jelenak introduced the idea of HDF5 Distributed, a new HDF5 schema for storage systems with a key-value interface. Highly Scalable Data Service (HSDS) already uses a schema with very similar features, but HDF5 Distributed is aimed at direct access to the storage system. He also shows an example of HDF5 Distributed in an SQLite file.
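As a rough illustration of what a key-value layout inside an SQLite file could look like, here is a minimal Python sketch. It is not the schema presented in the session: the table name `kv`, the key names such as `temperature/.meta`, and the JSON metadata fields are all hypothetical, chosen only to show metadata stored as JSON next to chunks stored as binary blobs.

```python
import json
import sqlite3
import struct

def open_store(path=":memory:"):
    """Open (or create) a single-table key-value store. Table name is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB NOT NULL)"
    )
    return conn

def put(conn, key, value):
    """Store raw bytes under a key, replacing any existing value."""
    conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
    conn.commit()

def get(conn, key):
    """Return the bytes stored under a key, or None if the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return None if row is None else row[0]

def keys_with_prefix(conn, prefix):
    """List keys that start with the given prefix, in sorted order."""
    rows = conn.execute(
        "SELECT key FROM kv WHERE substr(key, 1, ?) = ? ORDER BY key",
        (len(prefix), prefix),
    )
    return [r[0] for r in rows]

conn = open_store()
# Hypothetical metadata document for a 1-D float64 dataset split into two chunks.
meta = {"shape": [4], "chunks": [2], "dtype": "<f8"}
put(conn, "temperature/.meta", json.dumps(meta).encode("utf-8"))
# Chunks are stored as little-endian binary, so float values round-trip exactly.
put(conn, "temperature/0", struct.pack("<2d", 0.1, 0.2))
put(conn, "temperature/1", struct.pack("<2d", 0.3, 0.4))

second_chunk = struct.unpack("<2d", get(conn, "temperature/1"))
stored_keys = keys_with_prefix(conn, "temperature/")
```

Because each value is an opaque blob addressed by a string key, the same layout could in principle be moved to any other key-value store without changing the keys.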

I think the term ‘HDF5 Distributed’ is misleading. It’s really ‘HDF5 Unboxed’ or ‘HDF5 Un[packed,wrapped]’.

We’re starting to see HDF5 have three distinct manifestations:

  1. The HDF5 file specification, which describes the binary format of a particular container on a file system.
  2. The HDF5 application programming interface (API), a collection of C calls to access data. The data may or may not be stored according to the HDF5 file specification.
  3. The HDF5 C library, a specific implementation of the HDF5 API in C.

The HDF5 file specification is quite feature rich, but can be difficult for humans to understand. Exporting this to JSON is useful so that the structure of an HDF5 file can be more easily understood by humans.

The API is useful for programs to access data stored according to the HDF5 file specification, but it could also be used to access data in other containers or file formats. It would be interesting to see if the HDF5 API could be used to access other kinds of files or containers such as Zarr or N5. I mostly think of Zarr and N5 as APIs rather than file formats.

Having the C library around is useful as a reference implementation, but it is also perhaps showing its age in parts. In applications, linking the library can be useful to start interfacing with HDF5 files, but this can also introduce complications. Alternative implementations of the API or subsets of the API would be useful to make HDF5 files more easily accessible.

One point of discussion is https://github.com/JuliaIO/JLD2.jl, which is a library written in Julia that writes and reads a subset of the HDF5 file specification. It does not implement the API, but does implement code to read and write a specific kind of HDF5 file. It does not necessarily read generic HDF5 files.