HSDS 0.9 release target

I’m hoping to have an official release of HSDS 0.9 out early this year, so it would be a good idea to take a stab at finalizing what features to include. There’s already a bunch of stuff in the master branch that’s not in the current 0.8.5 release. On top of that, there are some other things that would be nice to include but are still in the design phase (meaning that including them will likely push out the release a bit).

Here’s a feature list going from those that are already implemented to those that are just an idea at this point:

  • Shape reduction: datasets can be reduced in size - implemented
  • Broadcasting: Numpy-style broadcasting of values over the entire selection - implemented
  • UTF8 fixed size strings: Enable UTF8 with a fixed number of bytes - implemented
  • Quickscan: Get domain wide usage (number of objects, storage used, etc.) on demand - implemented
  • n-bit and scale-offset filters: Enable these filters to be specified (though they don’t actually do anything!) - implemented
  • bitshuffle filter: Support for bitshuffle (similar to byteshuffle, but at the bit level - see: https://github.com/kiyo-masui/bitshuffle) - implemented, but has an open issue
  • Update for array types: enhanced support for array types - implemented, but still some issues to resolve
  • Fieldops: read/write any subset of fields for a compound type dataset - implemented
  • Support for long attribute names/non-utf8 encodable attributes - implemented
  • Multiop attributes: Read or write multiple attributes from (possibly) multiple objects in one request - implemented
  • Support for long link names/non-utf8 encodable link names - WIP
  • Multiop links: Read or write multiple links from (possibly) multiple objects in one request - WIP
  • Hyperchunking: use efficient chunk shape when linking to HDF5 files that have smaller chunks - implemented for 1D datasets, planned for multi-dimensional
  • h5copy/h5move - enable these hdf5 library style operations - not started, but have a design doc here: https://github.com/HDFGroup/hsds/blob/master/docs/design/async_tasks/async_tasks.md
  • Use parquet for variable length chunk storage: will enable better performance for variable length datasets - not started

These changes are all backward compatible at the REST API level (meaning existing clients shouldn’t break), but utilizing new features will require some changes. To this end, I’d like to coordinate new releases for h5pyd and the REST-vol so that these features will be available to Python and C users.

And since this release does involve REST API changes, it would be a great time to update the api documentation, which is years out of date (though still useful): h5serv Developer Documentation — h5serv 0.1 documentation.

Anyway, this is the plan! Reply here if you have questions about a particular feature, or if there’s something else you’d like to see.

For features that are already implemented, you are free to try them out by building the HSDS image from the master branch. The test suite is fairly robust at this point, so the intent is that any feature should work as designed. I’ll be happy to get any feedback on the pre-release code.

BTW, if you are curious why we are still not at a v1.0 of HSDS yet… When the project first got started, the idea was to have a v1.0 once we supported all the major features of the HDF5 library in HSDS. We are still not quite there yet, though getting closer! After 0.9, the two missing items will be support for Opaque types and Region references.


As mentioned above, we are planning to have a new h5pyd release that takes advantage of the new HSDS features. Besides improved h5py compatibility, a major theme will be improving performance through the use of “multi” operations. These enable Python code to better take advantage of the parallelization features of HSDS. I suspect we’ll see dramatic speed-ups for certain use cases.

Anyway, here’s a rather detailed list of planned updates. Is there something not on the list that you’d like to see? Let us know!

Groups

  • link cache: replace objdb cache with a link cache using the GET links with follow links param. Use Limit param with a fairly high value (100K links will be ~20MB).
  • use POST rather than GET (avoid unencodable name issues)
  • support list links by creation time
  • implement copy
  • implement move
  • multi-setlink - create multiple links in one request
  • multi-getlink - get multiple links in one request
  • visititems - fetch all links (if not already in cache) rather than making request per object
  • support object creation methods that create intermediate groups on the fly, e.g. f.create_group("/g1/g1.1/g1.1.1") where g1 and g1.1 are created as side effects
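The intermediate-group item in the list above boils down to walking a POSIX-style path and creating each missing component. A toy sketch of that logic, with nested dicts standing in for HDF groups (this is illustrative, not the actual h5pyd code):

```python
def create_group_path(root, path):
    """Create intermediate groups as a side effect, as in
    f.create_group("/g1/g1.1/g1.1.1") -- nested dicts model groups."""
    node = root
    for name in path.strip("/").split("/"):
        node = node.setdefault(name, {})  # create the group if missing
    return node

f = {}
create_group_path(f, "/g1/g1.1/g1.1.1")
assert "g1.1" in f["g1"] and "g1.1.1" in f["g1"]["g1.1"]
```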

Attributes

  • attr_cache: replace objdb cache with an attribute cache using GET attrs with follow links. Initialize at first attribute access. Use the max_data_size param to only read attribute data for smaller attributes, and the Limit param to cap the total returned.
  • implement attrs.modify method (using PUT AttributeValue)
  • use POST rather than GET for fetching specific attributes (avoid unencodable name issues)
  • multi set method - set/create multiple attributes in one request
  • multi get method - get multiple attributes in one request
  • support list attrs by creation time
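The max_data_size idea in the attr_cache bullet can be sketched as a simple cache policy: keep metadata for every attribute, but inline a value only when it is small enough, deferring large values to a lazy fetch on first access. Names and shapes here are illustrative assumptions, not the actual h5pyd code:

```python
def init_attr_cache(attr_listing, max_data_size=1024):
    """Sketch of the proposed cache policy: always cache attribute
    metadata, but only inline values under max_data_size bytes;
    larger values stay None until fetched on demand."""
    cache = {}
    for name, nbytes, value in attr_listing:
        cache[name] = {
            "size": nbytes,
            "value": value if nbytes <= max_data_size else None,
        }
    return cache

listing = [("units", 1, "K"), ("lookup_table", 4_000_000, "<4MB array>")]
cache = init_attr_cache(listing)
assert cache["units"]["value"] == "K"
assert cache["lookup_table"]["value"] is None  # deferred: too large
```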

Types

  • array_type fixes
  • support fixed_utf8
  • support boolean type - just map to H5T_STD_U8?
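If booleans do end up mapped to H5T_STD_U8 as the last bullet suggests (still an open question above), the round-trip would look something like this in NumPy terms:

```python
import numpy as np

# One possible mapping, per the note above: store booleans as
# unsigned 8-bit ints (H5T_STD_U8), cast back to bool on read.
arr = np.array([True, False, True])
stored = arr.astype(np.uint8)       # what would go over the wire
restored = stored.astype(np.bool_)  # what the client hands back

assert stored.tolist() == [1, 0, 1]
assert restored.tolist() == [True, False, True]
```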

Datasets

  • reflect h5py updates to ChunkIterator
  • verboseInfo - force server-side update if not available
  • resize - verify reduce extent works
  • implement readmulti - read multiple datasets
  • implement writemulti - write multiple datasets
  • pointselection - use binary rather than json for request
  • fieldSelection - utilize field selection param in HSDS
  • broadcasting - utilize the broadcast query param rather than broadcasting on the client side
  • read_direct - support broadcasting
  • write_direct - support broadcasting
  • implement SWMR refresh method - do a GET for updated dataset metadata
  • implement SWMR flush method - make flush request to HSDS
  • fix POST for point selection to not clear the request cache (see TBD httpconn.py, line 587)
  • update astype - I think this has changed in current h5py, need to verify
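The readmulti item above is the batching idea in miniature: apply one selection to several datasets and get all the results back together. A minimal sketch, with plain lists standing in for datasets (a real h5pyd implementation would issue a single HSDS request instead of one round-trip per dataset):

```python
def read_multi(datasets, selection):
    """Sketch of the proposed readmulti: one call returns the chosen
    selection from every dataset in the mapping."""
    return {name: data[selection] for name, data in datasets.items()}

dsets = {"a": [0, 1, 2, 3], "b": [10, 20, 30, 40]}
out = read_multi(dsets, slice(1, 3))
assert out == {"a": [1, 2], "b": [20, 30]}
```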

Other

  • Update setup to use toml
  • Create github actions for testing
  • documentation - readthedocs, much of the content can be borrowed from the h5py docs!