I’m hoping to have an official release of HSDS 0.9 out early this year, so it would be a good idea to take a stab at finalizing what features to include. There’s already a fair amount in the master branch that isn’t in the current 0.8.5 release. On top of that, there are some other things that would be nice to include but are still in the design phase (meaning that including them will likely push out the release a bit).
Here’s a feature list going from those that are already implemented to those that are just an idea at this point:
- Shape reduction: datasets can be reduced in size - implemented
- Broadcasting: Numpy-style broadcasting of values over the entire selection - implemented
- UTF8 fixed size strings: Enable UTF8 with a fixed number of bytes - implemented
- Quickscan: Get domain wide usage (number of objects, storage used, etc.) on demand - implemented
- n-bit and scale-offset filters: Enable these filters to be specified (though they don’t actually do anything!) - implemented
- bitshuffle filter: Support for bitshuffle (similar to byteshuffle, but at the bit level - see https://github.com/kiyo-masui/bitshuffle) - implemented, but has an open issue
- Update for array types: enhanced support for array types - implemented, but still some issues to resolve
- Fieldops: read/write any subset of fields for a compound type dataset - implemented
- Support for long attribute names/non-utf8 encodable attributes - implemented
- Multiop attributes: Read or write multiple attributes from (possibly) multiple objects in one request - implemented
- Support for long link names/non-utf8 encodable link names - WIP
- Multiop links: Read or write multiple links from (possibly) multiple objects in one request - WIP
- Hyperchunking: use efficient chunk shape when linking to HDF5 files that have smaller chunks - implemented for 1D datasets, planned for multi-dimensional
- h5copy/h5move - enable these hdf5 library style operations - not started, but have a design doc here: https://github.com/HDFGroup/hsds/blob/master/docs/design/async_tasks/async_tasks.md
- Use parquet for variable-length chunk storage: will enable better performance for variable-length datasets - not started
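Since the broadcasting feature follows NumPy semantics, a plain NumPy sketch (no HSDS server required) shows what writing a scalar or a lower-rank array over a whole selection will look like; with h5pyd, the same slicing syntax would apply to a server-backed dataset rather than an in-memory array:

```python
import numpy as np

# Stand-in for a dataset selection: a 4x3 int32 array
data = np.zeros((4, 3), dtype=np.int32)

# Broadcast a single scalar over the entire selection
data[...] = 7
assert (data == 7).all()

# Broadcast a 1D row across every row of the 2D selection
data[...] = np.array([1, 2, 3], dtype=np.int32)
assert (data[2] == [1, 2, 3]).all()
```

The point of the feature is that the server expands the value, so the client no longer has to ship a full-sized array for a constant or repeated fill.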
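Similarly, the fieldops feature mirrors what NumPy structured arrays already allow; here is a minimal NumPy sketch of reading and writing a subset of fields of a compound type (the field names are made up for illustration):

```python
import numpy as np

# A compound-style dtype with three fields
dt = np.dtype([("time", "f8"), ("temp", "f4"), ("flag", "i1")])
recs = np.zeros(5, dtype=dt)

# Write just one field, leaving the other fields untouched
recs["temp"] = np.arange(5, dtype=np.float32)

# Read back only a subset of the fields
subset = recs[["time", "temp"]]
assert subset.dtype.names == ("time", "temp")
assert recs["temp"][3] == 3.0
```

With fieldops, the analogous per-field read or write happens server-side, so only the requested fields travel over the wire instead of the full compound records.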
These changes are all backward compatible at the REST API level (meaning existing clients shouldn’t break), but utilizing new features will require some changes. To this end, I’d like to coordinate new releases for h5pyd and the REST-vol so that these features will be available to Python and C users.
And since this release does involve REST API changes, it would be a great time to update the API documentation, which is years out of date (though still useful): the h5serv Developer Documentation.
Anyway, this is the plan! Reply here if you have questions about a particular feature, or if you have something else you’d like to see.
For features that are already implemented, you are free to try them out by building the HSDS image from the master branch. The test suite is fairly robust at this point, so the intent is that any feature should be working as designed. Will be happy to get any feedback on the pre-release code.
BTW, if you are curious why we are still not at a v1.0 of HSDS yet… When the project first got started, the idea was to have a v1.0 once we supported all the major features of the HDF5 library in HSDS. We are still not quite there yet, though getting closer! After 0.9, the two missing items will be support for Opaque types and Region references.