HSDS in production use

harri.hytonen · August 3, 2020, 5:55pm

Hi,

I work at a company that uses NetCDF4 and HDF5 files to export datasets from our LIDAR- and RADAR-type devices and data processing systems. I’m currently evaluating whether HSDS can provide a new and smarter API for datasets than files do. Based on the HSDS documentation and a brief hands-on evaluation, HSDS looks very interesting. I have envisioned three use cases for us:

Device-to-cloud: A device buffers measurement data locally and creates slices of datasets. The slices are then sent to a data processing system that runs in the cloud and provides an HSDS REST API for the device.
In-cloud: The cloud-based data processing system contains microservices that each read slices of a dataset, process the data, and then write slices back using the HSDS REST API.
Cloud-to-application: A visualization application in the browser or a backend uses the HSDS REST API to read slices based on various criteria, e.g. bounding box, time window, customer ID, etc.

Q1: Are these very high-level use cases practical with HSDS?
Q2: Another concern is the current maturity level of HSDS. The latest version is 0.6-beta, so maybe it is too early to use it for production-level systems/APIs that our customers are paying for. How do you feel about that?
Q3: Is there an HSDS release roadmap that would show what can be expected from HSDS development in the coming years?

BR,
Harri

jreadey · August 4, 2020, 8:01pm

Hi Harri,

Thanks for looking into HSDS! Your use cases sound quite interesting.

In response to your questions:

Q1: For the device-to-cloud scenario, you’ll need some way to make sure that the devices don’t overwrite each other’s data by writing to the same slice (e.g. you could partition the dataset in some fashion and have each device write only to its own partition). For 1-d datasets, there’s an h5pyd feature that lets the client extend the dataset and write to the extended region atomically.
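
To illustrate the partitioning idea, here is a minimal sketch, assuming each device knows its own index and the total number of devices. The helper names (device_partition, write_partition), the domain path, and the dataset name are hypothetical; only h5pyd.File and h5py-style slice assignment are taken from h5pyd itself.

```python
def device_partition(device_index, n_devices, n_rows):
    """Return the slice of rows reserved for one device.

    Rows are split into n_devices contiguous, non-overlapping ranges;
    the first (n_rows % n_devices) devices get one extra row.
    """
    if not 0 <= device_index < n_devices:
        raise ValueError("device_index out of range")
    base, extra = divmod(n_rows, n_devices)
    start = device_index * base + min(device_index, extra)
    stop = start + base + (1 if device_index < extra else 0)
    return slice(start, stop)


def write_partition(domain, dset_name, device_index, n_devices, data):
    """Write data into this device's own rows (needs a running HSDS).

    domain and dset_name are placeholders, e.g. "/home/harri/lidar.h5"
    and "measurements".
    """
    import h5pyd  # imported lazily so device_partition works without it

    with h5pyd.File(domain, "a") as f:
        dset = f[dset_name]
        part = device_partition(device_index, n_devices, dset.shape[0])
        # data must have exactly part.stop - part.start rows
        dset[part] = data
```

Because the ranges never overlap, two devices cannot write to the same rows, whatever order their requests arrive in.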

For in-cloud, yes, that’s very practical. I’d recommend using Kubernetes to manage the microservices. You can run HSDS in the same Kubernetes cluster, so that all traffic is pod-to-pod. You’ll want to scale the number of HSDS pods based on the workload.
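
As a rough sketch of the read-process-write loop such a microservice might run: the function below works on anything that supports len() and slicing along the first axis, which includes h5pyd datasets as well as plain lists. The name process_in_blocks and the block size are assumptions made for illustration.

```python
def process_in_blocks(src, dst, block_rows, fn):
    """Read src in row blocks, apply fn, and write the result to dst.

    src and dst can be h5pyd datasets or any sequence that supports
    slicing; fn must return as many rows as it receives.
    Returns the number of rows processed.
    """
    n_rows = len(src)
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        # One read request and one write request per block
        dst[start:stop] = fn(src[start:stop])
    return n_rows


# Hypothetical use against HSDS (domain and dataset names are placeholders):
#
#   import h5pyd
#   with h5pyd.File("/home/harri/lidar.h5", "a") as f:
#       process_in_blocks(f["raw"], f["calibrated"], 1000, lambda b: b * 0.5)
```

Larger blocks mean fewer requests per dataset, while smaller blocks keep each service's memory use low; the right size depends on the dataset's chunk layout.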

For cloud-to-application: Yes. One nice thing is that you likely won’t need a server backend for the web application. The app can just be static HTML that loads data dynamically from HSDS. This blog post: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/ talks about a web app that NREL created.
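
To give an idea of what such a request looks like, here is a small sketch that builds the URL for reading a hyperslab, assuming the HSDS REST API's dataset value endpoint with its select and domain query parameters (check the HSDS REST API documentation for the exact form). The endpoint, domain, and dataset id are placeholders, and build_value_url is a hypothetical helper. It is written in Python for consistency with h5pyd; a browser app would build the same URL in JavaScript.

```python
from urllib.parse import urlencode


def build_value_url(endpoint, domain, dset_id, ranges):
    """Build a GET URL that reads a rectangular selection of a dataset.

    ranges is a list of (start, stop) index pairs, one per dimension,
    e.g. the rows for a time window and the columns for a bounding box.
    """
    select = "[" + ",".join(f"{start}:{stop}" for start, stop in ranges) + "]"
    query = urlencode({"domain": domain, "select": select})
    return f"{endpoint.rstrip('/')}/datasets/{dset_id}/value?{query}"


# Placeholder values; a real dataset id comes from the HSDS server.
url = build_value_url(
    "http://hsds.example.com", "/home/harri/lidar.h5", "d-1234",
    [(0, 10), (5, 20)],
)
print(url)
```

Mapping criteria such as a time window or a bounding box to index ranges is left to the application, e.g. by first reading the coordinate datasets.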

Q2: Granted, HSDS is a relative newbie compared with the HDF5 library. Still, we have had customers using HSDS in production for a number of years with good success. The HDF Group offers support contracts if you would like to have us consult on the design of your application and be available to resolve any issues that come up.

Q3: Sorry, we don’t have a release roadmap yet. In general much of the work we do is guided by customers who support us to develop specific features, so it’s hard to make predictions for what will be coming years from now.

Currently we are wrapping up the 0.6 release which has a bunch of new features: Azure support, AWS Lambda, RBAC. See https://github.com/HDFGroup/hsds/issues/47 for the complete list.

Anyone is welcome to send in requests for items they would like to see in 0.7. One item I’d like to see is better support for C/C++ clients through enhancements to the REST VOL (https://github.com/HDFGroup/vol-rest), along with filling in the gaps for HDF5 features that are not yet available in HSDS (e.g. opaque data types).