Using h5netcdf and xarray to access data on HSDS

John Readey will host Call the Doctor on Tuesday, September 30. He’ll talk about using h5netcdf and xarray to access data on HSDS. John is the principal architect of HSDS, so he’s great for any questions on using that service, but can also handle your general HDF5 questions.

To join, just jump on Zoom:
Launch Meeting - Zoom
September 30, 2025, 12:20 p.m. Central Time (US/Canada)

Does this help with “GEDI datasets … distributed as complex HDF5 granules, which pose significant challenges for efficient, large-scale data processing and analysis”?

This is the notebook I’m planning to walk through today: h5netcdf and xarray sample using HSDS · GitHub

From my quick scan of the paper, it looks like GEDI is focused on the problem of dealing with a large number of files (granules) that are connected in some spatio-temporal way. This is a common situation for, e.g., NASA, where a satellite may generate one file per orbit, but the person doing the analysis may be more interested in looking at data across hundreds or thousands of orbits. It’s a lot of data wrangling and code hacking to work through all of this by hand.

How can HSDS help? If not, what’s missing? Virtualization?

That’s an interesting topic. I’d ask the question: What could HDF5 do (in general) to help? Among the answers to that, there might be ones that are easier to implement in HSDS.

One example (and something that works now with HSDS): one of the motivations for keeping file sizes small is that smaller files are more convenient to download and easier to manage on POSIX file systems. With HSDS there’s no need to download anything, and arbitrarily large domains can be stored in S3 (since an HSDS domain is stored as many small objects). NREL has gone with this approach, with domain sizes of up to 30 TB. This makes life easier for users, as all the logically related data is right there in the same file (domain).
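
As a rough illustration of the "no download" point, here is a minimal sketch of reading just a slice of a dataset from an HSDS domain with h5pyd. The helper name `read_selection`, its `opener` parameter, and the domain path and dataset name in the comments are all made up for illustration; `h5pyd.File` is the only real client call, and the sketch assumes an HSDS endpoint and credentials are already configured (for example with `hsconfigure`).

```python
def read_selection(domain, dataset, selection, opener=None, **open_kwargs):
    """Read only `selection` of `dataset` from an HSDS domain.

    `opener` is a hypothetical hook (not part of h5pyd) that defaults to
    h5pyd.File; it exists so the sketch can be tried without a server.
    """
    if opener is None:
        # Imported lazily so this module loads even without h5pyd installed.
        import h5pyd
        opener = h5pyd.File

    # Opening the domain only contacts the service; no data is copied locally.
    with opener(domain, "r", **open_kwargs) as f:
        # Indexing the dataset requests just the selected elements.
        return f[dataset][selection]


# Hypothetical usage (domain path and dataset name are placeholders):
#   window = read_selection("/home/myuser/big_domain.h5", "temperature",
#                           (slice(0, 24), 100, 200))
```

The idea is that the size of the domain does not matter much to the caller: only the requested selection travels over the network, whether the domain is a few MB or tens of TB.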

Here’s the recording from John’s session today:

The notebook he used in the session can be found online at https://gist.github.com/jreadey/15519d69b4905ac880a480a019cfcc9f