Read/write specific coordinates in a multi-dimensional dataset?

Does anyone have a use case where it would be useful to read or write data at a list of individual coordinates in a multi-dimensional dataset?

h5py has the low-level machinery to support this, but currently the only documented way to use it is to select data with a boolean mask array. If your dataset is large, making a boolean array of the same shape in memory may be a problem. I made a PR some months ago to add a nicer API, but I don’t have a need for it myself, so I haven’t followed it up.
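For reference, the documented mask approach looks roughly like this (file and dataset names are placeholders; the point is the mask's memory footprint):

```python
import numpy as np
import h5py

with h5py.File("data.h5", "r") as f:
    dset = f["my_dataset"]                # e.g. shape (10000, 10000)

    # The mask must have the same shape as the dataset, so for a large
    # dataset this array alone can be a problem to hold in memory.
    mask = np.zeros(dset.shape, dtype=bool)
    for r, c in [(3, 17), (250, 4096), (9999, 0)]:   # the points we want
        mask[r, c] = True

    values = dset[mask]   # 1-D array of the selected points
```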

So if someone would like this, please speak up, and please comment on whether the high-level API I proposed for it would work for your use case. If we don’t hear from anyone, it will likely be closed in a few more months as not useful.

Other things that are already possible with h5py (a short sketch follows the list):

  • Reading & writing points in a 1D dataset - this works like NumPy ‘fancy indexing’: index a dataset with an array/list of numbers.
  • Reading & writing points along a line in a multidimensional dataset - you can use ‘fancy indexing’ with one dimension, and regular indexing with the others.
  • Reading & writing arbitrary points one at a time - write a loop over your coordinates. Of course, this will be comparatively slow.
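For example, something along these lines (illustrative only, the file and dataset names are made up):

```python
import h5py

with h5py.File("example.h5", "r+") as f:
    d1 = f["vector"]    # 1-D dataset
    d2 = f["grid"]      # 2-D dataset

    # 1) Points in a 1-D dataset via fancy indexing
    #    (h5py wants the indices in increasing order).
    vals = d1[[2, 10, 57]]
    d1[[2, 10, 57]] = [1.0, 2.0, 3.0]

    # 2) Points along a line in a multidimensional dataset:
    #    fancy indexing on one axis, ordinary indexing on the rest.
    row_vals = d2[5, [3, 40, 912]]

    # 3) Arbitrary scattered points, one at a time (simple but slow).
    point_vals = [d2[r, c] for r, c in [(0, 7), (12, 500), (999, 3)]]
```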

Hi Thomas,

I’m pretty sure I have the use case you’re referring to.

I have multiple files with multi-dimensional datasets that I want to combine/merge into a single file/dataset. These input files/datasets may not be organised nicely enough to simply append one dataset onto another when merging; e.g. file 1 may have data near the beginning and the middle of the overall combined coordinates.
I tried the existing h5py high-level API, but it would either throw an error (only one dimension can use an ndarray for slicing) or be incredibly slow because of how sparse the read/write operations are.

The fastest and most flexible method I came up with was to create the new overall dataset covering the combined coordinates of all inputs, with a specific chunking. I then iterate through each chunk, create a NumPy array of exactly the chunk’s shape, read the relevant data from each input file into that array, and once the chunk is completely filled, write it to the final dataset with h5py’s simple slicing.

It seems to work pretty well, but it requires a lot of code and preprocessing of the input files/datasets to work out exactly which parts of each input should go into which final chunk.
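In rough outline the loop looks something like this (names and the fill_chunk_from_inputs helper are placeholders for the preprocessing step, and it assumes an h5py version with Dataset.iter_chunks()):

```python
import numpy as np
import h5py

def fill_chunk_from_inputs(block, chunk_slices, input_datasets):
    """Placeholder for the preprocessing/mapping step: copy the pieces of
    each input dataset that fall inside this chunk into `block`."""
    raise NotImplementedError

with h5py.File("merged.h5", "w") as out:
    dset = out.create_dataset("data", shape=(4000, 4000),
                              dtype="f8", chunks=(500, 500))

    inputs = [h5py.File(name, "r")["data"] for name in ("part1.h5", "part2.h5")]

    # Walk the output dataset one chunk at a time.
    for chunk_slices in dset.iter_chunks():
        shape = tuple(s.stop - s.start for s in chunk_slices)
        block = np.zeros(shape, dtype=dset.dtype)      # exactly one chunk in RAM
        fill_chunk_from_inputs(block, chunk_slices, inputs)
        dset[chunk_slices] = block                     # plain slice write per chunk
```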

If you have any other suggestions, I’d love to hear them!

At least in my field (hydrology), with tons of sparse measurement data and numerical model output, this is a really fundamental use case. I don’t understand why I couldn’t find more example code for it. Back in the day I used xarray, but there all of these operations happen in RAM…

Thanks!

Hi Thomas, I expect it will be useful for someone. Unfortunately it’s not always the case that people who need it will be vocal.
HSDS supports point selection at the REST API level. I haven’t added anything for point selection to h5pyd yet, but if you implement the RFC, I can use the same signature there.