NumPy-style fancy indexing of datasets

Hi,
I have a case where I need to access many different scattered indices in a large dataset. I was hoping to use NumPy-style indexing to do it in one request, but there only seems to be support for a single index list per request.
I saw there were already plans to add this to the h5py library, but the cases where it would yield performance improvements on a local .h5 file seemed to be quite rare.
I wanted to ask whether there are still plans to implement it. I would be very interested in this feature for h5pyd, as I think the performance gains would be quite significant in my use case.

My workaround would be to download chunks on the client and then finalize the indexing there, which would significantly increase traffic and code complexity. I'd appreciate other ideas :slight_smile:
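
For context, a rough sketch of one way that workaround could look (the domain path, dataset name, and indices here are just placeholders): fetch a bounding hyperslab covering the scattered points in one request, then finish the indexing locally with NumPy.

import h5pyd as h5py
import numpy as np

# hypothetical domain and dataset names, just for illustration
with h5py.File('/home/test_user1/data.h5', 'r') as f:
    dset = f['dset']

    # scattered (row, col) points to read
    rows = np.array([3, 17, 42, 90])
    cols = np.array([5, 12, 60, 88])

    # one request for the bounding hyperslab that covers all points ...
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    block = dset[r0:r1, c0:c1]

    # ... then finish the fancy indexing locally with numpy
    values = block[rows - r0, cols - c0]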

I saw that the API itself supports multi-indexing: Point selection
I assume writing custom code that generates requests is significant work.

Best regards
Leonard

There are no immediate plans to implement fancy indexing, but the recent MultiManager feature may be applicable for your use case. It is primarily intended for reading selections from multiple datasets, but can also be used to read distinct regions from the same dataset in parallel:

import h5pyd as h5py
import numpy as np

# create a test domain and dataset to read from
file = h5py.File('/home/test_user1/data.h5', 'w')

dset = file.create_dataset('dset', shape=(100, 100))
dset[:] = np.arange(10000).reshape(100, 100)

# two distinct regions of the same dataset
indices = [(np.s_[20:40], np.s_[20:40]), (np.s_[40:60], np.s_[40:60])]

# the MultiManager reads both selections in parallel
mm = h5py.MultiManager([dset, dset])
data = mm[indices]

np.testing.assert_array_equal(data[0], dset[20:40, 20:40])
np.testing.assert_array_equal(data[1], dset[40:60, 40:60])

file.close()

Note that this feature does require building h5pyd from source, as it has not yet been published in a release version.

Hi Leo,
Can you provide an example of how you would use NumPy-style indexing? Just want to make sure we are on the same page.

Hey, thank you @mlarson for the suggestion. John already showed me the MultiManager in another post, and it has come in quite handy for me so far. I would still prefer not to have to use it for this case, as it does not scale that well when sending a lot of requests.

@jreadey I wrote you an example of what I mean:

import h5py # as h5pyd
import numpy as np


# example with plain numpy: passing two index lists
arr = np.arange(2*3*4).reshape(2,3,4)
print(arr[:,[0,1],[1,2]])


file = h5py.File('example_data.h5', 'w')

dset = file.create_dataset('dset', shape=(100, 100))
dset[:] = np.arange(10000).reshape(100, 100)

# passing one index list is supported by h5py and h5pyd
print(dset[[0,1,2,3],0])
print(dset[[0,1,2,3],:3])

# passing two index lists is not supported
print(dset[[0,1,2],[0,1,2]])

file.close()

My problem requires the use case with the two one-dimensional index lists.

Another solution I came up with was to flatten my data in the two dimensions that I need to access often with single indices.
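
A rough sketch of that flattening idea (the names and shapes here are hypothetical): the two hot dimensions become one axis, so a single index list, which is already supported, is enough.

import h5pyd as h5py
import numpy as np

n_rows, n_cols = 100, 100

# the two frequently-indexed dimensions are stored as one flattened axis
with h5py.File('/home/test_user1/flat_data.h5', 'w') as f:
    dset_flat = f.create_dataset('dset_flat', shape=(n_rows * n_cols,))
    dset_flat[:] = np.arange(n_rows * n_cols)

    rows = np.array([0, 1, 2])
    cols = np.array([10, 20, 30])

    # map (row, col) pairs to positions on the flattened axis; a single
    # index list (kept in increasing order) is all that is needed
    flat_idx = np.ravel_multi_index((rows, cols), (n_rows, n_cols))
    values = dset_flat[flat_idx]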

Ok, I think I have this working now. See: h5pyd/examples/notebooks/fancy_selection.ipynb at master · HDFGroup/h5pyd · GitHub

I didn’t realize before that this feature would need some HSDS modifications. I’ve made those, so to use it, you’ll need to get the latest HSDS from master as well as build h5pyd from the master branch.

Please try it out and let us know how it works for you!

Awesome!
I will try to set it up soon.
I already made an implementation where I index one axis and download a whole range on the 2nd axis, which I then index locally with numpy. I'm interested to see how much performance I gain.

Let us know, I’m curious too.

Is the current state of the master branch deployable as an AKS Kubernetes cluster? When I build the nodes from source (and deploy from my ACR), they fail to start. I was only able to deploy from the default hsds Docker repository.

These are the errors I get when calling kubectl describe pods hsds-xxx:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  29s                default-scheduler  Successfully assigned default/hsds-c5d54db97-tm6tt to aks-nodepool1-64018474-vmss000000
  Normal   Pulled     15s (x3 over 29s)  kubelet            Container image "pcrhsdsregistry.azurecr.io/hsds:v1" already present on machine
  Normal   Created    15s (x3 over 29s)  kubelet            Created container sn
  Normal   Started    15s (x3 over 28s)  kubelet            Started container sn
  Normal   Pulled     15s (x3 over 28s)  kubelet            Container image "pcrhsdsregistry.azurecr.io/hsds:v1" already present on machine
  Normal   Created    15s (x3 over 28s)  kubelet            Created container dn
  Normal   Started    15s (x3 over 28s)  kubelet            Started container dn
  Warning  BackOff    1s (x3 over 27s)   kubelet            Back-off restarting failed container sn in pod hsds-c5d54db97-tm6tt_default(d9ddd0b7-0edc-4403-a3ce-66e08cdda13b)
  Warning  BackOff    1s (x3 over 27s)   kubelet            Back-off restarting failed container dn in pod hsds-c5d54db97-tm6tt_default(d9ddd0b7-0edc-4403-a3ce-66e08cdda13b)

Again, there's a good chance I made a mistake in the build or deploy process.

Also, in the AKS install document, the paths to some of the .yml and .sh files are missing the directories they are in. I assume the files got moved at some point and the document didn't get updated. :slight_smile:

Can you take a look at the pod log files? If the problem is due to some misconfiguration, it’s usually clear from the log messages.

Thanks for pointing out the doc problem. I’ll do an update for this soon.

The logs led me in the right direction. I am working on Apple silicon, and Docker builds for the local architecture by default. I needed to specify in the build.sh file to build for linux/amd64:
docker build --platform linux/amd64 -t hdfgroup/hsds .

Ok, great. I guess you picked up this PR from last week: Set platform in docker-compose so it works on other platforms than amd64 by rho-novatron · Pull Request #389 · HDFGroup/hsds · GitHub?

Unless you are making custom code changes, Docker Hub has the latest images based on the master branch: https://hub.docker.com/repository/docker/hdfgroup/hsds/general

No, I missed that PR, but it's good that it is set up to work by default!

I did some testing with the fancy indexing. I ran into some issues when accessing random entries.

The test code from your script worked fine:

dset[:, [1,10,100],[10,100,100]]

But this access does not work for me:

dset[:, [1,10,100],[10,100,500]]

It takes a long time to finish, and then I get a "no data" error.

Surprisingly, the access works for me when the index is not too far from the diagonal:

dset[:, [1,10,100],[10,100,150]]

It also seems to work fine with larger requests, as long as they are diagonal:


indices = np.arange(10,800)
print(indices.shape)
print(dset.shape)

offset = 0
indices2 = np.arange(10+offset,800+offset)
dset[:, indices,indices2].shape

But setting offset = 1 already causes it to fail for me.

There seem to be issues accessing non-diagonal entries. It would be nice if you could verify this. I am running HSDS on a minimal Kubernetes cluster of one pod with one SN and one DN.
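
Roughly the kind of repro I'm running (the shape and domain path here are just stand-ins; only the selection pattern matters):

import h5pyd as h5py
import numpy as np

with h5py.File('/home/test_user1/fancy_select.h5', 'w') as f:
    dset = f.create_dataset('dset', shape=(5, 1000, 1000), dtype='f4')
    # write a small region so there is something to read back
    dset[:, :200, :200] = np.ones((5, 200, 200), dtype='f4')

    # near-diagonal index lists work for me
    print(dset[:, [1, 10, 100], [10, 100, 150]])

    # far off the diagonal: hangs for a long time, then fails with a "no data" error
    print(dset[:, [1, 10, 100], [10, 100, 500]])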

Edit: Added the logs from a failed request

REQ> GET: /datasets/d-0bf390da-4101df4d-fe38-e0b360-6cda57/value [pcrcontainer/fancy_select.h5]
INFO> getObjectJson d-0bf390da-4101df4d-fe38-e0b360-6cda57
INFO> validateAction(domain=pcrcontainer/fancy_select.h5, obj_id=d-0bf390da-4101df4d-fe38-e0b360-6cda57, username=admin, action=read)
INFO> getDomainJson(pcrcontainer/fancy_select.h5, reload=False)
INFO> aclCheck: read for user: admin
INFO> streaming response data for page: 1 of 1, selection: (slice(0, 5, 1), [1, 10, 100], [10, 100, 500])
INFO> doReadSelection - number of chunk_ids: 4
INFO> ChunkCrawler.__init__  4 chunks, action=read_chunk_hyperslab
INFO> ChunkCrawler - client_pool count: 6
INFO> ChunkCrawler max_tasks 4 = await queue.join - count: 4
INFO> ChunkCrawler - work method for task: cc_task_0
INFO> ChunkCrawler - client_name: cc_task_0.2
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> ChunkCrawler - work method for task: cc_task_1
INFO> ChunkCrawler - client_name: cc_task_1.3
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> ChunkCrawler - work method for task: cc_task_2
INFO> ChunkCrawler - client_name: cc_task_2.2
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> ChunkCrawler - work method for task: cc_task_3
INFO> ChunkCrawler - client_name: cc_task_3.2
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
REQ> GET: /info
 RSP> <200> (OK): /info
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_1): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (1,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_1_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
INFO> read_chunk_hyperslab, chunk_id: c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0, bucket: pcrcontainer
WARN> shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
ERROR> HTTPBadRequest for read_chunk_hyperslab(c-0bf390da-4101df4d-fe38-e0b360-6cda57_0_0_0): shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)

Yes, I can confirm that there’s a problem. I guess diagonal entries was a bad choice for a test!

I’ll look into it. Hopefully it will not be too hard to fix.

I have a fix checked into master HSDS now. No h5pyd changes needed. Please build HSDS from master and let me know how it goes.


I did some benchmarking on the data the way it will be deployed in production. For each coordinate point, I read from around 10 datasets of varying sizes. Sizes range from 720x360 to 102400x51200, with an additional dimension of 10-400 entries depending on the dataset.

The original code would index the first dimension and read a range of the 2nd dimension; the length of that range was chosen by a chunking condition so the requests don't get too large. The 2nd dimension would then be indexed locally.

The new code just sends two arrays of indices, for the 1st and 2nd dimensions, in a single request.
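
Roughly, the two access patterns look like this (a simplified 2-D sketch; the real datasets are 3-D and the chunk-splitting logic is omitted, so the names here are only illustrative):

import numpy as np

# row/column indices of the points to read (kept in increasing order)
rows = np.array([5, 120, 7000])
cols = np.array([42, 310, 9999])

def read_old(dset, rows, cols):
    # index the 1st dimension, download a whole range of the 2nd dimension
    # (the real code splits this range by a chunking condition), then
    # finish the 2nd-dimension indexing locally with numpy
    c0, c1 = cols.min(), cols.max() + 1
    block = dset[rows.tolist(), c0:c1]
    return block[np.arange(len(rows)), cols - c0]

def read_new(dset, rows, cols):
    # single request with two index lists (the new fancy-indexing path)
    return dset[rows.tolist(), cols.tolist()]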

Performance table for random reads:

| num of points | old code [points/s] | new code [points/s] | batch size [points] |
|--------------:|--------------------:|--------------------:|--------------------:|
| 10            | 0.63                | 2.01                | ~10                 |
| 1000          | 5.41                | 22.4                | ~70                 |
| 10000         | 13.7                | 64.8                | ~240                |

So about a 4-5x performance increase for larger datasets, even without any further optimization such as tuning the HSDS chunk and request sizes. I am sure there is quite a bit more performance to be squeezed out.

I am very happy with the results and appreciate that you implemented the feature!


Awesome! Looks like you are getting better performance while the client code is simplified.
