NumPy-style fancy indexing of datasets

Hi,
I have a case where I need to access many different scattered indices in a large dataset. I was hoping to use NumPy-style fancy indexing to do it in one request, but there seems to be support for only one index list per request.
I saw there were already plans to implement it for the h5py library, but the cases where it would yield performance improvements on a local .h5 file seemed to be quite rare.
I wanted to ask if there are still plans to implement it. I would be very interested in this feature for h5pyd, as I think the performance gains would be quite significant in my use case.

My workaround would be to download whole chunks on the client and then finalize the indexing there, which would significantly increase traffic and code complexity. I'd appreciate other ideas :slight_smile:
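A rough sketch of that workaround, assuming hypothetical index lists and file path (one bounding hyperslab is fetched in a single request, then the fancy indexing is finished locally with NumPy):

import h5pyd
import numpy as np

# hypothetical scattered point coordinates
rows = np.array([3, 17, 42, 88])
cols = np.array([5, 63, 9, 71])

with h5pyd.File('/home/test_user1/data.h5', 'r') as f:
    dset = f['dset']
    # one request for a hyperslab covering all requested points...
    block = dset[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    # ...then apply the two index lists locally
    values = block[rows - rows.min(), cols - cols.min()]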

I saw that the API itself supports multi-indexing: Point selection
I assume writing custom code that generates those requests would be significant work.

Best regards
Leonard

There are no immediate plans to implement fancy indexing, but the recent MultiManager feature may be applicable to your use case. It is primarily intended for reading selections from multiple datasets, but it can also be used to read distinct regions from the same dataset in parallel:

import h5pyd as h5py
import numpy as np

file = h5py.File('/home/test_user1/data.h5', 'w')

dset = file.create_dataset('dset', shape=(100, 100))
dset[:] = np.arange(10000).reshape(100, 100)

# two regions of the same dataset, selected in one parallel request
indices = [(np.s_[20:40], np.s_[20:40]), (np.s_[40:60], np.s_[40:60])]

# the MultiManager takes one dataset per selection
mm = h5py.MultiManager([dset, dset])
data = mm[indices]

np.testing.assert_array_equal(data[0], dset[20:40, 20:40])
np.testing.assert_array_equal(data[1], dset[40:60, 40:60])

file.close()

Note that this feature does require building h5pyd from source, as it has not yet been published in a release version.

Hi Leo,
Can you provide an example of how you would use NumPy-style indexing? Just want to make sure we're on the same page.

Hey, thank you @mlarson for the suggestion. John already showed me the MultiManager in another post, and it has come in quite handy so far. I would still prefer not to use it for this case, as it does not scale that well when sending a lot of requests.

@jreadey I wrote you an example of what I mean:

import h5py  # equivalently: import h5pyd as h5py
import numpy as np


# NumPy example: two index lists passed in one indexing operation
arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(arr[:, [0, 1], [1, 2]])


file = h5py.File('example_data.h5', 'w')

dset = file.create_dataset('dset', shape=(100, 100))
dset[:] = np.arange(10000).reshape(100, 100)

# passing one index list is supported by both h5py and h5pyd
print(dset[[0,1,2,3],0])
print(dset[[0,1,2,3],:3])

# passing two index lists is not supported and raises an error
print(dset[[0, 1, 2], [0, 1, 2]])

file.close()

My problem requires the use case with two one-dimensional index lists.

Another solution I came up with was to flatten my data along the two dimensions that I need to access often with single indices.
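A sketch of that flattening idea (names and shapes are made up; the point is that a single flat index list per request is the supported case):

import h5pyd
import numpy as np

nrows, ncols = 100, 100
rows = np.array([3, 17, 42])  # hypothetical point coordinates
cols = np.array([5, 63, 9])

with h5pyd.File('/home/test_user1/flat_data.h5', 'w') as f:
    # store the 2-D array flattened along the two point-accessed axes
    dflat = f.create_dataset('dset_flat', shape=(nrows * ncols,))
    dflat[:] = np.arange(nrows * ncols)
    # one flat index per (row, col) pair; fancy indexing expects the
    # list in increasing order, so sort it first
    flat_indices = np.sort(rows * ncols + cols)
    values = dflat[flat_indices.tolist()]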

Ok, I think I have this working now. See: h5pyd/examples/notebooks/fancy_selection.ipynb at master · HDFGroup/h5pyd · GitHub

I didn’t realize before that this feature would need some HSDS modifications. I’ve made those, so to use it, you’ll need to get the latest HSDS from master as well as build h5pyd from the master branch.

Please try it out and let us know how it works for you!

Awesome!
I will try to set it up soon.
I already made an implementation where I index one axis and download a whole range on a second axis, which I then index locally with NumPy. I'm interested to see how much performance I gain.
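For reference, a minimal sketch of that interim approach (hypothetical coordinates; one index list on the first axis, a full range on the second, and the final selection done locally):

import h5pyd
import numpy as np

rows = [3, 17, 42, 88]  # hypothetical scattered coordinates
cols = np.array([5, 63, 9, 71])

with h5pyd.File('/home/test_user1/data.h5', 'r') as f:
    dset = f['dset']
    # a single index list on one axis is supported; the second axis
    # is downloaded as a whole range
    block = dset[rows, :]
    # pick one column per row locally with NumPy
    values = block[np.arange(len(rows)), cols]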

Let us know, I’m curious too.

Is the current state of the master branch deployable as an AKS Kubernetes cluster? When I build the nodes from source (and deploy from my ACR), they fail to start. I was only able to deploy from the hsds Docker repository (the default).

These are the errors I get when calling kubectl describe pods hsds-xxx:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  29s                default-scheduler  Successfully assigned default/hsds-c5d54db97-tm6tt to aks-nodepool1-64018474-vmss000000
  Normal   Pulled     15s (x3 over 29s)  kubelet            Container image "pcrhsdsregistry.azurecr.io/hsds:v1" already present on machine
  Normal   Created    15s (x3 over 29s)  kubelet            Created container sn
  Normal   Started    15s (x3 over 28s)  kubelet            Started container sn
  Normal   Pulled     15s (x3 over 28s)  kubelet            Container image "pcrhsdsregistry.azurecr.io/hsds:v1" already present on machine
  Normal   Created    15s (x3 over 28s)  kubelet            Created container dn
  Normal   Started    15s (x3 over 28s)  kubelet            Started container dn
  Warning  BackOff    1s (x3 over 27s)   kubelet            Back-off restarting failed container sn in pod hsds-c5d54db97-tm6tt_default(d9ddd0b7-0edc-4403-a3ce-66e08cdda13b)
  Warning  BackOff    1s (x3 over 27s)   kubelet            Back-off restarting failed container dn in pod hsds-c5d54db97-tm6tt_default(d9ddd0b7-0edc-4403-a3ce-66e08cdda13b)

Again, there's a good chance I made a mistake in the build or deploy process.

Also, in the AKS install document, the paths to some of the .yml and .sh files are missing the directories they are in. I assume the files were moved at some point and the document wasn't updated. :slight_smile:

Can you take a look at the pod log files? If the problem is due to some misconfiguration, it’s usually clear from the log messages.
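For example, using the pod and container names from the events above:

kubectl logs hsds-c5d54db97-tm6tt -c sn
kubectl logs hsds-c5d54db97-tm6tt -c dn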

Thanks for pointing out the doc problem. I’ll do an update for this soon.