Performance improvements for many small datasets

Hi.

We have an application that runs a set of sequential operations on a single machine.
Each of these steps stores a large number of small metadata items. These are put into different sub groups and stored as individual datasets. Each step also creates a single large image (~48 MB).
But due to all the HTTP requests for the small metadata, this part ends up being very time consuming, even more so than sending the large image.

This leads us to the following questions:
Is there any way of sending an entire group in a single request?
Or is the only way to achieve this to create a local file, upload it with hsload, and then link it into the original file?
Also, is it possible to create multiple links in a single HTTP request?

I hope this h5pyd example shows what we are experiencing.
Running it against a local POSIX instance with 6 threads (runall.sh 6) gives me the following timings:
Mean timings datasets: 1.738 s, std: 0.368
Mean timings Im: 0.819 s, std: 0.218

import h5pyd as h5py
import time
import uuid
import numpy as np

class ThingItem:
    def __init__(self, name, age, version, data):
        self.name = name
        self.age = age
        self.version = version
        self.data = data

“”"
Storing approaches
“”"
def store(group, items):
for key, val in items.items():
if type(val) == dict:
g = group.require_group(key)
store(g, val)
if type(val) == ThingItem:
group.attrs[“name”] = val.name
group.attrs[“age”] = val.age
group.attrs[“version”] = val.version
else:
group.create_dataset(key, data=val)

“”"
Creating test data
“”"
child= {}
child[“name”] = “John”
child[“age”] = “32”
child[“adress”] = “some street”

itm = ThingItem(“Jens”, 42,1,child)

things = {}
things[“item1”] = 42
things[“item2”] = “string test”
things[“child1”] = itm
things[“child2”] = itm
things[“child3”] = itm
things[“child4”] = itm

“”"
Running the test
“”"
N = 100
timingsData = np.zeros(N)
timingsIm = np.zeros(N)
for i in range(N):
filename = str(uuid.uuid4()) + “test8.h5”
fqdn = “/home/test_user1/” + filename

print(f"itteration {i} file:"+fqdn)
start = time.time()
with h5py.File(fqdn, "a") as f:
    g = f.require_group("/test")
    store(g, things)
end = time.time()
timingsData[i] = end-start
print(f"Saving small datasets: {timingsData[i]} sek" )

im = np.random.randint(0,10,size=[6000,4000], dtype=np.int16)

things["im"] = im
start = time.time()
with h5py.File(fqdn, "a") as f:
    f["im"] = im

end = time.time()
timingsIm[i] = end-start
print(f"Saving image: {timingsIm[i]} sek" )
print("")

print(f"Mean timings datasets {np.mean(timingsData)}, std: {np.std(timingsData)} ")
print(f"Mean timings Im {np.mean(timingsIm)}, std: {np.std(timingsIm)} ")

Hi,

I created a gist here that's based on your code: https://gist.github.com/jreadey/f612d5b1b8379ae653f93b87a96ea038.

Running it, I too saw that it spends much more time creating the small datasets than the large one. As you probably guessed, this is due to the many round trips to the server, one for each create request.

There's currently no way to create multiple groups or datasets in one request. GraphQL support (see: https://github.com/HDFGroup/hsds/issues/128) would be good for this, though there'd remain the problem of adapting the h5py API to use it.

I suspect using asyncio would help a lot. With asyncio you can have many in-flight requests to HSDS, which minimizes the time spent waiting for responses. This would require using the REST API directly, though. I can try coming up with an example if you are interested.
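To make the idea concrete, here's a rough sketch of the pattern (not a final example; the endpoint URL, credentials, and domain below are placeholders, and it assumes the domain already exists, e.g. created with hstouch):

import asyncio
import aiohttp

ENDPOINT = "http://localhost:5101"               # placeholder HSDS endpoint
DOMAIN = "/home/test_user1/async_test.h5"        # placeholder domain
AUTH = aiohttp.BasicAuth("test_user1", "test")   # placeholder credentials

async def create_linked_group(session, root_id, name):
    # one round trip to create an anonymous group...
    params = {"domain": DOMAIN}
    async with session.post(f"{ENDPOINT}/groups", params=params, auth=AUTH) as rsp:
        group_id = (await rsp.json())["id"]
    # ...and one to link it under the root group
    url = f"{ENDPOINT}/groups/{root_id}/links/{name}"
    async with session.put(url, params=params, json={"id": group_id}, auth=AUTH) as rsp:
        rsp.raise_for_status()

async def main():
    params = {"domain": DOMAIN}
    async with aiohttp.ClientSession() as session:
        # fetch the root group id for the domain
        async with session.get(f"{ENDPOINT}/", params=params, auth=AUTH) as rsp:
            root_id = (await rsp.json())["root"]
        # keep many create+link operations in flight at once instead of
        # paying for each round trip sequentially
        tasks = [create_linked_group(session, root_id, f"group_{i}") for i in range(100)]
        await asyncio.gather(*tasks)

asyncio.run(main())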

Hi Jreadey.

First off, thanks for looking into it.

Yes, your gist looks like the test I made; I don't know what happened to my code when I pasted it in here.
I have not worked with GraphQL, but yes, that looks exactly like what I was hoping to achieve. Couldn't h5pyd just translate dictionaries (and nested dictionaries) into this GraphQL format, a little like in my example?
That is actually what the fastest version I have does:
I convert all our nested dict objects to a JSON string and store that as a single string instead, which actually looks very similar to the wiki's example request.
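Roughly like this sketch (to_jsonable is just an illustrative helper, reusing ThingItem and things from the example above, and it assumes h5pyd stores a Python string as a scalar variable-length string dataset, as h5py does):

import json
import h5pyd as h5py

def to_jsonable(val):
    # illustrative helper: flatten ThingItem instances into plain dicts
    # so the stock json module can serialize the whole nested structure
    if isinstance(val, ThingItem):
        return {"name": val.name, "age": val.age,
                "version": val.version, "data": to_jsonable(val.data)}
    if isinstance(val, dict):
        return {k: to_jsonable(v) for k, v in val.items()}
    return val

with h5py.File("/home/test_user1/json_test.h5", "a") as f:
    # one string dataset, one request -- instead of a request per
    # group, dataset, and attribute
    f.create_dataset("meta", data=json.dumps(to_jsonable(things)))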

But I guess that would clash a little with the concept of keeping h5py and h5pyd the same…
In HDF5 we usually close the file between each object we store, so a direct translation (for this type of case) would be many times slower than the example code here, simply due to the opening and closing of the file and the objects.

When I wrote this request I was actually assuming that hsload would do something like this GraphQL approach under the hood. But after examining the traffic with Fiddler, I can see that the number of requests is almost the same as in my example. So I guess hsload just unpacks the file on the client side and sends each element (dataset and attribute) in the file as an individual request?

I think an example like that would be really nice to have somewhere in the documentation.
But that approach would also mean that we need to implement the authentication part ourselves. We were planning to use OpenID Connect.

FYI, on a small side note: I think the path given by _getKeycloakUrl in https://github.com/HDFGroup/h5pyd/blob/master/h5pyd/_hl/openid.py
is no longer correct with the latest version of KeyCloak:
https://keycloak.discourse.group/t/keycloak-x-realm-auth-urls-different/6715
But, as they suggest, I guess it is possible to get back to that configuration by running KeyCloak with
kc.sh config -Dquarkus.http.root-path=/auth

Yes, GraphQL support will be a bigger task.
For now I'm working on an async version of your code. I think that should give a nice performance improvement and won't require any changes to HSDS.

Re KeyCloak: are you using KeyCloak yourself? If you could submit a PR with the fix, that would be much appreciated.

Yes, I guess that would be a bigger task.
I have only briefly tested it locally, in the hope that we could use it as a placeholder for other OpenID providers later on.
And my local workaround for this specific issue was to just change that line.
But I did not get it to work completely; I just noticed this change in the API while debugging it.
I will probably return to it in the near future.

But even if I do get it running this way, I guess it should rather be a config setting to choose between the legacy URL and the new one, right?
Or perhaps, even better, to just use the URLs returned by the discovery document:
openid_url: http://<server_dns>:<server_port>/auth/realms/<keycloak_realm>/.well-known/openid-configuration
where the default is now without the "/auth" part:
openid_url: http://<server_dns>:<server_port>/realms/<keycloak_realm>/.well-known/openid-configuration
But this is configurable in the config file.
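In code that would be something like this sketch (host, port, and realm are placeholder values; the endpoint keys come from the standard OpenID Connect discovery document):

import requests

# placeholders -- substitute your KeyCloak host, port, and realm
openid_url = "http://localhost:8080/realms/myrealm/.well-known/openid-configuration"

# the discovery document advertises the realm's endpoints, so the client
# never has to know whether the server uses the legacy "/auth" prefix
config = requests.get(openid_url).json()
token_endpoint = config["token_endpoint"]
auth_endpoint = config["authorization_endpoint"]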

Yes, it would be desirable to work with different versions of KeyCloak.
And not requiring an additional config setting would be good too. :slight_smile:

Here’s your code example using async processing: https://gist.github.com/jreadey/8030e8a7fc3d0c9067c5a8d01d8507a8.

It's a bit more verbose than the h5pyd version since it has to use the HDF REST API and aiohttp for async to work (the Python requests package doesn't support async).

When I run this with HSDS on Docker I get:

$ python make_small_obj_test.py  /home/jreadey/small_obj.h5
N: 100
max_tcp_connections: 10
task_limit: 10
log_level: error
domain: /home/jreadey/small_obj.h5
Saving small datasets: 4.13s
Saving image: 0.39 s
group count: 500
dataset count: 200
attribute_count: 1200

There's an option to set the number of parallel tasks. If I set this to 1 (simulating non-async), I get:

$ python make_small_obj_test.py --task-limit=1  /home/jreadey/small_obj.h5
N: 100
max_tcp_connections: 10
task_limit: 1
log_level: error
domain: /home/jreadey/small_obj.h5
Saving small datasets: 10.50s
Saving image: 0.35 s
group count: 500
dataset count: 200
attribute_count: 1200
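The task limit is just a cap on how many requests are in flight at once. In asyncio that's commonly expressed with a semaphore; here's a minimal sketch of the pattern (the gist may structure it differently):

import asyncio

async def run_bounded(coros, task_limit):
    # allow at most task_limit coroutines in flight; task_limit=1
    # degenerates to sequential requests, as in the run above
    sem = asyncio.Semaphore(task_limit)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))

async def fake_request(i):
    await asyncio.sleep(0.01)  # stand-in for an HTTP round trip
    return i

results = asyncio.run(run_bounded([fake_request(i) for i in range(100)], task_limit=10))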

Still, the time required to create thousands of small objects is greater than for the image dataset, but it's more than 2x faster with the async version. The downside is that the code is more complicated. Let me know if you have any questions.

Eventually I hope to have a version of h5pyd that supports async (or maybe an entirely new package); that would make it a little easier to use.

Hi again.

Thank you for the example.
It is an interesting approach, and yes, a bit more complicated.
If you only see a 2x speedup, I would assume you must be hitting some of the same endpoints on the server side.
But it's a good example to have in the repo :slight_smile: