Hello,
I wanted to ask if there is functionality in h5pyd to retrieve data from multiple datasets/groups in one call. Context:
I am asking for performance reasons, especially for reading fairly small subsets. My data forces me to put it in different datasets but I usually need to retrieve data from all datasets.
So far I haven’t found anything in the documentation except an old design document:
Has this been implemented?
Also, I have tried to access different domains asynchronously with the Python asyncio library, which gave me the same runtime as accessing them sequentially. Maybe I made a mistake, or does HSDS not support parallel requests from one client?
Yes, if you need to retrieve many small datasets, it can be a bit slow since the latencies between each request to HSDS add up.
When you experimented with asyncio, were you using the aiohttp package? Unless your http routines specifically support await, the calls are likely to be made sequentially anyway.
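To make that concrete, here is a small sketch of the pitfall (the URLs are placeholders and the blocking client here is requests); an async wrapper around a blocking call still runs one request at a time:

import asyncio
import requests  # blocking http client, used here to show the pitfall

async def fetch(url):
    # requests.get() does not support await, so it blocks the event
    # loop: even under asyncio.gather() these calls run sequentially
    return requests.get(url)

async def main():
    urls = ["http://hsds-server/one", "http://hsds-server/two"]  # placeholders
    return await asyncio.gather(*(fetch(u) for u in urls))

asyncio.run(main())

Swapping requests.get() for an aiohttp session request (or offloading the blocking call with asyncio.to_thread) is what allows the requests to actually overlap.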
Another practical issue with asyncio is that unless your application is already designed with async processing in mind, it’s hard to bolt on some async functions later on.
Anyway, in order to provide a more practical way for Python users to benefit from parallel processing, we recently added a h5pyd feature to help with this use case: MultiManager. The MultiManager enables applications to read or write multiple selections from multiple datasets in one call. Internally, it uses Python threading to send one http request per selection in parallel.
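For orientation, a minimal sketch of the call pattern (the domain path and dataset names are made up for illustration; the indexing style follows the benchmark code later in this thread):

import h5pyd

# open a domain and collect the datasets to read from;
# the domain path and dataset names are illustrative
f = h5pyd.File("/home/myuser/mydata.h5", "r")
datasets = [f["temperature"], f["pressure"], f["humidity"]]
mm = h5pyd.MultiManager(datasets)

# one selection per dataset (assuming 2-D datasets here), fetched in
# parallel; results come back as a list in the same order as the datasets
results = mm[[(0, 0), (0, 0), (0, 0)]]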
The code is not yet in an official h5pyd release, but you can get it with: $ pip install git+https://github.com/hdfgroup/h5pyd. Take a look at some of the test code from:
and it should be fairly clear how it works. If anything is unclear, please let us know.
Once you’ve tried out MultiManager, I’d be curious to hear what kind of performance benefit you see. In our testing speedup varies quite a bit depending on the number of selections used, the size of the selections, and a host of other factors. Hopefully your application will get a good speedup!
Hi jreadey,
I am happy this feature has already been implemented in h5pyd. With the help of the test code I was able to write a small benchmark script which compares accessing the data with the MultiManager and sequentially. I tested it on effectively a netCDF file with one 5-dimensional variable and the 5 corresponding coordinate axes. So 6 variables in total, with one being a lot larger than the others. I benchmarked by retrieving one random entry from each of the datasets in order to avoid caching effects.
Time for sequential access: ~400ms
Time with Multimanager: ~100ms
That is a performance improvement of about a factor of 4, compared to a theoretical limit of 6.
I would assume the large dataset creates a bit of search overhead, so retrieving a value from it takes longer than from the smaller ones. This would mean that accessing equally sized datasets would result in better scaling for the MultiManager.
I wrote some generic benchmark code, feel free to use it:
import random
from time import time

import h5pyd


def generate_range(ds_shape: tuple):
    # generate a tuple of random indices for one dataset
    indices = []
    for axis_length in ds_shape:
        index = random.randint(0, axis_length - 1)
        indices.append(index)
    return tuple(indices)


def generate_index_query(h5file):
    # generate a list of index tuples, one per dataset
    query = []
    for ds in h5file.values():
        ds_shape = ds.shape
        indices = generate_range(ds_shape)
        query.append(indices)
    return query
def benchmark_multimanager(h5file, num=10):
    """
    Benchmark retrieving one random entry from every dataset in an h5file
    using the MultiManager.
    """
    ds_names = list(h5file.keys())
    datasets = [h5file[name] for name in ds_names]
    mm = h5pyd.MultiManager(datasets)

    # prepare queries up front to exclude this code from the runtime
    queries = []
    for i in range(num):
        query = generate_index_query(h5file)
        queries.append(query)

    # accessing the data
    t0 = time()
    for query in queries:
        results = mm[query]
    runtime = time() - t0
    print(f"Mean runtime multimanager: {runtime/num}")
    # ~100ms for the case with 6 datasets
def benchmark_sequential_ds(h5file, num=10):
    """
    Benchmark retrieving one random entry from every dataset in
    an h5file by sequentially looping through the datasets.
    """
    # prepare queries up front to exclude this code from the runtime
    index_lists = []
    for i in range(num):
        index_list = []
        for ds in h5file.values():
            indices = generate_range(ds.shape)
            index_list.append(indices)
        index_lists.append(index_list)

    # accessing the data
    t0 = time()
    for index_list in index_lists:
        for indices, ds in zip(index_list, h5file.values()):
            result = ds[indices]
    runtime = time() - t0
    print(f"Mean runtime sequentially: {runtime/num}")
    # ~400ms for the case with 6 datasets
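To run the two benchmarks, something like this hypothetical driver would do (the domain path is a placeholder):

# hypothetical driver for the benchmarks above; the domain path is a placeholder
with h5pyd.File("/home/myuser/mydata.h5", "r") as h5file:
    benchmark_multimanager(h5file)
    benchmark_sequential_ds(h5file)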
Will the MultiManager be added to the next release?
You tested with a local file, right?
If so, I am quite impressed that there was so much to gain even locally. For remote access, I would assume it scales even better.