Limit on the number of datasets in one group?

I’m facing an issue which is very well described in the following StackOverflow post (not from me):

In that post, it was found that creating datasets in a single group becomes very slow (to the point of being unusable) when reaching a limit of around 2.8 million datasets.

On my side, I’m facing a similar issue (the same on Windows and Linux, with HDF5 1.8 or 1.10), but with a lower number of datasets in my group, circa 600,000. From my experiments, it turns out the difference comes from the length of the dataset names: in my case, they are all named with string representations of unique ids (32 characters long), whereas the simple Python code in the StackOverflow post names datasets by their integer rank converted to a string, i.e. with short names. So there appears to be a limit, roughly the number of datasets times the space used by their names, above which creating additional datasets almost freezes.

I guess it is not a good idea in the first place to try to put that many datasets in a single group, and the long-term fix in my code is probably to introduce a hierarchy of groups one way or another, instead of keeping everything in one flat group (something like the sketch below). But are there any parameters in the HDF5 file configuration for pushing that limit higher? If such parameters exist, they would give me a much appreciated short-term workaround.
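To illustrate the kind of hierarchy I have in mind, here is a rough sketch (hypothetical, not my production code: the two-level split on the first characters of the uuid is an arbitrary choice, just for illustration):

import h5py
import uuid

with h5py.File("test_sharded.hdf5", "w") as f:
    for _ in range(100000):
        name = uuid.uuid4().hex                             # 32-character unique id, as in my case
        grp = f.require_group(name[:2] + "/" + name[2:4])   # e.g. group "a3/f0", created on first use
        grp.create_dataset(name, [])                        # empty dataset, as in the script below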

Thanks in advance for your help,

Joël

PS: below is a small variant of the Python script from the StackOverflow post, with unique id strings generated and used to name the datasets. With this variant, a “wall is hit” at around 600,000 datasets in a single group.

import h5py
import time
import uuid

hdf5_file = h5py.File("C:\\TMP\\test.hdf5", "w")

barrier = 1
start = time.perf_counter()
prev_i = None
for i in range(int(1e8)):
    dataset_name = uuid.uuid4().hex  # 32-character unique id, as in my real use case
    hdf5_file.create_dataset(dataset_name, [])
    td = time.perf_counter() - start
    if td > barrier:
        if prev_i is None:
            delta = i
        else:
            delta = i - prev_i
        prev_i = i
        print("Time {}: # dataset {} (delta: {})".format(int(td), i, delta))
        barrier = int(td) + 1

    if td > 600:  # cancel after 600 s
        break

hdf5_file.close()

Thanks for bringing this up, because we have similar issues. It looks like having a lot of datasets is not what HDF is designed for. Yet, in some cases it is the only logical design.

In our case, we store example tensors (usually 4-D) for machine learning in HDF5 files. You can compare it to storing the pictures you want to do pattern recognition on. It is natural to store each picture in a separate dataset; that way, the file becomes a database of input tensors from which the machine learning algorithm can randomly select examples.
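As a concrete illustration of that layout (file name and shapes made up), each example tensor gets its own h5py dataset and stays addressable by name:

import h5py
import numpy as np

with h5py.File("examples.hdf5", "w") as f:
    for i in range(1000):
        tensor = np.random.rand(8, 3, 32, 32)              # one 4-D example tensor
        f.create_dataset("example_{:06d}".format(i), data=tensor)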

The response ‘why don’t you just add a dimension to the dataset?’ may be a workaround, but it kills the simple idea of having the ‘atoms’ of what you are working with in separately accessible data blocks. It would mean that you would have to (arbitrarily) group many ‘pictures’ together just for performance reasons.
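For contrast, that workaround would look something like this sketch: all examples stacked along an extra leading dimension of one big dataset, addressed by index instead of by name:

import h5py
import numpy as np

with h5py.File("examples_stacked.hdf5", "w") as f:
    dset = f.create_dataset("examples", shape=(1000, 8, 3, 32, 32), dtype="f4")
    for i in range(1000):
        dset[i] = np.random.rand(8, 3, 32, 32)             # example i is now dset[i], not a named dataset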

I think the direction of our solution will have to be to spread the data over many files, because ‘one logical dataset = one HDF dataset’ is just a very important concept. But it would be better if the HDF Group came up with a solution for this problem.


Thank you Bert for your answer.

I made some small progress on my side, which I can share if it helps:

Following the HDF Group documentation on the metadata cache at https://support.hdfgroup.org/HDF5/doc/Advanced/MetadataCache/index.html, I was able to push back the limit at which performance drops drastically.

Basically, I called H5Fset_mdc_config() (in C/C++; I am not sure how to access the equivalent HDF5 function from Python, but see the untested sketch below) and changed the max_size field of the config parameter to 128 * 1024 * 124.

Doing so, I was able to create 4 times more datasets.
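I have not tried it from Python myself, but from the h5py low-level API documentation it looks as if the same knob can be reached roughly like this (untested sketch; the get_mdc_config()/set_mdc_config() methods and the CacheConfig.max_size field are my reading of the docs):

import h5py

f = h5py.File("test.hdf5", "w")
mdc = f.id.get_mdc_config()         # h5py.h5ac.CacheConfig for this file's metadata cache
mdc.max_size = 128 * 1024 * 1024    # assumed value: raise the cache ceiling to 128 MiB
f.id.set_mdc_config(mdc)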

Hope it helps.

If C++ is an option for you and you are looking for performance, you may be interested in this project: H5CPP, persistence for modern C++ with pythonic syntax. Here is the link to the ISC’19 presentation.

best wishes: steven


OK, thanks all. I am actually using C++, but I am just the ‘producer’ of such files; there are people working with Python who ‘consume’ the files, and they complain about poor performance when there are very many datasets in a file.

So, while setting the metadata cache parameters may be the way to go, I’d like to have a link for the Python guys, but I can’t find anything useful on Google. Is there anyone following this thread who has tips on setting the right cache parameters to allow smooth performance when handling data files with potentially a few million datasets in them?