I’m facing an issue that is very well described in the following StackOverflow post (not from me):
In that post, it was found that creating datasets in one single group becomes very slow (to the point of being unusable) when reaching a limit of around 2.8 million datasets.
On my side, I’m facing a similar issue (the same on Windows and Linux, with HDF5 1.8 or 1.10), but with a lower number of datasets in my group, circa 600,000. From my experiments, it turned out the difference comes from the length of the dataset names: in my case, they are all named with string representations of unique ids (32 characters long), whereas the simple Python code in the StackOverflow post names datasets by their integer rank converted to a string, hence with short names. So it appears there is a limit, combining the number of datasets and the space used by their names, above which creating additional datasets almost freezes.
I guess it is not a good idea in the first place to put that many datasets in one single group, and the long-term fix in my code is probably to introduce, one way or another, a hierarchy of groups instead of keeping everything in a single flat group. But are there any parameters in the HDF5 file configuration for pushing that limit higher? If such parameters exist, they would give me a very much appreciated short-term work-around.
Thanks in advance for your help,
Joël
PS: below is a small update of the Python script from the StackOverflow post, with unique-id strings generated and used to name the datasets. With this variant, a “wall is hit” around 600,000 datasets in one single group.
import h5py
import time
import uuid

hdf5_file = h5py.File("C:\\TMP\\test.hdf5", "w")

barrier = 1
start = time.perf_counter()  # time.clock() was removed in Python 3.8
prev_i = None
for i in range(int(1e8)):
    # name each (empty) dataset with a 32-character unique id
    dataset_name = uuid.uuid4().hex
    hdf5_file.create_dataset(dataset_name, [])
    td = time.perf_counter() - start
    if td > barrier:
        if not prev_i:
            delta = i
        else:
            delta = i - prev_i
        prev_i = i
        print("Time {}: # dataset {} (delta: {})".format(int(td), i, delta))
        barrier = int(td) + 1
    if td > 600:  # cancel after 600 s
        break
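For what it's worth, the hierarchy of groups mentioned above could be sketched by sharding on a prefix of the hex id, so that no single group ever holds more than a bounded number of direct children. The `shard_path` helper below is hypothetical (not part of h5py or HDF5), and the fan-out parameters are assumptions to tune:

```python
def shard_path(hex_id, levels=2, width=2):
    """Build a nested group path from the leading characters of a hex id.

    With levels=2 and width=2, an id starting with "89ab" maps to
    "89/ab/<full id>", so each group has at most 256 direct children
    per level instead of hundreds of thousands in one flat group.
    """
    parts = [hex_id[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [hex_id])

# Hypothetical usage with the script above:
#   hdf5_file.create_dataset(shard_path(uuid.uuid4().hex), [])
# h5py creates the intermediate groups automatically when the
# dataset name contains slashes.
```

With 32-character hex ids the prefixes are roughly uniform, so the datasets spread evenly across the shard groups.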