Hi,
I have a working solution for small amounts of data and am scaling it up now. I want to import 1 million images into HDF5, which will actually create 5 million chunked datasets and circa 10-20 million groups that index them in a tree-like structure.
At this point I am only allocating the datasets, with these properties:
import numpy as np
import h5py

dataset_type = h5py.h5t.py_create(np.dtype('f4'))   # float32 elements
# allocate space at creation time and never write fill values
dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
space = h5py.h5s.create_simple(dataset_shape)        # dataset_shape is defined elsewhere
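The actual low-level create call then looks roughly like this (a simplified sketch: parent_gid, dset_name and chunk_shape are placeholders for what our indexing code supplies):

# placeholders: parent_gid is the GroupID of the indexing group,
# dset_name the dataset's name (bytes), chunk_shape our per-dataset chunk shape
dcpl.set_chunk(chunk_shape)
dset_id = h5py.h5d.create(parent_gid, dset_name, dataset_type, space, dcpl)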
At the beginning, the ingestion runs smoothly at this pace:
2021-11-12 15:54:03 INFO 100 images done in 2.5463s
2021-11-12 15:54:03 INFO Image cnt: 08700
2021-11-12 15:54:06 INFO 100 images done in 2.5123s
2021-11-12 15:54:06 INFO Image cnt: 08800
2021-11-12 15:54:09 INFO 100 images done in 3.0775s
2021-11-12 15:54:09 INFO Image cnt: 08900
2021-11-12 15:54:11 INFO 100 images done in 2.5066s
2021-11-12 15:54:11 INFO Image cnt: 09000
2021-11-12 15:54:15 INFO 100 images done in 3.5808s
2021-11-12 15:54:15 INFO Image cnt: 09100
2021-11-12 15:54:17 INFO 100 images done in 2.7173s
2021-11-12 15:54:17 INFO Image cnt: 09200
The B-trees start to split somewhere around 20,000 ingested images (100,000 datasets):
2021-11-12 16:00:02 INFO 100 images done in 2.7268s
2021-11-12 16:00:02 INFO Image cnt: 20400
2021-11-12 16:00:06 INFO 100 images done in 4.4550s
2021-11-12 16:00:06 INFO Image cnt: 20500
2021-11-12 16:00:17 INFO 100 images done in 10.2760s
2021-11-12 16:00:17 INFO Image cnt: 20600
2021-11-12 16:00:27 INFO 100 images done in 9.8562s
2021-11-12 16:00:27 INFO Image cnt: 20700
2021-11-12 16:00:36 INFO 100 images done in 8.9341s
2021-11-12 16:00:36 INFO Image cnt: 20800
2021-11-12 16:00:45 INFO 100 images done in 9.9003s
2021-11-12 16:00:45 INFO Image cnt: 20900
2021-11-12 16:00:54 INFO 100 images done in 8.9918s
2021-11-12 16:00:54 INFO Image cnt: 21000
2021-11-12 16:01:06 INFO 100 images done in 11.7186s
At around 500,000 images I stopped it, as it had slowed down to a crawl:
2021-11-19 12:53:27 INFO 100 images done in 438.6948s
2021-11-19 12:53:27 INFO Image cnt: 522600
2021-11-19 12:54:40 INFO 100 images done in 72.9178s
2021-11-19 12:54:40 INFO Image cnt: 522700
2021-11-19 12:54:45 INFO 100 images done in 5.5582s
2021-11-19 12:54:45 INFO Image cnt: 522800
2021-11-19 12:54:55 INFO 100 images done in 9.2149s
2021-11-19 12:54:55 INFO Image cnt: 522900
2021-11-19 12:55:02 INFO 100 images done in 7.3985s
2021-11-19 12:55:02 INFO Image cnt: 523000
2021-11-19 13:03:15 INFO 100 images done in 493.3294s
2021-11-19 13:03:15 INFO Image cnt: 523100
2021-11-19 13:14:05 INFO 100 images done in 649.1689s
2021-11-19 13:14:05 INFO Image cnt: 523200
2021-11-19 13:19:42 INFO 100 images done in 337.9557s
2021-11-19 13:19:42 INFO Image cnt: 523300
2021-11-19 13:24:54 INFO 100 images done in 311.4389s
2021-11-19 13:24:54 INFO Image cnt: 523400
2021-11-19 13:29:01 INFO 100 images done in 247.1409s
2021-11-19 13:29:01 INFO Image cnt: 523500
2021-11-19 13:29:08 INFO 100 images done in 7.3032s
2021-11-19 13:29:08 INFO Image cnt: 523600
My questions are:
- How can I debug which B-trees are actually the bottleneck? My suspects are either the group hierarchy or the chunk indexing B-trees in the file metadata. Is there some internal HDF5 logging I can use for this?
- How can I control the parameters of these B-trees so I can tune this? I know HDF5 is not primarily designed for this many datasets, but a few million entries in a B-tree should not be a problem at all if it is sized correctly at the start. I expect the tradeoff to be somewhat slower access per B-tree element in exchange for far fewer B-tree splits (see the sketch below for the kind of knob I mean).
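To make the second question concrete: as far as I can tell, the file creation property list exposes set_sym_k() for the group symbol-table B-tree and set_istore_k() for the chunk B-tree, but I am not sure these are the right parameters, what values would be sensible at this scale, or whether they still apply when the file is created with libver='latest'. A rough sketch with made-up values:

import h5py

# assumed values -- I have no idea yet what is sensible for ~5 million datasets
fcpl = h5py.h5p.create(h5py.h5p.FILE_CREATE)
fcpl.set_sym_k(64, 16)    # group (symbol table) B-tree rank and leaf size; defaults are 16, 4
fcpl.set_istore_k(128)    # chunk B-tree rank; default is 32

fid = h5py.h5f.create(b"images.h5", h5py.h5f.ACC_TRUNC, fcpl=fcpl)
f = h5py.File(fid)        # wrap the low-level file id in the high-level API

For the first question, the closest I can get from Python seems to be the metadata cache statistics (e.g. f.id.get_mdc_hit_rate()), but that only tells me the cache is struggling, not which B-tree is being split.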
Thank you very much. If you are interested in the reasons why I designed my HDF5 file like this, we already have some publications, but I'd also be happy to walk you through it on a call.
Cheers,
Jiri