Controlling B-tree parameters for performance reasons

Hi,

I have a working solution for small amounts of data and am scaling it up now. I want to import 1 million images into HDF5, which will actually create 5 million chunked datasets and roughly 10-20 million groups that index them in a tree-like structure.

At this point I am only allocating the datasets, with these properties:

import h5py
import numpy as np

# Datatype, dataset-creation property list, and dataspace for each dataset
dataset_type = h5py.h5t.py_create(np.dtype('f4'))
dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)  # allocate space at creation time
dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)    # never write fill values
space = h5py.h5s.create_simple(dataset_shape)   # dataset_shape is defined elsewhere
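
For completeness, this is roughly how those pieces feed into the actual dataset creation via the low-level API (the file name, dataset name, and chunk shape here are illustrative, not my production code):

dcpl.set_chunk((128, 128, 2))  # chunk layout; each chunk is one record in the chunk B-tree

with h5py.File('images.h5', 'w') as f:
    # low-level object names are bytes
    dset_id = h5py.h5d.create(f.id, b'image_00000', dataset_type, space, dcpl=dcpl)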

The ingestion runs smoothly at first, at this pace:
2021-11-12 15:54:03 INFO 100 images done in 2.5463s
2021-11-12 15:54:03 INFO Image cnt: 08700
2021-11-12 15:54:06 INFO 100 images done in 2.5123s
2021-11-12 15:54:06 INFO Image cnt: 08800
2021-11-12 15:54:09 INFO 100 images done in 3.0775s
2021-11-12 15:54:09 INFO Image cnt: 08900
2021-11-12 15:54:11 INFO 100 images done in 2.5066s
2021-11-12 15:54:11 INFO Image cnt: 09000
2021-11-12 15:54:15 INFO 100 images done in 3.5808s
2021-11-12 15:54:15 INFO Image cnt: 09100
2021-11-12 15:54:17 INFO 100 images done in 2.7173s
2021-11-12 15:54:17 INFO Image cnt: 09200

The B-trees start to split somewhere around 20,000 ingested images (100,000 datasets):
2021-11-12 16:00:02 INFO 100 images done in 2.7268s
2021-11-12 16:00:02 INFO Image cnt: 20400
2021-11-12 16:00:06 INFO 100 images done in 4.4550s
2021-11-12 16:00:06 INFO Image cnt: 20500
2021-11-12 16:00:17 INFO 100 images done in 10.2760s
2021-11-12 16:00:17 INFO Image cnt: 20600
2021-11-12 16:00:27 INFO 100 images done in 9.8562s
2021-11-12 16:00:27 INFO Image cnt: 20700
2021-11-12 16:00:36 INFO 100 images done in 8.9341s
2021-11-12 16:00:36 INFO Image cnt: 20800
2021-11-12 16:00:45 INFO 100 images done in 9.9003s
2021-11-12 16:00:45 INFO Image cnt: 20900
2021-11-12 16:00:54 INFO 100 images done in 8.9918s
2021-11-12 16:00:54 INFO Image cnt: 21000
2021-11-12 16:01:06 INFO 100 images done in 11.7186s

and at around 500,000 images I stopped it, as it had slowed to a crawl:
2021-11-19 12:53:27 INFO 100 images done in 438.6948s
2021-11-19 12:53:27 INFO Image cnt: 522600
2021-11-19 12:54:40 INFO 100 images done in 72.9178s
2021-11-19 12:54:40 INFO Image cnt: 522700
2021-11-19 12:54:45 INFO 100 images done in 5.5582s
2021-11-19 12:54:45 INFO Image cnt: 522800
2021-11-19 12:54:55 INFO 100 images done in 9.2149s
2021-11-19 12:54:55 INFO Image cnt: 522900
2021-11-19 12:55:02 INFO 100 images done in 7.3985s
2021-11-19 12:55:02 INFO Image cnt: 523000
2021-11-19 13:03:15 INFO 100 images done in 493.3294s
2021-11-19 13:03:15 INFO Image cnt: 523100
2021-11-19 13:14:05 INFO 100 images done in 649.1689s
2021-11-19 13:14:05 INFO Image cnt: 523200
2021-11-19 13:19:42 INFO 100 images done in 337.9557s
2021-11-19 13:19:42 INFO Image cnt: 523300
2021-11-19 13:24:54 INFO 100 images done in 311.4389s
2021-11-19 13:24:54 INFO Image cnt: 523400
2021-11-19 13:29:01 INFO 100 images done in 247.1409s
2021-11-19 13:29:01 INFO Image cnt: 523500
2021-11-19 13:29:08 INFO 100 images done in 7.3032s
2021-11-19 13:29:08 INFO Image cnt: 523600

My questions are:

  1. How can I debug which B-trees are actually the bottleneck? My suspects are either the group hierarchy or the chunk index in the file metadata. Can I use some internal HDF5 logging to investigate this?
  2. How can I control the parameters of these B-trees so I can tune this? I know HDF5 is not primarily designed for this number of datasets, but a few million entries in a B-tree should not be a problem at all if it is sized correctly from the start. I expect the tradeoff to be slightly slower access per element, but far fewer B-tree splits (see the sketch below).
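
For reference, the only sizing knobs I have found so far are the classic (v1) B-tree parameters on the file-creation property list; I have not yet verified whether they help at this scale:

import h5py

# File-creation property list carries the old-format B-tree sizing knobs
fcpl = h5py.h5p.create(h5py.h5p.FILE_CREATE)
fcpl.set_istore_k(64)   # half-rank of the chunk-index B-tree (default 32)
fcpl.set_sym_k(64, 16)  # half-rank / half leaf size of the group symbol-table B-tree (defaults 16, 4)

fid = h5py.h5f.create(b'images.h5', h5py.h5f.ACC_TRUNC, fcpl=fcpl)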

Thank you very much. If you are interested in the reasons why I designed my HDF5 file like this, we already have some publications, but I’d be happy to explain on a call.

Cheers,

Jiri

Jiri, have you tried

f = h5py.File('name.hdf5', 'a', libver=('v108', 'latest'))  # a mode argument is required in h5py 3.x

How big are your datasets and files, typically?

G.

The datasets are images (2000x1400 px, 2 layers) in 5 different resolutions, each a quarter the size of the previous one, i.e. roughly 40 MB, 10 MB, 2.5 MB, 1.2 MB, and 600 KB. The chunk size I use is (128, 128, 2).
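
For scale, a quick back-of-the-envelope count of chunks in the full-resolution layer (each chunk is one record in that dataset's chunk B-tree):

import math

chunks = math.ceil(2000 / 128) * math.ceil(1400 / 128) * math.ceil(2 / 2)
print(chunks)  # 16 * 11 * 1 = 176 chunks per full-resolution dataset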

I’m actually using the latest version, 1.12.1 - do you mean I should target an older one with the libver parameter?

Thanks,

Jiri

The libver parameter controls the trade-off between backward compatibility and improvements in the file format. Without that setting, you get maximum backward compatibility, but you might miss out on later improvements (support for large numbers of links and attributes, newer B-trees, etc.). It’s worth checking whether those improvements make any difference in your case. They may not.
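
If you are creating files through the low-level interface anyway, the same bounds can be set on a file-access property list; a sketch:

import h5py

# Request the 1.8+ format so new groups can use v2 B-trees and
# fractal heaps for link storage.
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
fapl.set_libver_bounds(h5py.h5f.LIBVER_V18, h5py.h5f.LIBVER_LATEST)

fid = h5py.h5f.open(b'name.hdf5', h5py.h5f.ACC_RDWR, fapl=fapl)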

Best, G.

It appears that you are treating each image as a separate dataset. What’s the rationale for that? Have you considered stacking images (of the same resolution) in 3D datasets, which would drastically reduce the number of objects? Your indexing might be simpler and faster as well, because your indices would be arrays, which you can load into memory and keep there.
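
Something along these lines, where the names, shapes, and the dummy image source are all illustrative:

import h5py
import numpy as np

H, W, L = 2000, 1400, 2
images = (np.zeros((H, W, L), dtype='f4') for _ in range(3))  # dummy stand-in for your reader

with h5py.File('stacked.h5', 'w') as f:
    # One extendable 4D dataset per resolution level instead of one
    # dataset per image.
    stack = f.create_dataset('level0', shape=(0, H, W, L),
                             maxshape=(None, H, W, L),
                             chunks=(1, 128, 128, L), dtype='f4')
    for i, img in enumerate(images):
        stack.resize(i + 1, axis=0)  # grow along the stacking axis
        stack[i] = img               # slot i becomes the in-memory index key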

G.

Well, I tried it with libver 'v108' and it behaves approximately the same, with the difference that from the beginning it takes roughly 7 seconds instead of 3 to process 100 images, and around 50,000 images the slowdown is again substantial:

INFO:rank[0]:100 images done in 9.7553s
INFO:rank[0]:Image cnt: 53000
INFO:rank[0]:100 images done in 20.7211s
INFO:rank[0]:Image cnt: 53100
INFO:rank[0]:100 images done in 30.6601s
INFO:rank[0]:Image cnt: 53200
INFO:rank[0]:100 images done in 7.7564s
INFO:rank[0]:Image cnt: 53300
INFO:rank[0]:100 images done in 6.8846s
INFO:rank[0]:Image cnt: 53400
INFO:rank[0]:100 images done in 11.5875s
INFO:rank[0]:Image cnt: 53500
INFO:rank[0]:100 images done in 41.9344s
INFO:rank[0]:Image cnt: 53600
INFO:rank[0]:100 images done in 10.4130s
INFO:rank[0]:Image cnt: 53700
INFO:rank[0]:100 images done in 7.1252s
INFO:rank[0]:Image cnt: 53800
INFO:rank[0]:100 images done in 10.4750s
INFO:rank[0]:Image cnt: 53900
INFO:rank[0]:100 images done in 38.1471s
INFO:rank[0]:Image cnt: 54000

There are many reasons for that, which I would be happy to explain in detail on a call.

Mainly it is because of the sparsity of coordinates: if I stored all images in one dataset, I would also need to store the coordinates of every pixel alongside it, effectively doubling the size of the data. Right now, I’m just storing the projection metadata as attributes of each dataset, similar to FITS headers. But as mentioned, there are other reasons as well…
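
To illustrate the attribute scheme (the keyword names below are made-up FITS-style examples, not my actual schema):

import h5py

with h5py.File('images.h5', 'a') as f:
    dset = f.require_dataset('image_00000', shape=(2000, 1400, 2), dtype='f4')
    # Projection metadata attached per dataset, FITS-header style
    dset.attrs['CRVAL1'] = 123.456  # reference coordinate, axis 1
    dset.attrs['CRVAL2'] = -54.321  # reference coordinate, axis 2
    dset.attrs['CDELT1'] = 2.8e-4   # pixel scale, axis 1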