Performance guide? Writing real-time, highly nested data

Hi all,

I’m using h5py to store data collected in real time from an in-the-loop VR setup. I’ve offset the main performance impact by moving the writer to its own thread, but memory usage is growing by about 4 MB/s. Previously, we were preallocating numpy arrays and building datasets like so:

Category -> Data Type -> Frame #

In that layout, frame IDs were duplicated across data types/categories. This had the advantage of letting us use hf_.create_dataset to prespecify the shape and preallocate numpy arrays of the appropriate dimensions. We recently decided to restructure our data storage to be indexed by frame number:

Frame info # -> Category -> Data

To write an individual frame, I assign the data directly to new datasets as follows:

import h5py

hf = h5py.File("./log", "w")
# "frame_number" stands in for the current frame index; data1/data2 are numpy arrays
hf["frame_number/category/data1"] = data1
hf["frame_number/category/data2"] = data2
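
For anyone who wants to reproduce the comparison, here is a minimal, self-contained sketch of the two layouts. It is not our actual logging code: the file names, the "category/data1" names, the frame count, and the per-frame shape are all made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

N_FRAMES = 100        # hypothetical frame count
SAMPLE_SHAPE = (3,)   # hypothetical shape of one frame's data


def write_old_layout(path, frames):
    """Category -> data type -> frame #: one preallocated dataset per data type."""
    with h5py.File(path, "w") as hf:
        # The full shape is known up front, so the dataset is created once.
        dset = hf.create_dataset(
            "category/data1", shape=(len(frames),) + SAMPLE_SHAPE, dtype="f8"
        )
        for i, sample in enumerate(frames):
            dset[i] = sample  # write one row into the existing dataset


def write_new_layout(path, frames):
    """Frame # -> category -> data: new groups and a new dataset for every frame."""
    with h5py.File(path, "w") as hf:
        for i, sample in enumerate(frames):
            # Assignment creates the dataset and any missing intermediate groups.
            hf[f"{i}/category/data1"] = sample


# Random stand-in data; real frames would come from the VR loop.
frames = np.random.default_rng(0).random((N_FRAMES,) + SAMPLE_SHAPE)
tmpdir = tempfile.mkdtemp()
old_path = os.path.join(tmpdir, "old_layout.h5")
new_path = os.path.join(tmpdir, "new_layout.h5")
write_old_layout(old_path, frames)
write_new_layout(new_path, frames)
```

Timing the two functions on the same frames (for example with time.perf_counter) should show how much of the slowdown comes from creating groups and datasets per frame rather than from the data volume itself.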

The per-frame layout seems to have a negative performance impact (it is unclear to me when and how preallocation helps with HDF5, numpy, and Python). I’m wondering if there are any recommendations for speeding this up, or whether this might be a better fit for a database (I’m not sure how sqlite/postgres benchmark against writing directly to an HDF5 file).

Thank you!