Hello,
I'm reaching out for help with using HDF5 more effectively at my job. I'll give as much background up front as possible, and I'm happy to come back with more information if asked.
Background:
We use HDF5 to record our simulation data. All data is stored as compound types and is written out as an N×1 array of those compound types. We try to minimize disk usage by using maximum gzip compression. A single HDF5 file we write can range from 25 MB to upwards of 2.5 GB, and we write at least around 100 datasets per file.
Some datasets have upwards of 460 fields; that is, the compound type representing an entry in the dataset can have up to 460 fields, the same as the equivalent C++ struct would have. All fields of a dataset are known a priori, but the number of rows the dataset will end up with from the simulation is unknown (I believe we resize the datasets often).
(I'm more than happy to share more, or even provide an example dump of one of our HDF5 files; I just don't know what would be most useful to share.)
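To make the setup concrete, here is a rough sketch of the write pattern I'm describing. The file name, dataset name, field names, and chunk size are all made up, and I'm showing a 1-D dataset for brevity; our real datasets are written as N×1 arrays of much wider compound types:

```python
import h5py
import numpy as np

# Placeholder compound dtype -- our real datasets have dozens to ~460 fields.
row_dtype = np.dtype([("time", "f8"), ("pos_x", "f8"), ("pos_y", "f8")])

with h5py.File("example_output.h5", "w") as f:
    # Resizable dataset of compound records, chunked and stored with
    # maximum gzip compression.
    dset = f.create_dataset(
        "some_dataset",
        shape=(0,),
        maxshape=(None,),      # row count is unknown up front
        dtype=row_dtype,
        chunks=(4096,),
        compression="gzip",
        compression_opts=9,
    )

    # As the simulation produces rows, we resize and append.
    new_rows = np.zeros(1000, dtype=row_dtype)
    dset.resize(dset.shape[0] + new_rows.shape[0], axis=0)
    dset[-new_rows.shape[0]:] = new_rows
```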
Problem:
Our biggest problem, in my opinion, and the thing I'm trying to optimize, is reading the datasets back out. We read from the datasets many, many times, and when profiling our code, 90% of the runtime is spent reading from disk (specifically in h5py). When reading we use the OS's standard driver (Linux in this case); I tried toying around with the core driver to read the HDF5 file into memory, but it didn't seem to provide any performance benefit.
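For reference, a typical read today looks roughly like this (file and dataset names are placeholders), along with the core-driver variant I experimented with:

```python
import h5py

# Typical read today: default OS driver, full read of all rows and all fields.
with h5py.File("run_output.h5", "r") as f:
    data = f["some_dataset"][:]

# Core-driver variant I tried: the whole file is loaded into memory first
# and subsequent reads are served from RAM. It didn't measurably help.
with h5py.File("run_output.h5", "r", driver="core") as f:
    data = f["some_dataset"][:]
```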
For the datasets that contain dozens to hundreds of fields, we typically care about only a very small subset of fields on any given read, but we always read all entries. This is the biggest problem: we want all rows but only a small subset of the fields.
So how can we optimize this sort of reading with HDF5?
Remarks:
I've tried doing partial I/O on the compound type fields, but in testing on our largest compound types (460 fields), the difference in read time between reading all entries of one field and reading all entries of all fields is about 2 seconds at best, which to me makes partial field reading pointless.
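Roughly, the partial field reads I tested looked like this (dataset and field names are placeholders; I believe the .fields() wrapper needs h5py 2.10 or later):

```python
import h5py

with h5py.File("run_output.h5", "r") as f:
    dset = f["wide_dataset"]

    # All rows, a single field:
    one_field = dset["field_017"]

    # All rows, a small subset of fields:
    few_fields = dset.fields(["field_017", "field_251"])[:]
```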
Additionally, I've tried breaking the fields of these compound types out into separate arrays: in the case of a compound type with 460 fields, I wrote 460 arrays, each in its own dataset named after the field. This, however, killed our write times to the point that it would in no way be acceptable.
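A sketch of that experiment, with placeholder names and dtypes, for clarity:

```python
import h5py
import numpy as np

# One dataset per field instead of one compound dataset. With ~460 fields,
# every append touches ~460 datasets, and write times became unacceptable.
field_names = [f"field_{i:03d}" for i in range(460)]

with h5py.File("split_output.h5", "w") as f:
    grp = f.create_group("wide_dataset")
    for name in field_names:
        grp.create_dataset(
            name,
            shape=(0,),
            maxshape=(None,),
            dtype="f8",
            chunks=(4096,),
            compression="gzip",
            compression_opts=9,
        )

    # Appending a block of rows now means resizing and writing every
    # per-field dataset separately.
    n_new = 1000
    for name in field_names:
        d = grp[name]
        d.resize(d.shape[0] + n_new, axis=0)
        d[-n_new:] = np.zeros(n_new)
```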