We have been doing some cross-testing with other formats recently and noticed that reading attributes is extremely slow. We’ve been using HDF5 as a file format for 15+ years (eman2.org), and most of our performance testing was done long ago. In a typical HDF5 file we put anywhere from 1,000 to 100,000 data sets, and each data set has 20-100 small attributes associated with it. Most of these are simple numbers; a few are strings. We use the HDF5 library directly from C, so we have very precise control over what we are doing.
Anyway, in our test we read all of the attributes from all of the data sets in the file, but none of the actual data. I then ran this through valgrind to see where the time was being spent. 98% of the time is within the HDF5 library, and two-thirds of that is spent in H5A_open_by_idx(). Digging deeper, it appears that every call to this function calls H5A_build_compact_table() and then releases the table, and this accounts for pretty much the entire problem.
My interpretation is that every time I open a single attribute, the library reads in and builds the entire index of attributes (including a qsort and other work), finds the single element I asked for, then releases the whole thing again. I looked for workarounds in the library, but even if I used the H5A iteration routine, I would still have to call open on each attribute.
After reading the forums here a bit, I found that others have had performance issues with attributes starting with 1.8, but also that 1.8 added an alternative dense storage scheme using B-trees, which should improve things, so I added:
H5Pset_libver_bounds(accprop, H5F_LIBVER_18, H5F_LIBVER_LATEST);
to my code. Alas, reading a file written with this option set actually took 5x longer than reading one written without it. I also tried updating the HDF5 library to 1.10 (we usually use the latest patch release of 1.8), but that didn’t help either.
Is there anything I can do? My goal is to read the entire set of attributes, so I don’t care whether I open by name, by index, or by any other method. I’m just trying to avoid the silly overhead of reading and building the table for every single attribute read.
Thanks in advance for any suggestions at all!