It seems that the dimensionality of the memory H5::DataSpace has a major effect on the performance of the HDF5 API. If I wanted to read a rectangular A x B block of values from a two-dimensional H5::DataSet, I could define the memory DataSpace as either (both options are sketched after this list):
- two-dimensional, with dimensions A x B.
- one-dimensional, with dimensions 1 x AB.
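Concretely, the two options look something like this (a sketch; A and B are stand-ins for the block dimensions, and in the program below A is 2 and B is the number of columns):

const hsize_t A = 2, B = 10000;  // example block dimensions

// Option 1: two-dimensional memory space, A x B.
hsize_t dims2[2] = {A, B};
H5::DataSpace mem2d(2, dims2);

// Option 2: one-dimensional memory space of length A*B.
hsize_t dims1 = A * B;
H5::DataSpace mem1d(1, &dims1);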
As far as I can tell, these two choices yield identical results with respect to the ordering of data values. However, use of the one-dimensional DataSpace requires twice as much time compared to the two-dimensional approach. To demonstrate, I wrote a simple C++ program:
#include "H5Cpp.h"
#include <vector>
#include <iostream>
#include <algorithm>
int main (int argc, const char** argv) {
    if (argc!=4) {
        std::cout << argv[0] << " [FILE] [DATASET] [ONE_DIM]" << std::endl;
        return 1;
    }
    const char* fname=argv[1];
    const char* dname=argv[2];
    const bool one_dim=(argv[3][0]=='1');

    H5::H5File hfile(fname, H5F_ACC_RDONLY);
    H5::DataSet hdata=hfile.openDataSet(dname);
    H5::DataSpace hspace=hdata.getSpace();

    hsize_t dims_out[2];
    hspace.getSimpleExtentDims(dims_out, NULL);
    const size_t total_nrows=dims_out[0];
    const size_t total_ncols=dims_out[1];
    // Defining a submatrix of [2 x ncols] dimensions.
    const size_t N=2;
    hsize_t h5_start[2], h5_count[2];
    h5_start[0]=0;
    h5_start[1]=0;
    h5_count[0]=N;
    h5_count[1]=total_ncols;

    // Defining the output (memory) DataSpace.
    H5::DataSpace outspace;
    if (one_dim) {
        std::cout << "Single dimension" << std::endl;
        hsize_t output_dims=N*total_ncols;
        outspace=H5::DataSpace(1, &output_dims);
    } else {
        std::cout << "Two dimensions" << std::endl;
        hsize_t output_dims[2];
        output_dims[0]=N;
        output_dims[1]=total_ncols;
        outspace=H5::DataSpace(2, output_dims);
    }
    outspace.selectAll();
    // Looping across and extracting submatrices.
    double total=0;
    std::vector<double> storage(N*total_ncols); // buffer holds one N x ncols block.
    for (size_t i=0; i<total_nrows; i+=N) {
        h5_start[0]=i;
        hspace.selectHyperslab(H5S_SELECT_SET, h5_count, h5_start);
        hdata.read(storage.data(), H5::PredType::NATIVE_DOUBLE, outspace, hspace);
        total += std::accumulate(storage.begin(), storage.begin() + total_ncols, 0.0); // summing the total of the first row in each block.
    }

    std::cout << total << std::endl;
    return 0;
}
… and compiled it using HDF5 1.10.3. I then ran it on an HDF5 file with 10000 x 10000 double-precision values randomly sampled from a normal distribution, chunked by row (i.e., each row was its own chunk) and compressed using Zlib level 6. I got the same total for either setting of ONE_DIM, but the two-dimensional case was routinely faster by more than 2-fold.
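For reproducibility, the file was generated with something along these lines (a sketch rather than my exact script; the file name, dataset name, and RNG seed are placeholders):

#include "H5Cpp.h"
#include <vector>
#include <random>

int main () {
    const hsize_t nr=10000, nc=10000;
    H5::H5File hfile("test.h5", H5F_ACC_TRUNC); // placeholder file name

    // One chunk per row, compressed with Zlib (deflate) at level 6.
    H5::DSetCreatPropList cparms;
    hsize_t chunk_dims[2]={1, nc};
    cparms.setChunk(2, chunk_dims);
    cparms.setDeflate(6);

    hsize_t dims[2]={nr, nc};
    H5::DataSpace fspace(2, dims);
    H5::DataSet hdata=hfile.createDataSet("yourdata", H5::PredType::NATIVE_DOUBLE, fspace, cparms); // placeholder dataset name

    // Filling one row (i.e., one chunk) at a time with standard normal deviates.
    std::mt19937 rng(42); // arbitrary seed
    std::normal_distribution<double> norm;
    std::vector<double> row(nc);
    hsize_t start[2]={0, 0}, count[2]={1, nc};
    H5::DataSpace mspace(2, count);
    for (hsize_t r=0; r<nr; ++r) {
        for (auto& x : row) { x=norm(rng); }
        start[0]=r;
        fspace.selectHyperslab(H5S_SELECT_SET, count, start);
        hdata.write(row.data(), H5::PredType::NATIVE_DOUBLE, mspace, fspace);
    }
    return 0;
}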
I assume that this is driven by some behaviour of the HDF5 API when it encounters a discrepancy between the dimensionality of the file and memory DataSpaces. In particular, I can imagine that the delay is due to some data rearrangement/copying that is triggered by this discrepancy and might be necessary in the general case, e.g., with strides. This leads to two thoughts (my interim rule of thumb is sketched after the list):
- Assuming my diagnosis is correct, should the API be able to better detect situations where data reorganization is unnecessary and skip it for speed?
- Is there already (and if not, could we get) better documentation on when this reorganization is likely to occur, and how we can set up our parameters for optimal performance?
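In the meantime, the rule of thumb I have fallen back on is to make the memory DataSpace mirror the rank and extent of the file-side hyperslab, along these lines (read_block is just an illustrative helper of my own, not part of the HDF5 API):

#include "H5Cpp.h"

// Read an nrows x ncols block of doubles starting at 'start_row', using a
// memory DataSpace with the same rank and extent as the file-side selection.
void read_block(H5::DataSet& dset, H5::DataSpace& fspace,
                hsize_t start_row, hsize_t nrows, hsize_t ncols, double* buf) {
    hsize_t start[2] = {start_row, 0};
    hsize_t count[2] = {nrows, ncols};
    fspace.selectHyperslab(H5S_SELECT_SET, count, start);
    H5::DataSpace mspace(2, count); // matches the selection's dimensionality
    dset.read(buf, H5::PredType::NATIVE_DOUBLE, mspace, fspace);
}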
I’ll admit that trying to switch dimensionalities between DataSpaces was a bit too cute on my part, though I didn’t think that it would lead to such a major performance degradation.