Dimensionality of the memory DataSpace when reading from a DataSet


#1

It seems that the dimensionality of the memory H5::DataSpace has a major effect on the performance of the HDF5 API. If I wanted to read a rectangular A x B block of values from a two-dimensional H5::DataSet, I could define the memory DataSpace as either:

  • two-dimensional, with dimensions A x B.
  • one-dimensional, with dimensions 1 x AB.

As far as I can tell, these two choices yield identical results with respect to the ordering of data values. However, reading with the one-dimensional DataSpace takes about twice as long as reading with the two-dimensional one. To demonstrate, I wrote a simple C++ program:

#include "H5Cpp.h"
#include <vector>
#include <iostream>
#include <numeric> // for std::accumulate

int main (int argc, const char** argv) {
    if (argc!=4) {
        std::cout << argv[0] << " [FILE] [DATASET] [ONE_DIM]" << std::endl;
        return 1;
    }
    const char* fname=argv[1];
    const char* dname=argv[2];
    const bool one_call=(argv[3][0]=='1');

    H5::H5File hfile(fname, H5F_ACC_RDONLY);
    H5::DataSet hdata=hfile.openDataSet(dname);
    H5::DataSpace hspace=hdata.getSpace();

    hsize_t dims_out[2];
    hspace.getSimpleExtentDims(dims_out, NULL);
    const size_t total_nrows=dims_out[0];
    const size_t total_ncols=dims_out[1];

    // Defining a submatrix of [2 x ncols] dimensions.
    const size_t N=2;
    hsize_t h5_start[2], h5_count[2];
    h5_start[0]=0;
    h5_start[1]=0;
    h5_count[0]=N;
    h5_count[1]=total_ncols;

    // Defining the output DataSpace.
    H5::DataSpace outspace;
    if (one_call) { 
        std::cout << "Single dimension" << std::endl;
        hsize_t output_dims=N*total_ncols;
        outspace=H5::DataSpace(1, &output_dims);
    } else {
        std::cout << "Two dimensions" << std::endl;
        hsize_t output_dims[2];
        output_dims[0]=N;
        output_dims[1]=total_ncols;
        outspace=H5::DataSpace(2, output_dims);
    }
    outspace.selectAll();

    // Looping across and extracting submatrices.
    double total=0;
    std::vector<double> storage(total_ncols*N); // one [N x ncols] block per read.
    for (size_t i=0; i+N<=total_nrows; i+=N) { // skips a trailing partial block, which would select past the extent.
        h5_start[0]=i;
        hspace.selectHyperslab(H5S_SELECT_SET, h5_count, h5_start);
        hdata.read(storage.data(), H5::PredType::NATIVE_DOUBLE, outspace, hspace);
        total += std::accumulate(storage.begin(), storage.begin() + total_ncols, 0.0); // summing the total of the first row in each block.
    }
        
    std::cout << total << std::endl;
    return 0;
}

… and compiled it using HDF5 1.10.3. I then ran it on an HDF5 file with 10000 x 10000 double-precision values randomly sampled from a normal distribution, chunked by row (i.e., each row was its own chunk) and compressed using Zlib level 6. I got the same total for either setting of ONE_DIM, but the two-dimensional case was routinely faster by more than 2-fold.
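
For anyone wanting to reproduce the comparison, the timing can be taken inside the program rather than with an external tool. This is only a minimal sketch: `time_seconds` is a hypothetical helper, not part of HDF5 or of the program above, and it simply wraps a callable with std::chrono::steady_clock.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not an HDF5 function): runs `work` once and
// returns the elapsed wall-clock time in seconds.
template <class Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the read loop in a lambda and passing it to `time_seconds` for each setting of ONE_DIM should give comparable numbers. It is probably worth repeating each run a few times, since the first pass may also include the cost of pulling the file into the operating system's cache.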

I assume that this is driven by some behaviour of the HDF5 API when it encounters a discrepancy between the dimensionality of the file and memory DataSpaces. In particular, I can imagine that the delay is due to some data rearrangement/copying that is triggered by this discrepancy and might be necessary in the general case, e.g., with strides. This leads to two thoughts:

  • Assuming my diagnosis is correct, should the API be able to better detect situations where data reorganization is unnecessary and skip it for speed?
  • Is there already (and if not, could we get) better documentation on when this reorganization is likely to occur, and how we can set up our parameters for optimal performance?

I’ll admit that trying to switch dimensionalities between DataSpaces was a bit too cute on my part, though I didn’t think that it would lead to such a major performance degradation.


#2

While this is an interesting result, I guess I find myself questioning the legitimacy of an A x B dataspace being equivalent to 1 x AB. Isn’t that like trying to say that double Foo[A][B] is the same as double Bar[AB] (or did you mean double Bar[1][AB]?). Compilers wouldn’t treat these types as the same (though you can probably fudge that using aliasing), so why would we expect an I/O library to?
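
To make the comparison concrete: the two declarations are distinct types, but in row-major C/C++ they occupy memory identically, which is consistent with the identical values reported in #1. The sketch below uses plain C++ with no HDF5 calls, and the names `flat_index` and `same_layout` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Row-major offset of element (r, c) in an array with `ncols` columns.
inline std::size_t flat_index(std::size_t r, std::size_t c, std::size_t ncols) {
    assert(c < ncols);
    return r * ncols + c;
}

// Fills a 3 x 4 two-dimensional array and a flat 12-element buffer with
// the same values, then checks that their bytes are identical.
inline bool same_layout() {
    constexpr std::size_t A = 3, B = 4;
    double foo[A][B];
    std::vector<double> bar(A * B);
    for (std::size_t r = 0; r < A; ++r) {
        for (std::size_t c = 0; c < B; ++c) {
            const double value = static_cast<double>(10 * r + c);
            foo[r][c] = value;
            bar[flat_index(r, c, B)] = value;
        }
    }
    return std::memcmp(foo, bar.data(), sizeof(foo)) == 0;
}
```

So the question is less about whether the buffers match (they do) and more about whether the library should be expected to notice that they do.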


#3

That’s a fair point, and I’ll concede that the example above is somewhat contrived. However, there are cases where the distinction between the DataSpace dimensionalities is less obvious. I originally encountered this behaviour when trying to extract one column at a time from a DataSet (double-precision values sampled from a normal distribution, 100 rows by 100,000 columns, stored in 100 x 100 chunks). It was natural to think of the read output as a column vector, so I naively used a 1D DataSpace in my initial code. With some luck, I eventually realized that I could achieve faster access by setting the memory DataSpace to a 2D submatrix corresponding to a single column. I would say that the latter is not the obvious choice.
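
To spell out the two options for a single-column read, here is a rough sketch of the parameters involved. It deliberately avoids HDF5 so that it compiles on its own: `dim_t` is a stand-in for `hsize_t`, and `ColumnRead` and `column_read` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Stand-in for HDF5's hsize_t so that the sketch compiles without the library.
using dim_t = std::uint64_t;

// Hypothetical container for the parameters of a single-column read.
struct ColumnRead {
    std::array<dim_t, 2> file_start; // would be passed to selectHyperslab() as `start`
    std::array<dim_t, 2> file_count; // would be passed to selectHyperslab() as `count`
    std::vector<dim_t> mem_dims;     // would be passed to the H5::DataSpace constructor
};

// Builds the parameters for reading column `col` of a dataset with `nrows` rows.
// `two_dim` chooses between the [nrows x 1] memory space and the 1D [nrows] one.
inline ColumnRead column_read(dim_t nrows, dim_t col, bool two_dim) {
    ColumnRead out;
    out.file_start = {0, col};
    out.file_count = {nrows, 1};
    if (two_dim) {
        out.mem_dims = {nrows, 1};
    } else {
        out.mem_dims = {nrows};
    }
    return out;
}
```

The file selection is the same in both cases; only `mem_dims` differs, and both shapes describe the same `nrows` contiguous doubles in the output buffer. In real code the rank and dimensions would go to `H5::DataSpace(rank, dims)` as in the program in #1.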

To be clear, I don’t mind that the HDF5 library is doing different things internally for different memory DataSpace dimensionalities. It would be cool for the library to be smart enough to optimize away parameter differences that don’t have an effect, but some responsibility must lie with the user. To that end, it would be nice to have some more guidance - be it clearer documentation or a run-time slap on the wrist - regarding how to set up the memory DataSpaces for maximum performance.

P.S. I would have liked to set up the MWE to demonstrate the effect when extracting individual row vectors, but the timing difference doesn’t seem to occur when N is set to 1. I’m not sure what conditions are necessary to trigger this - chunking? column access? - but I hope that the current MWE is indicative enough. A (very downstream) discussion of the consequences of changing DataSpace dimensionality for column vector extraction is available at https://github.com/LTLA/flmam/issues/1.


#4

Yeah…I think this is all too often the case with the HDF5 API, especially for newer users. There needs to be a set of “recipes” and examples out there that use realistic cases and example data and demonstrate the advantages and disadvantages of common approaches as well as explain why some non-obvious approaches are necessary.


#5

Hello Aaron,

I tried reproducing this with a C program, and I was able to see that using a 1D memory space is slower. I created a bug report (HDFFV-10630) so that we can investigate this issue further.

Thanks!
-Barbara