Union of non-consecutive hyperslabs is very slow

We have a use case where we want to extract many non-consecutive rows from a DataSet, so we thought we would build a union selection in a DataSpace and use it in a subsequent DataSet::read() call. However, when the rows are non-consecutive, constructing the union is extremely slow - slower, in fact, than actually reading the data from file! The C++ code below provides a simple standalone example:

#include "H5Cpp.h"
#include <vector>
#include <iostream>
#include <numeric> // for std::accumulate
#include <ctime>

int main (int argc, const char** argv) {
    if (argc!=4) {
        std::cout << argv[0] << " [FILE] [DATASET] [CONSEC]" << std::endl;
        return 1;
    }
    const char* fname=argv[1];
    const char* dname=argv[2];
    const bool consec=(argv[3][0]=='1');

    H5::H5File hfile(fname, H5F_ACC_RDONLY);
    H5::DataSet hdata=hfile.openDataSet(dname);
    H5::DataSpace hspace=hdata.getSpace();

    hsize_t dims_out[2];
    hspace.getSimpleExtentDims(dims_out, NULL);
    const size_t total_nrows=dims_out[0];
    const size_t total_ncols=dims_out[1];

    // Defining submatrices.
    const size_t NR=10000, NC=100;
    hsize_t h5_start[2], h5_count[2];
    h5_start[0]=0;
    h5_start[1]=0;
    h5_count[0]=1;
    h5_count[1]=NC;

    {
        hspace.selectNone();
        clock_t start=std::clock();
        size_t counter=0;
        for (size_t i=0; i<NR; ++i) {
            counter += (consec ? 1 : 2); // a step of 2 gives non-consecutive rows.
            h5_start[0]=counter;
            hspace.selectHyperslab(H5S_SELECT_OR, h5_count, h5_start);
        }
        clock_t end=std::clock();
        std::cout << "Elapsed for union: " << double(end - start)/CLOCKS_PER_SEC << std::endl;
    }
    
    std::vector<double> storage(NR * NC);
    h5_count[0]=NR;
    H5::DataSpace outspace(2, h5_count);
    outspace.selectAll();

    {
        clock_t start=std::clock();
        hdata.read(storage.data(), H5::PredType::NATIVE_DOUBLE, outspace, hspace);
        clock_t end=std::clock();
        std::cout << "Elapsed for read: " << double(end - start)/CLOCKS_PER_SEC << std::endl;
    }

    double total = std::accumulate(storage.begin(), storage.end(), 0.0); 
    std::cout << "Total is: " << total << std::endl;
    return 0;
}

I should point out that the above example takes every second row only for the sake of simplicity; our actual use case does not have such a predictable structure, so we can't use the stride argument of selectHyperslab.
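For what it's worth, one partial mitigation we could apply on our side: since our row indices are sorted, consecutive indices can be coalesced into contiguous runs, so the selection loop issues one selectHyperslab(H5S_SELECT_OR, ...) per run rather than per row. A minimal sketch of the run computation (standard library only; the `Run` struct and `coalesce_rows` helper are just our own illustration, not HDF5 API):

```cpp
#include <cstddef>
#include <vector>

// A run of consecutive rows: [start, start + count).
struct Run {
    std::size_t start;
    std::size_t count;
};

// Coalesce sorted, unique row indices into contiguous runs, so that
// the selection loop does one H5S_SELECT_OR per run instead of per row.
std::vector<Run> coalesce_rows(const std::vector<std::size_t>& rows) {
    std::vector<Run> runs;
    for (std::size_t r : rows) {
        if (!runs.empty() && runs.back().start + runs.back().count == r) {
            ++runs.back().count; // extend the current run.
        } else {
            runs.push_back(Run{r, 1}); // start a new run.
        }
    }
    return runs;
}
```

Each run then becomes a single hyperslab with h5_start[0]=run.start and h5_count[0]=run.count. For fully non-consecutive rows this degenerates to the original per-row loop, of course, but for clustered rows it cuts the number of union operations substantially.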

We ran this on an HDF5 file containing a gene expression matrix of about 30000 genes (columns) by 1 million cells (rows), which can be obtained from https://experimenthub.bioconductor.org/fetch/1040. Compiled against HDF5 1.10.3, we get:

$ ./HDF5UnionTester ~/.ExperimentHub/1040 counts 1 # consecutive
Elapsed for union: 0.002197
Elapsed for read: 0.023256
Total is: 198451
$ ./HDF5UnionTester ~/.ExperimentHub/1040 counts 0 # non-consecutive
Elapsed for union: 4.7186
Elapsed for read: 0.227834
Total is: 190563

The slower read I can understand, as it involves twice as many chunks (though I am surprised it's 10 times slower rather than just 2-fold); however, the time taken to build the union is very unexpected. At this rate, it seems faster to just call DataSet::read() multiple times instead of making a single call with a union selection… which seems to defeat the purpose of having the union capability in the first place.

This is probably related to a problem I discussed several years ago with Quincey (and which still exists): iteration with small hyperslabs through a DataSet is very slow. For example, iterating vector by vector (thus one access per vector) through a 3D DataSet is slow, while iterating plane by plane through the same dataset is much faster, even though the same amount of data is read. In both cases the chunk cache was large enough to hold all chunks.

I suspect the chunk lookup is very slow and might touch the underlying B-tree for every access, i.e. for every vector, but I'm not sure.

Cheers,

Ger

Ger and All,

We confirm that working with selections may be very slow. The issue has a very high priority on our to-do list.

Thank you!

Elena