We have a use case where we want to extract many non-consecutive rows from a DataSet
, so we thought we would create a union selection in a DataSpace
to use in the subsequent DataSet::read()
call. However, when the rows were non-consecutive, the construction of the union was extremely slow - in fact, slower than the actual reading of the data from file! The C++ code below provides a simple standalone example:
#include "H5Cpp.h"
#include <vector>
#include <iostream>
#include <algorithm>
#include <ctime>
int main (int argc, const char** argv) {
if (argc!=4) {
std::cout << argv[0] << " [FILE] [DATASET] [CONSEC]" << std::endl;
return 1;
}
const char* fname=argv[1];
const char* dname=argv[2];
const bool consec=(argv[3][0]=='1');
H5::H5File hfile(fname, H5F_ACC_RDONLY);
H5::DataSet hdata=hfile.openDataSet(dname);
H5::DataSpace hspace=hdata.getSpace();
hsize_t dims_out[2];
hspace.getSimpleExtentDims(dims_out, NULL);
const size_t total_nrows=dims_out[0];
const size_t total_ncols=dims_out[1];
// Defining submatrices.
const size_t NR=10000, NC=100;
hsize_t h5_start[2], h5_count[2];
h5_start[0]=0;
h5_start[1]=0;
h5_count[0]=1;
h5_count[1]=NC;
{
hspace.selectNone();
clock_t start=std::clock();
size_t counter=0;
for (size_t i=0; i<NR; ++i) {
counter += (consec ? 1 : 2); // non-consecutive.
h5_start[0]=counter;
hspace.selectHyperslab(H5S_SELECT_OR, h5_count, h5_start);
}
clock_t end=std::clock();
std::cout << "Elapsed for union: " << double(end - start)/CLOCKS_PER_SEC << std::endl;
}
std::vector<double> storage(NR * NC);
h5_count[0]=NR;
H5::DataSpace outspace(2, h5_count);
outspace.selectAll();
{
clock_t start=std::clock();
hdata.read(storage.data(), H5::PredType::NATIVE_DOUBLE, outspace, hspace);
clock_t end=std::clock();
std::cout << "Elapsed for read: " << double(end - start)/CLOCKS_PER_SEC << std::endl;
}
double total = std::accumulate(storage.begin(), storage.end(), 0.0);
std::cout << "Total is: " << total << std::endl;
return 0;
}
I should point out that the above example takes every second row for the sake of simplicity; our actual use case does not have such a predictable structure, so we can’t use stride
in selectHyperslab
.
We ran this on a HDF5 file containing a gene expression matrix for about 30000 genes (columns) and 1 million cells (rows), which can be obtained from https://experimenthub.bioconductor.org/fetch/1040. Compiled against the HDF5 library (version 1.10.3), I get:
$ ./HDF5UnionTester ~/.ExperimentHub/1040 counts 1 # consecutive
Elapsed for union: 0.002197
Elapsed for read: 0.023256
Total is: 198451
$ ./HDF5UnionTester ~/.ExperimentHub/1040 counts 0 # non-consecutive
Elapsed for union: 4.7186
Elapsed for read: 0.227834
Total is: 190563
The longer read I can understand, as it involves twice as many chunks (though I am surprised it’s 10 times longer rather than just 2-fold); however, the time taken by the union is very unexpected. At this rate, it seems faster to just call DataSet::read()
multiple times instead of using a single call with a union… which seems to defeat the purpose of having this union capability in the first place.