Quadratic runtime of selecting N hyperslabs

Summary

When building a selection from N slabs via repeated calls to H5Sselect_hyperslab, we observe that each call has runtime linear in the number of slabs already added. Hence the overall runtime is quadratic in the number of slabs.

This pattern arises naturally in our I/O workloads. We commonly have indirect access: to read values of a dataset x, we first look up the index ranges [index_range(k) for k in range(N)] and then load the indicated values from x, e.g. by OR’ing the index ranges into a single hyperslab selection.
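To make the pattern concrete, here is a minimal sketch of such an indirect read; the dataset name "x", the 1-D layout, the double element type, and the read_indirect signature are hypothetical placeholders rather than our actual code:

// Hypothetical sketch of the indirect-access pattern described above.
#include <hdf5.h>
#include <utility>
#include <vector>

std::vector<double> read_indirect(hid_t file_id,
                                  const std::vector<std::pair<hsize_t, hsize_t>>& ranges) {
  hid_t dset_id = H5Dopen2(file_id, "x", H5P_DEFAULT);
  hid_t file_space_id = H5Dget_space(dset_id);

  // OR all index ranges (assumed disjoint) into a single selection;
  // this loop is the step whose runtime turns out to be quadratic in
  // ranges.size().
  H5Sselect_none(file_space_id);
  hsize_t total = 0;
  for(const auto& [begin, end] : ranges) {
    hsize_t offset[1] = {begin};
    hsize_t count[1] = {end - begin};
    H5Sselect_hyperslab(file_space_id, H5S_SELECT_OR, offset, nullptr, count, nullptr);
    total += end - begin;
  }

  // Read the selected elements into a contiguous buffer.
  std::vector<double> buffer(total);
  hid_t mem_space_id = H5Screate_simple(1, &total, nullptr);
  H5Dread(dset_id, H5T_NATIVE_DOUBLE, mem_space_id, file_space_id, H5P_DEFAULT, buffer.data());

  H5Sclose(mem_space_id);
  H5Sclose(file_space_id);
  H5Dclose(dset_id);
  return buffer;
}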

Reproducer

Here’s a simple reproducer:

// file: linear.cpp
#include <hdf5.h>
#include <chrono>
#include <iostream>

// Create a 2-D dataspace with an empty selection.
hid_t select_none(hsize_t * dims) {
  auto space_id = H5Screate_simple(2, dims, nullptr);
  H5Sselect_none(space_id);

  return space_id;
}


// Build the union of n_slabs disjoint 1x4 hyperslabs, one
// H5Sselect_hyperslab call per slab.
hid_t select(hsize_t * dims, size_t n_slabs) {
  hid_t space_id = select_none(dims);
  for(size_t i = 0; i < n_slabs; ++i) {
    hsize_t offset[2] = {i * 3, 0};
    hsize_t count[2] = {1, 4};

    H5Sselect_hyperslab(space_id, H5S_SELECT_OR, offset, nullptr, count, nullptr);
  }

  return space_id;
}

int main() {
  size_t n = 100000000, m = 20;
  hsize_t dims[] = {n, m};

  size_t max_k = 16;

  // Double the number of slabs on every run; we time only the
  // construction of the selection, no file I/O is involved.
  for(size_t k = 10; k < max_k; ++k) {
    size_t nsel = 1 << k;

    auto t1 = std::chrono::steady_clock::now();

    auto space_id = select(dims, nsel);

    auto t2 = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() / 1e3;

    std::cout << nsel << ", " << elapsed << "\n";

    H5Sclose(space_id);
  }

  return 0;
}

which we compile and run as follows:

$ g++ -O2 -g linear.cpp -o linear -lhdf5 && ./linear
1024, 0.008
2048, 0.03
4096, 0.166
8192, 0.935
16384, 4.92
32768, 22.448

The first column contains the number of hyperslabs being OR’ed together, the second the runtime in seconds. Each doubling of the slab count multiplies the runtime by a factor of roughly five, where pure quadratic growth would predict a factor of four, so the runtime is clearly quadratic, if not slightly worse.

Workaround

We noticed that H5Scombine_select is not quadratic in the number of selected slabs. Since the union of sets is associative, one can instead use H5Scombine_select to recursively combine the first and second halves of the hyperslabs. Here’s the code to do so:

// file: divide-and-conquer.cpp
#include <hdf5.h>
#include <chrono>
#include <iostream>
#include <utility>

// Create a 2-D dataspace with an empty selection.
hid_t select_none(hsize_t * dims) {
  auto space_id = H5Screate_simple(2, dims, nullptr);
  H5Sselect_none(space_id);

  return space_id;
}


// Build the union of the slabs in [i_min, i_max) by divide and conquer:
// recursively select each half, then merge the two halves with a single
// H5Scombine_select call.
hid_t select(hsize_t * dims, const std::pair<size_t, size_t>& range) {
  auto [i_min, i_max] = range;

  // Base case: empty range.
  if(i_min == i_max) {
    return select_none(dims);
  }

  // Base case: a single slab.
  if(i_max - i_min == 1) {
    auto space_id = select_none(dims);

    hsize_t offset[2] = {i_min * 3, 0};
    hsize_t count[2] = {1, 4};

    H5Sselect_hyperslab(space_id, H5S_SELECT_OR, offset, nullptr, count, nullptr);

    return space_id;
  }

  // Recursive case: split at the midpoint and combine the halves.
  size_t piv = i_min + (i_max - i_min) / 2;

  auto left_space_id = select(dims, {i_min, piv});
  auto right_space_id = select(dims, {piv, i_max});

  auto space_id = H5Scombine_select(left_space_id, H5S_SELECT_OR, right_space_id);

  H5Sclose(left_space_id);
  H5Sclose(right_space_id);

  return space_id;
}

int main() {
  size_t n = 100000000, m = 20;
  hsize_t dims[] = {n, m};

  size_t max_k = 16;

  for(size_t k = 10; k < max_k; ++k) {
    size_t nsel = 1 << k;

    auto t1 = std::chrono::steady_clock::now();

    auto space_id = select(dims, {0, nsel});

    auto t2 = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() / 1e3;

    std::cout << nsel << ", " << elapsed << "\n";

    H5Sclose(space_id);
  }

  return 0;
}

We again compile and run as follows:

$ g++ -std=c++17 -O2 -g divide-and-conquer.cpp -o divide-and-conquer -lhdf5 && ./divide-and-conquer
1024, 0.001
2048, 0.001
4096, 0.003
8192, 0.008
16384, 0.017
32768, 0.035

The runtime now roughly doubles whenever the number of slabs doubles, i.e. it scales close to linearly.

Related Topics

There are several (very old) related topics, but none of them seems to acknowledge the issue or provide a workaround.


If high-speed hyperslab access is your only constraint, then use contiguous storage rather than chunked. There is no lookup overhead in the library for accessing hyperslabs in contiguous storage.


Thank you for your reply. Unfortunately, I don’t understand how it relates to the issue at hand: contiguous vs. chunked storage is not part of the issue and is irrelevant to the performance observed. The issue does not occur during H5Dread; in fact, no files, datasets, or datatypes are involved.

Let me clarify by creating a shorter pseudo-code reproducer:

auto t1 = start_timer();
auto space_id = H5Screate_simple(2, dims, nullptr);
for(size_t i = 0; i < n; ++i) {
  hsize_t offset[2] = {i * 3, 0};
  hsize_t count[2] = {1, 4};

  H5Sselect_hyperslab(space_id, H5S_SELECT_OR, offset, nullptr, count, nullptr);
}
auto t2 = stop_timer();

The time between t1 and t2 grows quadratically with n when it should be log-linear. Let me stress again: there are no datasets or actual reads involved; it’s purely about building the dataspace. The performance bug is entirely in H5Sselect_hyperslab.

Looking at the HDF5 code in question, e.g. H5S__hyper_merge_spans_helper, I see the following pattern:

  1. A linked list of spans (I suspect it’s sorted).
  2. A while loop that checks whether the new span is strictly before, overlapping with (in several variations), or strictly after the current span; performs the appropriate action; and then advances to the next span via ->next.

I can’t see any sign of a skip list or anything else that would allow the code to perform the search in less than linear time. IIUC the problem is equivalent to using std::find (a linear scan) instead of std::lower_bound (bisection) to locate an element in a sorted range.
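To illustrate the analogy with a self-contained, standard-library-only sketch (unrelated to the HDF5 internals themselves):

#include <algorithm>
#include <cstddef>
#include <vector>

// Linear scan: visits O(n) elements even though the range is sorted.
bool contains_linear(const std::vector<std::size_t>& sorted, std::size_t x) {
  return std::find(sorted.begin(), sorted.end(), x) != sorted.end();
}

// Bisection: O(log n) comparisons, but requires random access.
bool contains_bisect(const std::vector<std::size_t>& sorted, std::size_t x) {
  auto it = std::lower_bound(sorted.begin(), sorted.end(), x);
  return it != sorted.end() && *it == x;
}

Note that bisection requires random access (or a balanced tree / skip list); a plain singly linked list only supports the linear scan, which matches what the span-merging code appears to do.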


Luc, thank you for clarifying. I misunderstood your first example because I did not study it carefully. I think you are right: it should be possible to improve H5S__hyper_merge_spans_helper for adding new slabs to an already large span tree, but at the expense of more complexity.

What’s the HDF Group’s proposed path forward?

I see two approaches:

  1. Fix the performance bug in H5S__hyper_merge_spans_helper so that it has the right algorithmic complexity when the trees are very unbalanced. I’d expect that by using bisection it should be possible to merge two selections of sizes N and M in time max(log(max(N, M)), min(N, M)). That means that if one tree has size N and we insert a single slab, i.e. the case for (some) H5Sselect_hyperslab calls, the merge runs in log time, while merging two balanced trees runs in time proportional to the total number of elements.
  2. Expect users not to call H5Sselect_hyperslab when merging hyperslabs in bulk, and instead to use H5Scombine_select, e.g. in a manner similar to the workaround sketched above. This could be achieved by documenting the complexity of H5Sselect_hyperslab and H5Scombine_select clearly and by referring to H5Scombine_select from H5Sselect_hyperslab; a sketch of what such a bulk-combine helper could look like follows below. If so, it would be very helpful if H5Scombine_select could be polished to allow combining NONE and ALL selections (see separate topic).
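Here is a minimal sketch of such a helper, assuming the second approach; the name combine_all is hypothetical, and it expects a non-empty vector of valid dataspace ids whose ownership it takes over (it closes them):

#include <hdf5.h>
#include <utility>
#include <vector>

// Hypothetical helper: OR-combine a list of selections pairwise, bottom-up,
// so that each intermediate selection participates in O(log n) combines.
// Closes all input dataspace ids; returns a freshly created one.
hid_t combine_all(std::vector<hid_t> spaces) {
  while(spaces.size() > 1) {
    std::vector<hid_t> next;
    for(size_t i = 0; i + 1 < spaces.size(); i += 2) {
      next.push_back(H5Scombine_select(spaces[i], H5S_SELECT_OR, spaces[i + 1]));
      H5Sclose(spaces[i]);
      H5Sclose(spaces[i + 1]);
    }
    if(spaces.size() % 2 == 1) {
      // An odd leftover is carried over to the next round unchanged.
      next.push_back(spaces.back());
    }
    spaces = std::move(next);
  }
  return spaces.front();
}

This is the same divide-and-conquer idea as the recursive workaround above, written iteratively; like the recursive version, it would benefit from H5Scombine_select handling NONE and ALL selections gracefully.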