Performance reading data with non-contiguous selection

I’m hitting a performance issue which seems very similar to this thread from 2012.

I can select 6000 entries along the first dimension - about 750 MB of data - and read it in about 300 ms. But if I instead select blocks of 3 entries at a stride of 100 - only 3% of that data volume - and read them into a contiguous array, it takes much longer, about 2.4 seconds. However, if I read into a matching selection in memory, it’s fast again (~8 ms). Using h5py, the three options look like:

import h5py
import numpy as np

# File and dataset names here are placeholders for my actual data
f = h5py.File("data.h5", "r")
ds = f["data"]  # shape (76800, 1, 256, 256), dtype uint16
arr = np.empty((6000,) + ds.shape[1:], dtype=ds.dtype)

# Read all the data - fast (~300 ms)
ds.read_direct(arr, source_sel=np.s_[:6000])

# Read a small subset of data with strides - slow (~2.4 s)
sel = h5py.MultiBlockSlice(start=0, stride=100, count=60, block=3)
ds.read_direct(arr, source_sel=sel, dest_sel=np.s_[:180])

# Read the small subset to a matching selection in memory - fast (~8 ms)
ds.read_direct(arr, source_sel=sel, dest_sel=sel)

The h5py read_direct method is largely a wrapper around H5Dread, and from looking at the code plus what I can easily profile in Python, I’m pretty confident it’s the call to H5Dread that’s taking the time.
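For reference, this is roughly how I timed the two destination options from Python (reusing ds, arr and sel from the snippet above; the exact numbers will obviously vary):

import time

def timed_read(dest_sel):
    # Time a single read_direct call with the strided source selection
    t0 = time.perf_counter()
    ds.read_direct(arr, source_sel=sel, dest_sel=dest_sel)
    return time.perf_counter() - t0

print(f"contiguous destination: {timed_read(np.s_[:180]):.3f} s")  # ~2.4 s for me
print(f"matching destination:   {timed_read(sel):.3f} s")          # ~0.008 s for me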

Background info: dataset shape is (76800, 1, 256, 256), data type is uint16, storage is chunked with chunk shape (32, 1, 256, 256). There should be no data type conversion involved (my array in memory has the same datatype as the dataset). I’m using the default file driver on Linux, with h5py 3.4.0 and HDF5 1.12.1.
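In case anyone wants to reproduce this, here is a minimal sketch to create a file with the same shape, dtype and chunk layout (the file and dataset names are made up; I fill the region being read so that those chunks are actually allocated on disk):

import h5py
import numpy as np

with h5py.File("repro.h5", "w") as f:
    ds = f.create_dataset(
        "data",
        shape=(76800, 1, 256, 256),
        dtype=np.uint16,
        chunks=(32, 1, 256, 256),
    )
    # Write the first 6000 entries so those chunks exist on disk
    ds[:6000] = np.ones((6000, 1, 256, 256), dtype=np.uint16)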

In the 2012 thread, @epourmal said that this was a known issue, and it had been entered in the bug tracker. Does the same issue still affect the latest version of HDF5? Or has that been fixed and I’m seeing something similar but with a different root cause?

If necessary, I can probably come up with a workaround using the matching selections and then copying data together afterwards, similar to what Chris described in that thread. Even from Python, this could be significantly faster. But it’s fiddly, and I’d much rather discover that it’s not necessary.

Update: I’ve now made a workaround for this, doing something similar to what Chris found back in 2012. This is roughly what it looks like in Python:

# Read part of the dataset into a matching array in memory
# (tmp looks huge, but see the note on calloc below)
tmp = np.zeros(ds.shape, dtype=ds.dtype)
sel = h5py.MultiBlockSlice(start=0, stride=100, count=60, block=3)
ds.read_direct(tmp, source_sel=sel, dest_sel=sel)

# Translate the HDF5 selection to a 1D mask array:
# blocks of 3 at a stride of 100, mirroring the MultiBlockSlice
mask = np.zeros(len(tmp), dtype=np.bool_)
for i in range(0, 6000, 100):
    mask[i:i+3] = True

# Copy the data that was just read into a smaller, contiguous array
res = np.compress(mask, tmp, axis=0)
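A possible generalisation, as an untested sketch: derive the mask from the MultiBlockSlice parameters (this assumes count was given explicitly, since h5py allows it to be omitted), so the same workaround handles any block pattern along the first axis:

def read_multiblock(ds, sel):
    # Read a MultiBlockSlice along axis 0 via a matching selection,
    # then compact the result into a contiguous array
    tmp = np.zeros(ds.shape, dtype=ds.dtype)
    ds.read_direct(tmp, source_sel=sel, dest_sel=sel)
    mask = np.zeros(len(tmp), dtype=np.bool_)
    for i in range(sel.start, sel.start + sel.stride * sel.count, sel.stride):
        mask[i:i + sel.block] = True
    return np.compress(mask, tmp, axis=0)

res = read_multiblock(ds, h5py.MultiBlockSlice(start=0, stride=100, count=60, block=3))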

One note on memory: numpy.zeros() uses calloc() to allocate the temporary array, which lets the operating system commit physical memory only for the pages you actually write. The workaround is therefore less memory intensive than the full-size tmp array suggests. This blog post contains a nice explanation.
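A quick way to see this on Linux (a rough sketch; exact numbers depend on your system):

import resource
import numpy as np

def peak_rss_mb():
    # Peak resident set size; on Linux ru_maxrss is reported in KB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

big = np.zeros((76800, 1, 256, 256), dtype=np.uint16)  # ~9.4 GB virtual
print(f"after zeros: {peak_rss_mb():.0f} MB")  # far below 9.4 GB
big[:6000] = 1  # touch ~750 MB worth of pages
print(f"after write: {peak_rss_mb():.0f} MB")  # grows roughly by the amount touched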