Hi @mike.jackson,
Not sure how familiar you are with HDFql, but there are basically two ways to read/retrieve HDF5 data with this API:
- A user-defined memory (i.e. a variable, which can be, e.g., a std::array). Example: SELECT FROM dset INTO MEMORY 0 (this reads dataset dset and populates the variable assigned to number 0).
- A cursor. Example: SELECT FROM dset INTO CURSOR (this reads dataset dset and populates an HDFql cursor; users can then iterate over this cursor and retrieve data from it).
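To make #1 concrete, here is a minimal C++ sketch of the user-defined memory approach; it assumes a one-dimensional dataset dset with three elements of type int, and uses HDFql::variableRegister to assign the variable a number (0, as it is the first variable registered) - exact details may vary with the HDFql version you use:

```cpp
#include <iostream>

#include "HDFql.hpp"

int main()
{
    // variable that will hold the data read from dataset "dset"
    // (assumed here to be one-dimensional with three int elements)
    int data[3];

    // register the variable so that it gets assigned a number
    // (0, as it is the first variable registered)
    HDFql::variableRegister(data);

    // read dataset "dset" directly into the registered variable
    HDFql::execute("SELECT FROM dset INTO MEMORY 0");

    for (int i = 0; i < 3; i++)
    {
        std::cout << data[i] << std::endl;
    }

    // unregister the variable once it is no longer needed
    HDFql::variableUnregister(data);

    return 0;
}
```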
We are now extending #2 to help users process huge HDF5 datasets that do not fit in main memory (RAM) more easily. Basically, it consists of extending traditional HDFql cursors with sliding capabilities, enabling out-of-core operations seamlessly. Canonically, reading a dataset into a sliding cursor looks as follows:
SELECT FROM dset INTO [SLIDING[(elements)]] CURSOR
In detail, the parameter elements (which defaults to 1 if not specified) indicates how many elements (i.e. the slice/subset) of dataset dset are to be read and used to populate the sliding cursor (under the hood, HDFql uses a hyperslab to read this amount of elements, without users having to know about it). When users try to retrieve data from the sliding cursor that falls outside the range it currently stores (i.e. before the first or after the last element stored in the sliding cursor), HDFql automatically: 1) discards the data currently stored in the sliding cursor, 2) reads a new slice/subset of data from dset (again thanks to a hyperslab), and 3) populates the sliding cursor with this new slice/subset of data.
As an example, let's imagine a three-dimensional dataset (with size 1000x4096x4096) of type double which is chunked (with size 5) in its first dimension. We want to count the elements (stored in the dataset) that are smaller than 20. Clearly, a standard machine can't fit the entire dataset in main memory, as its size is 1000 × 4096 × 4096 × 8 bytes ≈ 125 GiB. A typical solution is to loop over the first dimension of the dataset and, in each iteration, read a slice/subset (or chunk) of the data using a hyperslab. With a traditional (i.e. non-sliding) cursor, this looks like the following in C++:
char script[100];
int count;
int i;

count = 0;
for(i = 0; i < 1000; i += 5)
{
    sprintf(script, "SELECT FROM dset[%d:5:1:5] INTO CURSOR", i); // retrieve a slice/subset thanks to an explicit hyperslab (with start=i, i.e. 0,5,10,..., stride=5, count=1 and block=5)
    HDFql::execute(script);
    while(HDFql::cursorNext() == HDFql::Success)
    {
        if (*HDFql::cursorGetDouble() < 20)
        {
            count++;
        }
    }
}
std::cout << "Count: " << count << std::endl;
This solution is already simple compared with other APIs. However, with an HDFql sliding cursor, solving this same example becomes even simpler, as the user doesn't need to specify a hyperslab at all:
int count;

HDFql::execute("SELECT FROM dset INTO SLIDING(5) CURSOR");
count = 0;
while(HDFql::cursorNext() == HDFql::Success) // whenever the cursor goes beyond its last element, HDFql automatically retrieves a new slice/subset thanks to an implicit hyperslab (with start=[0,5,10,...], stride=5, count=1 and block=5)
{
    if (*HDFql::cursorGetDouble() < 20)
    {
        count++;
    }
}
std::cout << "Count: " << count << std::endl;
Hope this explanation/example illustrates the new sliding cursor feature well (it will be available in the next release of HDFql), and that it may eventually help you (and others) in your project!