Using HDF5 for file backed Array


#1

Has anyone developed (and is willing to share) a C or C++ implementation of an “Array” class that uses HDF5 as a sort of memory mapped file? By this I mean I would like to create this “DataArray” class like the following:

// Describe the dimensions. This example would mimic a 3D regular grid of size 10 x 20 x 30
// with another 2D array of 60 x 80 at each grid point.
std::array<size_t, 3> tupleDimensions = {10, 20, 30};
std::array<size_t, 2> componentDimensions = {60, 80};

// Instantiate the DataArray class as an array of floats using the file
// "/tmp/foo.hdf5" as the hdf5 file and "/data" as the internal hdf5 path to the actual data
DataArray<float> data("/tmp/foo.hdf5", "/data", tupleDimensions, componentDimensions);

// Now let's loop over the data
for(size_t z = 0; z < 10; z++)
{
  for(size_t y = 0; y < 20; y++)
  {
    for(size_t x = 0; x < 30; x++)
    {
      // Index of the first component of tuple (z, y, x), assuming row-major order
      size_t index = ((z * 20 + y) * 30 + x) * (60 * 80);
      for(size_t pixel = 0; pixel < 80 * 60; pixel++)
      {
        data[index + pixel] = 0;
      }
    }
  }
}
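
One way to compute the tuple index used in the loop above, as a minimal sketch. It assumes a row-major (C-order) layout with x varying fastest and all 60 x 80 components of a tuple stored contiguously; tupleOffset is a hypothetical helper name, not part of HDF5 or of any existing DataArray class.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (assumption, not a library function): flat offset of the
// first component element of tuple (z, y, x). Assumes row-major layout with
// tupleDims = {nz, ny, nx} and the components of each tuple stored contiguously.
inline std::size_t tupleOffset(const std::array<std::size_t, 3>& tupleDims,
                               const std::array<std::size_t, 2>& compDims,
                               std::size_t z, std::size_t y, std::size_t x)
{
    const std::size_t componentsPerTuple = compDims[0] * compDims[1];
    const std::size_t tupleIndex = (z * tupleDims[1] + y) * tupleDims[2] + x;
    return tupleIndex * componentsPerTuple;
}
```

With the dimensions from the example, the last tuple (9, 19, 29) starts at 5999 * 4800, and adding one tuple's worth of components lands exactly at the total element count of 6000 * 4800.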

Has anyone tried something like this before? We are currently writing a Zarr array abstraction layer, but recent conversations seem to indicate that we should also be able to do this directly with HDF5 (since we use HDF5 as our native file format anyway).

Thanks
Mike Jackson


#2

I am working on a std::container<T, ...> mapping for HDF5. The specifics you are asking about are not currently published, so the answer is maybe. std::vector<T> and all supported LINALG/BLAS types are zero copy for element type T, with the exception of std::string, which results in variable-length data.

Why specifically std::array? This construct can be replaced with h5::read<T>(ds, target_memory_location, h5::count{...}); where target_memory_location is a typed memory location managed by you.
Indeed, my h5::write(..., std::array<T, ...>{}, ...) implementation does exactly this delegation, with a template meta-programming front end.

See the H5CPP slides and documentation, or reach out on this forum.
steve


#3

Steven, thanks for the slides. H5CPP looks interesting. How does the generation work on MSVC-based systems? We don’t typically spin up any kind of LLVM on those systems, although Microsoft is now offering it as an optional install of Visual Studio.

I just used std::array<> to indicate some kind of “container” to describe the dimensions of the data, which could be construed as the chunking size also (maybe?). Really any STL container should work. Worst case is std::span around a pointer to an array.

Looking through the slides was a great introduction, but I have a few questions:

Say I want a 200GB array and I start stepping through that data. Does H5CPP automatically figure out when to page in a chunk, and does it page in the entire chunk at a time? We typically deal with known sizes of arrays at the outset of the algorithms, although we have areas where we extend datasets as the algorithm needs to store larger amounts of data.


#4

Hi @mike.jackson,

Are you looking for a way to process data in an out-of-core manner? We are currently developing a solution in HDFql which achieves this functionality in a seamless fashion from the user's point of view. Please let us know if you would like to have additional details about it.


#5

Yes, we are developing an out-of-core feature for our open-source program (DREAM3D: http://www.github.com/bluequartzsoftware/dream3d).


Mike J.


#6

When you know the coordinates, h5::count{..} and h5::offset{..} are the way to go; otherwise you have options:

  1. don’t do anything in this running thread, leave it to the OS: the embarrassingly parallel pattern has the fewest wasted cycles
  2. put the green thread to sleep, pre-fetch on a real thread, and reschedule the sleeping task when conditions are met
  3. predictive controller: basically you are replacing HDF5, OS, … caching

Alternatively, if you post your pseudo code for paging, I will take a closer look at it and see what the simplest way is to get what you want. All in all, C++ gives so much freedom that there is no one solution for every case. However, sometime in 2011 I did implement a threaded pre-fetch mechanism for std::vector<T> with a custom std::iterator for the predecessor of H5CPP.

steve


#8

Hi @mike.jackson,

Not sure how familiar you are with HDFql, but there are basically two ways to read/retrieve HDF5 data with this API:

  1. A user-defined memory (i.e. a variable which can be, e.g., a std::array). Example: SELECT FROM dset INTO MEMORY 0 (this reads dataset dset and populates a variable assigned to number 0).

  2. A cursor. Example: SELECT FROM dset INTO CURSOR (this reads dataset dset and populates HDFql cursor - users can then iterate over this cursor and retrieve data from it).

We are now extending option 2 to help users process huge HDF5 datasets that do not fit in main memory (RAM) more easily. Basically, it consists of extending traditional cursors (in HDFql) with sliding capabilities, enabling out-of-core operations seamlessly. Canonically, reading a dataset and populating a sliding cursor with it looks as follows:

SELECT FROM dset INTO [SLIDING[(elements)]] CURSOR

In detail, the parameter elements (which defaults to 1 if not defined) specifies how many elements (i.e. slice/subset) of dataset dset are read to populate the sliding cursor (HDFql uses a hyperslab under the hood to read this number of elements without users knowing it). When users try to retrieve data from the sliding cursor that falls outside the range it stores (i.e. before the first or after the last element stored in the sliding cursor), HDFql automatically: 1) discards the data currently stored in the sliding cursor, 2) reads a new slice/subset of data from dset (again thanks to a hyperslab), and 3) populates the sliding cursor with the new slice/subset of data.
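
The discard/read/populate cycle described above can be sketched in a few lines of plain C++. This is only an illustration of the idea, not HDFql's implementation or API: SlidingCursor, next() and loads() are hypothetical names, the "dataset" is mocked by a std::vector, and the sketch is forward-only, whereas the description above also covers moving before the first stored element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in (assumption, NOT HDFql code): a forward-only cursor
// that keeps at most `window` elements of a larger dataset in memory.
// A real version would issue a hyperslab read where loadWindow() copies.
class SlidingCursor
{
public:
    SlidingCursor(const std::vector<double>& dataset, std::size_t window)
    : m_dataset(dataset), m_window(window == 0 ? 1 : window) {}

    // Advance to the next element; returns nullptr once the dataset is exhausted.
    const double* next()
    {
        if (m_pos == m_buffer.size())   // cursor moved past the stored slice
        {
            if (!loadWindow()) return nullptr;
        }
        return &m_buffer[m_pos++];
    }

    // Number of slices read so far (each one stands for one hyperslab read).
    std::size_t loads() const { return m_loads; }

private:
    bool loadWindow()
    {
        if (m_next >= m_dataset.size()) return false;
        const std::size_t n = std::min(m_window, m_dataset.size() - m_next);
        // 1) discard the current slice, 2) read the next one, 3) repopulate the buffer
        m_buffer.assign(m_dataset.data() + m_next, m_dataset.data() + m_next + n);
        m_next += n;
        m_pos = 0;
        ++m_loads;
        return true;
    }

    const std::vector<double>& m_dataset;  // mocked dataset; must outlive the cursor
    std::size_t m_window;
    std::vector<double> m_buffer;          // the only data held "in memory"
    std::size_t m_pos = 0;                 // position inside m_buffer
    std::size_t m_next = 0;                // start of the next slice in the dataset
    std::size_t m_loads = 0;
};
```

A caller simply loops with while (const double* p = cursor.next()) and never sees the slice boundaries, which is the property the sliding cursor is meant to provide.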

As an example, let’s imagine a three-dimensional dataset (of size 1000x4096x4096) of type double which is chunked (with size 5) in its first dimension. We want to print the number of elements stored in the dataset that are smaller than 20. Clearly, a standard machine can’t fit the entire dataset in its main memory (as the size is 125 GiB, roughly 134 GB). A typical solution is to loop over the first dimension of the dataset and, in each iteration, read a slice/subset (or chunk) of the data using a hyperslab. Using a traditional (i.e. non-sliding) cursor, this looks like the following in C++:

char script[100];
long long count;   // 64-bit: the dataset holds more elements than an int can count
int i;

count = 0;
for(i = 0; i < 1000; i += 5)
{
    // retrieve a slice/subset thanks to an explicit hyperslab (with start=[0,5,10,...], stride=5, count=1 and block=5)
    snprintf(script, sizeof(script), "SELECT FROM dset[%d:5:1:5] INTO CURSOR", i);
    HDFql::execute(script);
    while(HDFql::cursorNext() == HDFql::Success)
    {
        if (*HDFql::cursorGetDouble() < 20)
        {
            count++;
        }
    }
}
std::cout << "Count: " << count << std::endl;

This solution is already simple when compared with other APIs. However, with an HDFql sliding cursor, the same example becomes even simpler (as the user doesn’t need to specify a hyperslab at all):

long long count;   // 64-bit: the dataset holds more elements than an int can count

HDFql::execute("SELECT FROM dset INTO SLIDING(5) CURSOR");

count = 0;
while(HDFql::cursorNext() == HDFql::Success)   // whenever cursor goes beyond last element, HDFql automatically retrieves a new slice/subset thanks to an implicit hyperslab (with start=[0,5,10,...], stride=5, count=1 and block=5)
{
    if (*HDFql::cursorGetDouble() < 20)
    {
        count++;
    }
}
std::cout << "Count: " << count << std::endl;
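
To put numbers on the example, here is a small sketch of the size arithmetic. It assumes an 8-byte double; the constant and function names are made up for illustration. It shows that the full dataset is 125 GiB, that each slab of 5 along the first dimension is 640 MiB, that 200 slab reads cover the dataset, and that the element count exceeds what a 32-bit int counter can hold.

```cpp
#include <cassert>
#include <cstdint>

// Shape and slab size taken from the example; names are illustrative only.
constexpr std::uint64_t kDim0 = 1000;
constexpr std::uint64_t kDim1 = 4096;
constexpr std::uint64_t kDim2 = 4096;
constexpr std::uint64_t kElemBytes = 8;   // assumption: 8-byte IEEE double
constexpr std::uint64_t kSlabRows = 5;    // slab/chunk extent in the first dimension

constexpr std::uint64_t totalElements() { return kDim0 * kDim1 * kDim2; }
constexpr std::uint64_t datasetBytes()  { return totalElements() * kElemBytes; }
constexpr std::uint64_t slabBytes()     { return kSlabRows * kDim1 * kDim2 * kElemBytes; }
constexpr std::uint64_t slabCount()     { return (kDim0 + kSlabRows - 1) / kSlabRows; }

// Compile-time checks of the figures quoted in the text.
static_assert(datasetBytes() == 134217728000ULL, "about 134 GB");
static_assert((datasetBytes() >> 30) == 125, "exactly 125 GiB");
static_assert((slabBytes() >> 20) == 640, "640 MiB per slab");
static_assert(slabCount() == 200, "200 slab reads");
static_assert(totalElements() > 2147483647ULL, "more elements than INT32_MAX");
```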

I hope the explanation and example illustrate this new sliding cursor feature well (it will be available in the next release of HDFql), and that it may eventually help you (and others) in your project!