Simple Question: Selecting just a 2D slice from 3D data


#1

I’m just beginning to program with HDF5 with the C libraries. I have some experience using h5py.

I have a 3D data set representing x,y, and t (time). I just want to select an x,y slice from a given time. This seems to be a job for the hyperslab select functions, but in the examples it says:

The memory dataspace passed to the read or write call must contain the same number of elements as the file dataspace.

So this means if I have a 100x100x500 dataset in a file, I have to make a 100x100x500 array just to read one time slice? This is not practical for my larger datasets. I want to have a 2D 100x100 array that I can store a slice from a given time.

Am I misunderstanding how to use hyperslabs? What is the best way to get a time slice from a dataset?


#2

Hi @andrew.h.gibbons,

Yes, it is possible to select just a 2D slice from 3D data. The way to do this, like you wrote, is through the usage of hyperslabs. One thing to keep in mind is that, when using hyperslabs, the dataset should be chunked with appropriate dimensions (i.e. according to the intrinsic logic of your use-case) so that the HDF5 library may read (or write) the dataset partially; otherwise, if the dataset is contiguous, the HDF5 library will read the entire dataset and only then slice the data according to your hyperslab.

Using HDFql in C your use-case could be solved as follows (as an example):

#include "HDFql.h"

int main(int argc, char *argv[])
{
    // declare variable 'data' to hold slice (i.e. chunk) of data
    int data[100][100];

    // create HDF5 file 'example.h5' and use (i.e. open) it
    hdfql_execute("create and use file example.h5");

    // create chunked (size 100x100x1) HDF5 dataset 'my_dataset' of type int (size 100x100x500)
    hdfql_execute("create chunked(100, 100, 1) dataset my_dataset as int(100, 100, 500)");

    // fill dataset 'my_dataset' with values

    // register variable 'data' for subsequent usage (by HDFql)
    hdfql_variable_register(&data);

    // read slice (i.e. chunk) #0 from dataset 'my_dataset' using an hyperslab (:::, :::, 0:::1)
    hdfql_execute("select from my_dataset(:::, :::, 0:::1) into memory 0");

    // process slice (i.e. chunk) #0 stored in variable 'data'

    // read slice (i.e. chunk) #50 from dataset 'my_dataset' using an hyperslab (:::, :::, 50:::1)
    hdfql_execute("select from my_dataset(:::, :::, 50:::1) into memory 0");
   
    // process slice (i.e. chunk) #50 stored in variable 'data'

    return 0;
}

#3

Thank you, I was looking at the chunking concept and was unsure whether I should use it. I will try out your example.

Thanks again


#4

The simplest case when the data is a few hundred megabyte – slurping it up in a single IO operation will not have impact on the application performance; and you have no intention to extend the data set. Adding complexity will slow you down implementing it (perusing docs, etc…) with little to no benefit:

100x100x500 @double is 5x10^6 x sizeof(double) which is 40MB on a 5 yo laptop this should take about 1ms on a newer one with nvme I ballpark it to 0.2ms.

Alternatively you are dealing with large datasets. There are two sub-cases here:

  1. there is a deadline, soft or hard
  2. total utilisation of IO bandwidth

Either way you need to use chunks, and for optimal performance have to factor in where the data is located. For Ethernet or local hardrive 1MB chunk size seems to be good option; whereas infiniband like fabrics may prefer smaller chunks.

When using chunks you are manipulating the data layout, the elements within a chunk are adjacent/contiguous which makes it cheap to access as a unit. The typical example is a massive in-memory matrix scanned by rows, then by columns – it turns out this makes a difference. (actually on CUDA and similar architecture this makes a big difference and is called coalescing
In your case you have to be sure you pack as much related data in a single chunk as possible – and the chunk size is roughly 512KB - 1MB. – To lower IO latency you can go down to the kernel page size, which may be architecture dependent: 4kB - 64kB, on my linux it is 4kB:

cd /proc/1
sudo grep -i pagesize smaps

So for I opened up a few considerations, and didn’t giv emuch thought to syntax, which should be similar to python h5py. Luckily few years ago @gheber convinced me of the advantage of pythonic syntax, so the following C++ example gets as close to it as possible:

#include <armadillo> // I am using linear algabra library here, as it has matrix
#include <h5cpp/all>
int main(){
   arma::mat M(100,100); // http://arma.sourceforge.net/docs.html

   h5::fd_t fd = h5::create("some_file_name.h5",H5F_ACC_TRUNC);
  // this is an extendable data set, with initial size, compressed with gzip,
  // and set to fill value `-1.0`
   h5::ds_t ds = h5::create<double>(fd,"/path/to/dataset"
     ,h5::current_dims{1,100,100}
     ,h5::max_dims{H5S_UNLIMITED,100,100}
     ,h5::chunk{1,100,100} | h5::fill_value<double>{-1.0} | h5::gzip{9});

   h5::write( ds, M, h5::offset{0,0,0});
} // RAII will close all resources

The above example is from H5CPP examples: linear algebra, and in real applications the fill value is rarely set. Also for single shot write there is no reason to create a data set upfront, much cleaner this way:

#include <armadillo> // I am using linear algabra library here, as it has matrix
#include <h5cpp/all>
int main(){
   arma::mat M(100,100); // http://arma.sourceforge.net/docs.html

   h5::fd_t fd = h5::create("some_file_name.h5",H5F_ACC_TRUNC);
  // this is identical with the previous example
   h5::ds_t ds = h5::write(fd,"/path/to/dataset", M,
     h5::current_dims{1,100,100}, h5::offset{0,0,0},
     h5::chunk{1,100,100} ,h5::max_dims{H5S_UNLIMITED,100,100});
   // NOT identical, but practical for single shot, non-extendable datasets
   h5::write(fd,"/some/other dataset", M);
} // RAII will close all resources

Idiomatically shows single shot create + write operation for two distinct cases, but in some cases you may need to record data from a stream: sensor networks, stock market, camera, set of microphones, …

#include <armadillo> // I am using linear algabra library here, as it has matrix
#include <h5cpp/all>
int main(){
   size_t nframes=1'000'000;
   arma::mat M(100,100); // http://arma.sourceforge.net/docs.html
   
   h5::fd_t fd = h5::create("some_file_name.h5",H5F_ACC_TRUNC);
   h5::pt_t pt = h5::create<double>(fd, "stream of matrices",
      h5::max_dims{H5S_UNLIMITED,nrows,ncols}, h5::chunk{1,100,100} );
   // actual code, you may insert arbitrary number of frames: nrows x ncols
   for( int i = 0; i < nframes; i++)
      h5::append(pt, M);
} // RAII will close all resources

With the exception of packet table h5:pt_t all resources/descriptors are binary compatible with the HDF5 CAPI, in other words you can pass appropriate hid_t to H5CPP IO operators, or versa, H5CPP descriptors to CAPI function calls.

H5CPP is an MIT licensed project with LLVM based optional compiler assisted introspection, . You find the documentation here and download it from my github page.

best wishes: steven


#5

Thanks for the response. I need some time to go through your reply in detail, but I will try to handle all the data at once without complicated splitting or processing first. I think the data sets at the moment can fit in a single IO operation