Logging contiguous data frames onto HDF5 datasets


#1

Hello,

In our application we log real-time data from a system that produces frames of data about 5 kB in size. We currently log at 25 Hz and would like to significantly increase the logging rate in the future. Each frame holds the values of many different variables (about 1200) from our system, so we keep a mapping that describes the offset and data type of each variable within the frame.

In the logging application, which logs to an HDF5 file, each time we receive a frame we iterate over the mapping and log each data element individually into its own dataset in the HDF5 file. Effectively doing:

  1. For each data element:
    1.1. Get the variable name, offset, and data type from the mapping
    1.2. Index into the data frame and copy the value
    1.3. Get a handle to the HDF5 dataset with the same name and write the value to it

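The loop above can be sketched in plain Python with the standard `struct` module (the mapping entries and frame layout below are invented for illustration; the real application presumably has its own ~1200-entry mapping):

```python
import struct

# Hypothetical excerpt of the offset/type mapping; the real application
# keeps ~1200 entries like these for its ~5 kB frame.
MAPPING = {
    "engine_rpm":   (0,  "<d"),  # little-endian float64 at offset 0
    "oil_temp":     (8,  "<f"),  # little-endian float32 at offset 8
    "status_flags": (12, "<I"),  # little-endian uint32 at offset 12
}

def parse_frame(frame, mapping):
    """Steps 1.1-1.2: pull each variable out of the raw frame."""
    values = {}
    for name, (offset, fmt) in mapping.items():
        (values[name],) = struct.unpack_from(fmt, frame, offset)
    return values

# Build a fake 5 kB frame and decode it.
frame = bytearray(5120)
struct.pack_into("<d", frame, 0, 2500.0)
struct.pack_into("<f", frame, 8, 90.5)
struct.pack_into("<I", frame, 12, 3)
values = parse_frame(bytes(frame), MAPPING)
# Step 1.3 would then write each values[name] to the dataset of the same name.
```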
This seems like it could be greatly improved by writing the whole data frame in one go each time a frame is received, but I’m not sure this use case is supported by HDF5, especially considering that we want to keep the hierarchical structure of the datasets in the HDF5 file.

Is it possible to set up an HDF5 file in this way?


#2

Logging is a stream; opening and closing handles will cost you a lot more than just keeping handles open.
Here is an example with h5cpp, and here are the ISC’19 presentation slides where I/O performance is compared. H5CPP packet table write: 290.57 MB/sec; POSIX I/O (baseline): 288.56 MB/sec.

|    experiment                               | time  | trans/sec | Mbyte/sec |
|:--------------------------------------------|------:|----------:|----------:|
|append:  1E6 x 64byte struct                 |  0.06 |   16.46E6 |   1053.87 |
|append: 10E6 x 64byte struct                 |  0.63 |   15.86E6 |   1015.49 |
|append: 50E6 x 64byte struct                 |  8.46 |    5.90E6 |    377.91 |
|append:100E6 x 64byte struct                 | 24.58 |    4.06E6 |    260.91 |
|write:  Matrix<float> [10e6 x  16] no-chunk  |  0.4  |    0.89E6 |   1597.74 |
|write:  Matrix<float> [10e6 x 100] no-chunk  |  7.1  |    1.40E6 |    563.36 |

The above chart was for high-frequency trading: online quotes and trade packets. The sustained throughput is limited by the SSD and the underlying file system performance.

best: steve


#3

Can you write all the variables to a single dataset, using a compound datatype?
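In h5py terms (a sketch, not the poster’s actual schema; the dataset and field names are invented), a compound datatype corresponds to a NumPy structured dtype, so the whole frame becomes one record appended per write:

```python
import h5py
import numpy as np

# Hypothetical subset of the frame's ~1200 variables as a structured dtype;
# h5py stores this as an HDF5 compound datatype.
frame_dtype = np.dtype([
    ("engine_rpm",   "<f8"),
    ("oil_temp",     "<f4"),
    ("status_flags", "<u4"),
])

# In-memory file so this sketch leaves nothing on disk.
with h5py.File("frames.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("frames", shape=(0,), maxshape=(None,),
                            dtype=frame_dtype, chunks=(1024,))

    # One incoming frame becomes one record; appending it is a single write.
    rec = np.array([(2500.0, 90.5, 3)], dtype=frame_dtype)
    dset.resize((dset.shape[0] + 1,))
    dset[-1:] = rec

    rpm = dset["engine_rpm"]  # per-field read: just this variable's column
```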

Quincey

#4

I’m not sure how to read ‘about’. Is the frame size fixed? Is there an upper bound?
If so, to write the whole frame in one shot, you could use an opaque type:
hid_t H5Tcreate( H5T_class_t class, size_t size ) with H5T_OPAQUE and your size.
A dataset of this type would hold one frame per element.
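The same idea in h5py (a sketch; the frame size and dataset name are assumptions) uses a NumPy void dtype tagged as H5T_OPAQUE, with one frame per dataset element:

```python
import h5py
import numpy as np

FRAME_SIZE = 5120  # assumed fixed frame size in bytes

# A fixed-size byte blob stored as an HDF5 opaque type.
frame_dt = h5py.opaque_dtype(np.dtype("V%d" % FRAME_SIZE))

# In-memory file so this sketch leaves nothing on disk.
with h5py.File("raw_frames.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("raw_frames", shape=(0,), maxshape=(None,),
                            dtype=frame_dt, chunks=(64,))

    frame = b"\x01" * FRAME_SIZE                 # one incoming frame, raw bytes
    rec = np.frombuffer(frame, dtype=frame_dt)   # zero-copy view, shape (1,)
    dset.resize((dset.shape[0] + 1,))
    dset[-1:] = rec                              # whole frame in one write

    stored = dset[0].tobytes()                   # round-trip check
```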

G.


#5

Hi Steve, we actually do not close/re-open the file handle on each received frame. The file handle remains open until the logging application is terminated.

Hi Quincey, I wasn’t aware of the compound datatype, but reading here it looks like it could work. Ideally the data would be presented to the user as if it were one dataset per variable; would aliasing allow this?

Hi G., the frame is a fixed size.

I’m not familiar with opaque types and can’t seem to find a good source documenting their use. Would you have a link to share?


#6

Unfortunately, compound datatypes don’t present as different datasets and there’s no aliasing currently (although it is something we’ve talked about! :-). Per-field access is possible though, if that works for you.
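For what per-field access looks like in h5py (a sketch with made-up names): `Dataset.fields` reads just the requested member of the compound type, without fetching whole records:

```python
import h5py
import numpy as np

dt = np.dtype([("a", "<f8"), ("b", "<i4")])
data = np.array([(1.0, 2), (3.0, 4)], dtype=dt)

# In-memory file so this sketch leaves nothing on disk.
with h5py.File("pf.h5", "w", driver="core", backing_store=False) as f:
    d = f.create_dataset("recs", data=data)
    a_only = d.fields("a")[:]  # reads only field "a" from the dataset
```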

Quincey

#7

https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5-examples/browse/1_10/C/H5T/h5ex_t_opaque.c

https://support.hdfgroup.org/ftp/HDF5/examples/python/hdf5examples-py/high_level/h5ex_t_opaque.py