HDF5 Table Performance


#1

Good morning,

I am new to using HDF5 but am so far pretty impressed. I have a couple of questions though.

Currently I am doing some performance studies on appending compound datasets to a single table. I iterate through a loop, appending some number of records/rows to the table, and measure performance. I am using the H5TB (Table) API append-records calls - are H5TB appends good performance-wise, or should lower-level HDF5 calls be made?

Also, does HDF5 have a concept of graceful shutdown? Say the system crashes before a record is written - does HDF5 roll back to the last valid point, like a database system would?

Thank you for any information/help.


#2

Hi Richard,
what programming language are you using?
best: steve


#3

Thank you for the response @Steven

I am using C


#4

@richard.haney while C is a great programming language and the HDF5 C API is a fast, scalable implementation, I recommend writing in C++ and then re-exporting those subroutines to C.

Alternatively, if you are interested in maximum performance, you need to use the H5D, H5F, H5A, … category function calls and build your own high-performance packet table.

Probably you are not interested in C++, but in case you change your mind, here is an excerpt of a CSV-to-HDF5 conversion example with compiler-assisted introspection. You can find the recent slides here and the documentation here.

/*
 * Copyright (c) 2018-2020 Steven Varga, Toronto,ON Canada
 * Author: Varga, Steven <steven@vargaconsulting.ca>
 */

#include "csv.h"
// data structure include file: `struct.h` must precede `generated.h`, as the latter
// contains dependencies on the former
#include "struct.h"

#include <h5cpp/core>      // has handle + type descriptors
// sandwiched: `h5cpp/io` depends on `generated.h`, which needs `h5cpp/core`
	#include "generated.h" // uses type descriptors
#include <h5cpp/io>        // uses generated.h + core 

int main(){

	// create HDF5 container
	h5::fd_t fd = h5::create("output.h5",H5F_ACC_TRUNC);
	// create dataset   
	// chunk size is unrealistically small; usually you would set it to roughly 1 MB, or an Ethernet jumbo-frame size
	h5::ds_t ds = h5::create<input_t>(fd,  "simple approach/dataset.csv",
				 h5::max_dims{H5S_UNLIMITED}, h5::chunk{10} | h5::gzip{9} );
	// an `h5::ds_t` handle seamlessly casts to an `h5::pt_t` packet table handle; this could have been done in a single step,
	// but we need the `h5::ds_t` handle to add attributes
	h5::pt_t pt = ds;
	// attributes may be added to `h5::ds_t` handle
	ds["data set"] = "monroe-county-crash-data2003-to-2015.csv";
	ds["csv parser"] = "https://github.com/ben-strasser/fast-cpp-csv-parser"; // thank you!

	constexpr unsigned N_COLS = 5;
	io::CSVReader<N_COLS> in("input.csv"); // may be fewer than the total columns in a row; we read only 5
	in.read_header(io::ignore_extra_column, "Master Record Number", "Hour", "Reported_Location","Latitude","Longitude");
	input_t row;                           // buffer to read line by line
	char* ptr;      // indirection, as `read_row` doesn't take the array directly
	while(in.read_row(row.MasterRecordNumber, row.Hour, ptr, row.Latitude, row.Longitude)){
		strncpy(row.ReportedLocation, ptr, STR_ARRAY_SIZE); // defined in struct.h
		h5::append(pt, row);
		std::cout << std::string(ptr) << "\n";
	}
	// RAII closes all allocated resources
}

#5

Thank you @Steven for the information - very helpful.


#6

No. G.



#7

In addition to what @gheber posted: a system crash must be considered an undefined state by definition; it is the integrator's responsibility to roll back to a previous known state.

You can build robust data solutions by:

  • using reliable data storage with the appropriate RAID level
  • distributed queues such as ZeroMQ, Redis, …, which allow you to replicate data
  • SCTP-based solutions, which can increase robustness with multi-path routing and independent replication

The lower you go on the network stack, the more control you get, at the cost of implementation complexity. For example, multicasting will maximise the throughput of your Ethernet-based interconnect, but is non-trivial to implement.