How to structure hierarchical data


#1

Hello,

Not sure if this is the right place to ask this question.

I would like to better learn how to convert my real-world problems into nicely (and usefully) structured HDF5 files.
At the moment I’m always second guessing myself, unsure if I should do something this way or that way.

So are there any recommended resources/tutorials/books on how to best model collected data into a format suitable for post-analysis with HDF5?

Maybe some useful tips and tricks or best practices to ensure captured data remains relevant into the future?

Although it shouldn’t matter, I am using a spectrometer and collecting all low-level data and diagnostic information relating to the calibration and self-test routines, from the perspective of the instrument manufacturer.


#2

Hi Steve, yes it is.

From a post-analysis perspective, one should consider how the data is accessed with respect to its quantity. With small datasets that you can easily fit in core and still do your thing, it is preferable to go with the most readable representation.

For large datasets, you should group data the way it is meant to be accessed. To give you an example, if you have:

struct my_pod_t {
   int field_1;
   char field_2[16];
};

the options are:

  1. as a stream of compound type, which makes recording straightforward and post-processing access by blocks easy and fast, but accessing only my_pod_t::field_1 will result in losing half of the disk bandwidth;
  2. two distinct datasets within a group (directory), say my_dir/field_1 and my_dir/field_2, which will allow you fast sub-field access at the cost of code complexity.

For a massive number of small, structured datasets, one should choose COMPACT storage; this is usually observed in sensor networks collecting side-band information on some expensive machinery; think of particle colliders, tokamaks, …

In the middle there are choices: the CONTIGUOUS layout, together with ARRAY and SCALAR types, gives refined representations meant to read/write a dataset in a single shot; and their linear combinations allow you to represent complex structures.

For massive-size data, you should choose the CHUNKED storage layout, as this layout allows out-of-core processing, compression, … This would be the case with recording video streams, high-resolution sensor-panel time-series samples, etc.

Datasets may be classified as homogeneous, where all elements are of the same type, or as compound datatypes. The former is often a hypercube, and to exploit existing mathematical structure you want to organise them so that slices give you meaningful vectors, matrices, possibly tensors.
Compound datatypes require you to write a type descriptor, which can be tedious.

This C++ implementation addresses the above problems: it channels data to/from major linear algebra systems, selects the optimal representation, and automatically generates the necessary compound type descriptors. Gerd Heber and I will give an online presentation on June 30th, 6:30 PM to 8:30 PM EDT; if interested, sign up here and ask away in the Q&A session.

best wishes: steven


#3

Most importantly, learn to stop worrying and accept that the perfect solution is illusory :slight_smile: It took me a while to get to that point. Clues can be found in unexpected places, e.g., in one of my favorite books: The Music of Life by Denis Noble. The story of “The French bistro omelette” is a case in point. HDF5, DNA, genome, the folly of reductionism…


#4

It does matter, in the sense that it defines your context. The general question is, “Given such and such a context, how is that context represented in HDF5, and what would consumers of the data expect to find in that context?” And, not to forget, “How does that context relate to other contexts?” I think answering this last question goes toward “future-proofing” your data: in which (future) contexts will users want to come across the by-then historical data? To establish and maintain those links is, I think, more important than the details of the data representation inside the HDF5 files/containers. Given a few clues, your future users and their agents will figure that out. Without linked contexts, it’s dumb search all over again. G.