HDF5 file structure for data logging

Hello, I'm really new to HDF5 and I'm trying to understand what would be
the best structure for "real time" logging of experimental data.

In more detail: I have a real-time acquisition system that collects data
from multiple sources at a constant rate. Data logging is currently done
by putting the data into a ring buffer and having a non-real-time consumer
process take the data from the ring buffer and write it to an ASCII file.
The file is rotated after a certain number of samples have been written.

Loading the data from the ASCII files is, however, slow, and ASCII files
are not the best option for storing metadata. Therefore I would like to
switch to HDF5 as the storage format. However, it is not clear to me what
would be the most suitable structure for my files.

I imagine I need a structure to which I can append data, but it is not
clear to me whether I should handle my data as a collection of columns
(one for each channel, plus one for the timestamp), as a collection of
records (one record for each acquisition cycle), or whether a table-like
structure would be better.

Thank you very much for any hint on this topic. Best regards,


--
Daniele

Greetings Daniele,

The HDF Group has produced something called the Packet Table, which was designed for a similar purpose. See

http://www.hdfgroup.org/HDF5/doc/HL/H5PT_Intro.html

Perhaps this document will help to answer your question.
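
In case it helps, here is roughly what the packet table calls look like in C.
This is only a sketch: the file name, dataset name, chunk size and record
layout are placeholders, not anything specific to your system.

    #include "hdf5.h"
    #include "hdf5_hl.h"   /* high-level interfaces, including H5PT */

    int main(void)
    {
        /* one packet per acquisition cycle: a timestamp plus three channels */
        hsize_t adims[1] = {4};
        double record[4] = {0.0, 1.1, 2.2, 3.3};

        hid_t file = H5Fcreate("log.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* fixed-length packets of 4 doubles, chunks of 1024 packets, no compression */
        hid_t rec_type = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, adims);
        hid_t table    = H5PTcreate_fl(file, "samples", rec_type, 1024, -1);

        /* in your logger this append would sit in the consumer loop */
        H5PTappend(table, 1, record);

        H5PTclose(table);
        H5Tclose(rec_type);
        H5Fclose(file);
        return 0;
    }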

--dan

  
--
Daniel Kahn
Science Systems and Applications Inc.
301-867-2162

There are a number of ways to do this sort of thing. The packet tables
mentioned in another response are one option. Packet tables have the
advantage that you don't need a priori knowledge of how many records
will fit in a table. There is, however, a performance hit for this.

You also have the option of either using compound types (structures) or
scalar types (if all of your measurements are the same type, e.g. all
double-precision floating point, or whatever). There can be a
performance hit with compound types relative to scalars as well.

I think that in terms of performance the options are roughly as follows,
from lowest overhead to highest (it would be a good idea to thoroughly
test a number of different options to see which one gives you the best
balance of performance and capability):
1) pre-sized dataset with scalar types
2) pre-sized dataset with compound types
3) packet table with scalar types
4) packet table with compound types

In my own work, I use pre-sized datasets with compound types. Our data
are not homogeneous, so the compound types are pretty much required.
The pre-sized datasets afford us the ability to index directly into the
data on read and write, which is of particular value on the read side,
where knowing exactly where to find the data you want is highly valuable.

If, on the other hand, you just need to store data that you're never
going to do partial reads of, packet tables are easier to implement.
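
To make the compound-type option more concrete, below is a rough sketch of
what I mean in C. The record layout, dataset name and sizes are made up for
the example; they are not meant to match your channels.

    #include "hdf5.h"

    typedef struct {
        double timestamp;
        double ch[3];               /* three channels, purely as an example */
    } sample_t;

    int main(void)
    {
        hsize_t nrecords = 100000;  /* pre-sized: total record count known up front */
        hsize_t dims[1]  = {nrecords};
        hsize_t adims[1] = {3};

        hid_t file = H5Fcreate("log.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* describe one record as a compound type */
        hid_t ch_type  = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, adims);
        hid_t rec_type = H5Tcreate(H5T_COMPOUND, sizeof(sample_t));
        H5Tinsert(rec_type, "timestamp", HOFFSET(sample_t, timestamp), H5T_NATIVE_DOUBLE);
        H5Tinsert(rec_type, "channels",  HOFFSET(sample_t, ch), ch_type);

        /* fixed-size one-dimensional dataset of records */
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "samples", rec_type, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* write one record at a known index through a hyperslab selection */
        sample_t rec = {0.0, {1.0, 2.0, 3.0}};
        hsize_t start[1] = {42}, count[1] = {1};
        hid_t mspace = H5Screate_simple(1, count, NULL);
        H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, rec_type, mspace, space, H5P_DEFAULT, &rec);

        H5Sclose(mspace); H5Sclose(space);
        H5Dclose(dset); H5Tclose(rec_type); H5Tclose(ch_type);
        H5Fclose(file);
        return 0;
    }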


Dear John,

thank you very much for your hints. Partial reads and absolute
performance are not critical for my application. I think I'll give
packet tables a try. On the other hand, easily accessing the data from
Python and Matlab is a requirement.

Are packet tables (or the underlying structures) accessible in an easy
way from Matlab? Python has the h5py library, which, if it is not
already capable of handling packet tables, should not be too difficult
to extend.

A while back I investigated the kind of data structure used by PyTables,
which offers the kind of functionality I'm interested in. From my
approximate knowledge of HDF5, I understood that it uses a "custom"
version of HDF5 Tables which supports chunked writes. However, I need
compatibility with Matlab and the ability to write those files from a
C library. How does such a structure compare with packet tables?

Thank you. Cheers,


--
Daniele

Daniele,

The Packet Table interface ends up creating a standard dataset that can be read just like any other dataset. But it does simplify setup and appending data.

In fact, at a code level, you must access it as a dataset (H5D) in order to do certain things... attributes & scales come to mind.
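
For instance, attaching a units attribute to a packet table means opening it
with H5Dopen2 and going through the normal H5A calls. A sketch only; the
dataset name "samples" stands in for whatever name the packet table was
created under:

    #include <string.h>
    #include "hdf5.h"

    /* attach a units string to an existing packet table by opening it as a dataset */
    void tag_units(hid_t file, const char *units)
    {
        hid_t dset = H5Dopen2(file, "samples", H5P_DEFAULT);

        hid_t str_type = H5Tcopy(H5T_C_S1);
        H5Tset_size(str_type, strlen(units) + 1);

        hid_t space = H5Screate(H5S_SCALAR);
        hid_t attr  = H5Acreate2(dset, "units", str_type, space,
                                 H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, str_type, units);

        H5Aclose(attr); H5Sclose(space); H5Tclose(str_type); H5Dclose(dset);
    }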

S


Daniele Nicolodi wrote:

> Dear John,
>
> thank you very much for your hints. Partial reads and absolute
> performance are not critical for my application. I think I'll give
> packet tables a try. On the other hand, easily accessing the data from
> Python and Matlab is a requirement.

I don't have experience with either, but in my attempts to interface my
own work with Octave, I found that it could *only* use scalar data
types, and I suspect Matlab may have the same restriction. PyTables is
perfectly capable of using compound types, however.

> Are packet tables (or the underlying structures) accessible in an easy
> way from Matlab? Python has the h5py library, which, if it is not
> already capable of handling packet tables, should not be too difficult
> to extend.

Packet tables are ultimately just like any other dataset, as Scott Mitchell
indicated. "Packet table" is really just the name of a high-level API that
makes it simple to build extensible datasets (this can obviously be done
with the low-level libraries that the H5PT interface is built on, but H5PT
makes the implementation easier). Unless, for some reason, the Matlab
interface can't read datasets whose current dimensions differ from their
maximum dimensions, I don't see why any interface that can read an
otherwise identical dataset wouldn't be able to read them.
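
For example, something like the following should work from plain C (or the
equivalent calls in any other binding), regardless of whether the dataset
was written through H5PT. The dataset name and the one-double-per-packet
layout are assumptions for the sketch:

    #include <stdlib.h>
    #include "hdf5.h"

    /* read a packet table of doubles back as an ordinary dataset */
    double *read_all(hid_t file, hsize_t *nrecords)
    {
        hid_t dset  = H5Dopen2(file, "samples", H5P_DEFAULT);
        hid_t space = H5Dget_space(dset);

        hsize_t dims[1], maxdims[1];
        H5Sget_simple_extent_dims(space, dims, maxdims);  /* current vs. maximum extent */
        *nrecords = dims[0];

        double *buf = malloc(dims[0] * sizeof(double));
        H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Sclose(space);
        H5Dclose(dset);
        return buf;
    }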

> A while back I investigated the kind of data structure used by PyTables,
> which offers the kind of functionality I'm interested in. From my
> approximate knowledge of HDF5, I understood that it uses a "custom"
> version of HDF5 Tables which supports chunked writes. However, I need
> compatibility with Matlab and the ability to write those files from a
> C library. How does such a structure compare with packet tables?
>
> Thank you. Cheers,
  
I lack sufficient knowledge to answer that question, but I believe the
PyTables author is active on this mailing list, so maybe he'll chime in
with PyTables details. That said, with the exception of the scalar vs.
compound data type issue I already mentioned, I have a hard time seeing
how an HDF5 file written by one piece of software wouldn't be readable
by another. Whether the dataset is chunked or contiguous should be
completely transparent to the reading application. I believe that packet
tables are chunked by definition, as the library requires that extensible
datasets be chunked to allow for the extension.
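
To illustrate that last point, this is roughly how an extensible dataset is
created with the low-level API; without the H5Pset_chunk call the creation
fails, because an unlimited dataspace requires chunked storage. Names and
sizes here are arbitrary:

    #include "hdf5.h"

    /* an appendable (unlimited) dataset has to use chunked storage */
    hid_t make_extensible(hid_t file)
    {
        hsize_t dims[1]    = {0};
        hsize_t maxdims[1] = {H5S_UNLIMITED};
        hsize_t chunk[1]   = {1024};

        hid_t space = H5Screate_simple(1, dims, maxdims);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);   /* required when maxdims is unlimited */

        hid_t dset = H5Dcreate2(file, "samples", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* later, appending is H5Dset_extent() followed by a hyperslab write */
        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;
    }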