Is HDF5 suitable for real-time measurements?

Dear list,

I would like to know whether HDF5 is suitable for real-time data logging.

More precisely: I work on a project in which we want to continuously record (at sampling rates ranging from 30 to 400 Hz) a fair amount of data (several hours' worth) of different natures (telemetry, signals, videos).

Data have to be written in real time (or with a small delay) so that we do not lose them in a potential crash.

Our first prototype is based on sqlite3; however, we feel that some limitations could arise from long-running usage: speed, the one-database-per-file model, and difficulties in accessing the database from several threads (a lock exception when reading and writing at the same time).

So I am considering using HDF5 as a back-end for data storage on disk (with numpy/PyTables for the internal representation). Do you think it is possible to update an HDF5 file at a regular interval from such a Python binding?
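To make my question concrete, here is roughly the write loop I have in mind (a sketch only; the file layout and names are made up, and I am assuming PyTables' extendable arrays are the right tool):

    import numpy as np
    import tables

    # One extendable array per signal; the first dimension grows over time.
    h5 = tables.open_file("session.h5", mode="w")
    telemetry = h5.create_earray(h5.root, "telemetry",
                                 atom=tables.Float64Atom(), shape=(0, 3),
                                 expectedrows=400 * 3600 * 4)  # ~4 h at 400 Hz

    def on_new_block(block):
        # Called at a regular interval with an (n, 3) numpy array.
        telemetry.append(block)
        h5.flush()  # push buffers to disk so a crash loses little data

    on_new_block(np.random.rand(40, 3))  # e.g. 100 ms of samples at 400 Hz
    h5.close()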

Thank you very much!

Cheers,
Nicolas

Hi Nicolas,

Your question is not very precise.
What is the amount of data you have to write at the rates you mention?
Is it all numeric, or also strings? Can the data be compressed?

Note that HDF5 does not support simultaneous access by readers and
writers; that is expected in a future version.
Also note that in case of a crash some internal HDF5 metadata may not
be up to date, so data can be lost. Journaling support to address this
is being worked on.

For some of our astronomical applications we store the observational
metadata directly in an HDF5 file, while the main (numeric) data are
stored in an external file that is described in the HDF5 file and thus
can be accessed as HDF5 data.
The advantage is that no HDF5 overhead is involved, and in case of a
crash almost all data are safe.
However, you have to define the total data size beforehand.
Furthermore, the data need to be regularly shaped.
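In Python terms the idea looks roughly like this (a sketch only; we do not use Python ourselves, and the external= keyword needs a reasonably recent h5py):

    import h5py

    n = 400 * 3600              # one hour at 400 Hz, fixed beforehand
    nbytes = n * 8              # float64 samples

    with h5py.File("index.h5", "w") as f:
        # The dataset's bytes live in raw.bin, outside index.h5, so a
        # corrupted HDF5 file leaves the raw samples untouched.
        f.create_dataset("signal", shape=(n,), dtype="f8",
                         external=[("raw.bin", 0, nbytes)])
        f["signal"][:10] = 1.0  # reads/writes go straight to raw.bin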

Cheers,
Ger


Hi Nicolas

You won't have as much trouble with the frequency as with the threads:
there is no way you can write at that rate to an HDF5 file from several
threads at once. I manage my locks pretty effectively and don't get any
problems there, but under high-frequency access HDF5 still collapses at
some point. If your sources are separate processes, use one thread per
process to write into separate files and merge them periodically.
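The periodic merge can be as simple as copying datasets across files; an h5py sketch (file and dataset names are made up, and it assumes dataset names are unique across the per-process files):

    import h5py

    def merge(master_path, part_paths):
        # Copy every top-level object from each part file into the
        # master, skipping anything already merged on an earlier pass.
        with h5py.File(master_path, "a") as master:
            for path in part_paths:
                with h5py.File(path, "r") as part:
                    for name in part:
                        if name not in master:
                            part.copy(name, master)

    merge("master.h5", ["proc0.h5", "proc1.h5"])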

HTH

-- dimitris


Independently of the storage format, logging to disk from a real-time thread is not a very good idea: access to file-system storage (and to mass storage in general) does not have deterministic latencies.

What is recommended in these cases is to have a real-time process or thread collect the data and store it in an in-memory ring buffer, from which it is collected by a non-real-time process and written to disk. It is fairly trivial to implement a thread-safe single-reader, single-writer ring buffer (and you can find many implementations online).
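In Python you would most likely just use the standard queue module, but to sketch the idea (a made-up, lock-based class for clarity; a real real-time implementation would avoid locks):

    import threading
    import numpy as np

    class RingBuffer:
        # Single-writer, single-reader ring buffer over a preallocated
        # array. Drops data when full instead of blocking the writer.
        def __init__(self, capacity, width):
            self.buf = np.empty((capacity, width))
            self.capacity = capacity
            self.head = 0      # next slot to write (writer side)
            self.tail = 0      # next slot to read (reader side)
            self.count = 0
            self.lock = threading.Lock()

        def put(self, row):    # called from the real-time thread
            with self.lock:
                if self.count == self.capacity:
                    return False          # full: drop, never block
                self.buf[self.head] = row
                self.head = (self.head + 1) % self.capacity
                self.count += 1
                return True

        def get(self):         # called from the disk-writer thread
            with self.lock:
                if self.count == 0:
                    return None
                row = self.buf[self.tail].copy()
                self.tail = (self.tail + 1) % self.capacity
                self.count -= 1
                return row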

Cheers,
Daniele


With this sort of mechanism, HDF5 would be well-suited to Nicolas' application. We know of users in the financial field who use HDF5 for high-throughput real-time data recording.

  Quincey


Hi Ger,

On 16/07/2012 16:09, Ger van Diepen wrote:

What is the amount of data you have to write at the rates you mention? Is it all numeric, or also strings? Can the data be compressed?

Data are collected from several driving simulators connected to the same map (so we store numeric data about speed, position, engine dynamics, etc.), from eye-trackers (numeric), and from messages sent by several embedded devices (mostly string data).

However, you have to define the total data size beforehand. Furthermore, the data need to be regularly shaped.

It is impossible to know the total size of the data in advance, as it depends on human behaviour.

You say you use metadata HDF5 containers that in turn point to the measurement data (also stored in HDF5). How do you achieve this? Is it possible to point to datasets across files?
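To make the question concrete: is it something like h5py's external links? A sketch of what I imagine (if that is the right facility):

    import h5py

    # The measurements live in their own file.
    with h5py.File("measurements.h5", "w") as m:
        m.create_dataset("run1", data=list(range(10)))

    # meta.h5 stores the metadata plus a cross-file pointer to the data.
    with h5py.File("meta.h5", "w") as meta:
        meta["run1"] = h5py.ExternalLink("measurements.h5", "/run1")

    with h5py.File("meta.h5", "r") as meta:
        print(meta["run1"][:])  # transparently reads measurements.h5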

Many thanks for everyone's answer.

Cheers,
Nicolas

You're probably best off looking at something like Cloudera's Flume,
Twitter's Storm project, or MongoDB/CouchDB for collecting your data.
These also give you a computational framework within which to process
the data.

James
