On Thu, Aug 6, 2015 at 1:00 PM, Schneider, David A. <davidsch@slac.stanford.edu> wrote:
One current limitation of HDF5 is reading while writing: it will not be
convenient to read the data while it is being acquired over the hour.
Another limitation is robustness (I have less knowledge here, so take it
for what it is worth): if the system fails during the hour of acquisition,
it may be difficult to repair the file so you can get at the data acquired
so far. It is my understanding that these are features in the works for
HDF5. There is currently a beta version of the SWMR mode (single writer,
multiple readers), but it presently requires some coordination between the
readers and the writer, and both have to be linked against the new beta
library (so, for example, I don't think people could use Matlab to read the
data while it is being acquired, and they may not be able to read it with
Matlab after it is acquired). There is also a journaling feature I have
heard about for HDF5 which would address the robustness issue.
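For what it is worth, the beta SWMR mode is driven roughly as in the sketch
below (C, assuming the 1.10 beta API; H5Fstart_swmr_write and the
H5F_ACC_SWMR_READ flag are the names in that beta, and details may still
change before release):

/* Minimal SWMR sketch, assuming the HDF5 1.10 beta API. The writer must
 * use the new file format and explicitly switch into SWMR mode; readers
 * must also be linked against the SWMR-capable library. */
#include "hdf5.h"

hid_t open_swmr_writer(const char *path)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* SWMR requires the new (1.10) file format */
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create all datasets here, before enabling SWMR ... */
    H5Fstart_swmr_write(file);   /* from now on, readers may attach */
    H5Pclose(fapl);
    return file;
}

hid_t open_swmr_reader(const char *path)
{
    return H5Fopen(path, H5F_ACC_RDONLY | H5F_ACC_SWMR_READ, H5P_DEFAULT);
}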
best,
David
Software engineer at SLAC
________________________________________
From: Hdf-forum [hdf-forum-bounces@lists.hdfgroup.org] on behalf of
Francesc Alted [faltet@gmail.com]
Sent: Thursday, August 6, 2015 9:19 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Seeking advice on HDF5 use case
Hi Peter,
2015-08-06 16:46 GMT+02:00 Petr KLAPKA <petr.klapka@valeo.com>:
Good morning!
My name is Petr Klapka. My colleagues and I are in the process of
evaluating HDF5 as a potential file format for a data acquisition tool.
I have been working through the HDF5 tutorials and overcoming the API
learning curve. I was hoping you could offer some advice on the
suitability of HDF5 for our intended purpose and perhaps save me from
misusing the format or API.
The data being acquired are "samples" from four devices. Every ~50 ms each
device provides a sample. The sample is an array of structs. The total
size of the array varies but will average around 8 kilobytes (about 160 KB
per second per device).
The data will need to be recorded over a period of about an hour, meaning
an uncompressed file size of around 2.3 gigabytes.
I will need to "play back" these samples, as well as jump around in the
file, seeking on sample metadata and time.
My questions to you are:
* Is HDF5 intended for data sets of this size and throughput given a
high performance Windows workstation?
Indeed HDF5 is a very good option for what you are trying to do.
* What is the "correct" usage pattern for this scenario?
* Is it to use a "Group" for each device, and create a "Dataset"
for each sample? This would result in thousands of datasets in the file
per group, but I fully understand how to navigate this structure.
No, creating too many datasets will slow down your queries a lot later on.
* Or should there only be four "Datasets" that are extensible, and
each sensor "sample" be appended into the dataset?
IMO, this is the way to go. You can append each array of structs to a
dataset that is created initially empty.
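In C, that pattern looks roughly like the sketch below; the sample_t
struct, dataset name and chunk size are placeholders, and memtype would be
a compound datatype built with H5Tcreate/H5Tinsert to match your real
record:

#include "hdf5.h"

typedef struct { double t; float value; int flags; } sample_t;  /* assumed layout */

hid_t create_device_dataset(hid_t file, const char *name, hid_t memtype)
{
    hsize_t dims[1]    = {0};               /* start empty             */
    hsize_t maxdims[1] = {H5S_UNLIMITED};   /* allow unlimited growth  */
    hsize_t chunk[1]   = {4096};            /* tune to your ~8 KB samples */

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset = H5Dcreate2(file, name, memtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

void append_sample(hid_t dset, hid_t memtype, const sample_t *buf, hsize_t n)
{
    hid_t space = H5Dget_space(dset);
    hsize_t old[1];
    H5Sget_simple_extent_dims(space, old, NULL);
    H5Sclose(space);

    hsize_t newdims[1] = {old[0] + n};
    H5Dset_extent(dset, newdims);           /* grow by n records       */

    hid_t fspace = H5Dget_space(dset);
    hsize_t start[1] = {old[0]}, count[1] = {n};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, memtype, mspace, fspace, H5P_DEFAULT, buf);
    H5Sclose(mspace);
    H5Sclose(fspace);
}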
* If this is the case, can the dataset itself be searched for
specific samples by time and metadata?
If your time samples are equally binned, you could use dimension scales
for that. In general, though, HDF5 does not let you query on non-uniform
time series or other fields; you would have to do a full scan instead.
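For the equally-binned case, attaching a time dimension scale would look
roughly like this (a sketch using the H5DS high-level API; the dataset
path is a placeholder, and in practice the time dataset would be
chunked/extensible just like the sample dataset):

#include "hdf5.h"
#include "hdf5_hl.h"

/* Attach a "time" dimension scale so row index i of the sample dataset
 * can be mapped back to an acquisition time. */
void attach_time_scale(hid_t file, hid_t sample_dset,
                       const double *times, hsize_t nrows)
{
    hsize_t dims[1] = {nrows};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t tset = H5Dcreate2(file, "/device0/time", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(tset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, times);

    H5DSset_scale(tset, "time");             /* mark it as a scale      */
    H5DSattach_scale(sample_dset, tset, 0);  /* attach to dimension 0   */

    H5Sclose(space);
    H5Dclose(tset);
}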
If you want to avoid the full scan for table queries, you will need to use
3rd party apps on top of HDF5. For example, the indexing capabilities in
PyTables can help:
http://www.pytables.org/usersguide/optimization.html#indexed-searches
Also, you may want to use either Pandas or TsTables:
http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#hdf5-pytables
http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/
However, all of the above are Python packages, so I am not sure whether
they would fit your scenario.
* Or is this use case appropriate for the Table API?
The Table API is perfectly compatible with the above suggestion of using a
large dataset for storing the time series (in fact, this is the API that
PyTables uses behind the scenes).
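If you go the C route, the high-level H5TB calls look roughly like the
sketch below (the record_t layout, dataset name and chunk size are
placeholders you would replace with your actual record):

#include "hdf5.h"
#include "hdf5_hl.h"
#include <stddef.h>

typedef struct { double t; float value; int flags; } record_t;  /* assumed layout */

void make_and_append(hid_t file, const record_t *recs, hsize_t n)
{
    const char  *field_names[3]  = {"t", "value", "flags"};
    const size_t field_offset[3] = {offsetof(record_t, t),
                                    offsetof(record_t, value),
                                    offsetof(record_t, flags)};
    const size_t field_sizes[3]  = {sizeof(double), sizeof(float), sizeof(int)};
    const hid_t  field_types[3]  = {H5T_NATIVE_DOUBLE, H5T_NATIVE_FLOAT,
                                    H5T_NATIVE_INT};

    /* create an empty, chunked, extensible table for one device */
    H5TBmake_table("device 0 samples", file, "/device0", 3, 0,
                   sizeof(record_t), field_names, field_offset, field_types,
                   4096 /* chunk */, NULL, 0 /* no compression */, NULL);

    /* append one acquired sample (an array of records) */
    H5TBappend_records(file, "/device0", n, sizeof(record_t),
                       field_offset, field_sizes, recs);
}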
I will begin by prototyping the first scenario, since it is the most
straightforward to understand and implement. Please let me know your
suggestions. Many thanks!
Hope this helps,
--
Francesc Alted
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5