On Thu, Aug 6, 2015 at 1:00 PM, Schneider, David A. <davidsch@slac.stanford.edu> wrote:
One current limitation of HDF5 is reading while writing: it will not be
convenient to read the data while it is being acquired over the hour.
Another limitation is robustness (I have less knowledge here, so take it
for what it is worth): if the system fails during the hour of acquisition,
it may be difficult to repair the file so you can get at the data acquired
so far. It is my understanding that these are features in the works for
HDF5. There is currently a beta version of the SWMR mode (single writer,
multiple readers), but it presently requires some coordination between the
readers and the writer, and both have to be linked against the new beta
library (so, for example, I don't think people could use Matlab to read the
data while it is being acquired, and they may not be able to read it with
Matlab after it is acquired). There is also a journaling feature I have
heard about for HDF5 which would address the robustness issue.
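For what it is worth, the beta SWMR mode is driven roughly as in the sketch
below (C, assuming the 1.10 beta API; H5Fstart_swmr_write and the
H5F_ACC_SWMR_READ flag are the names in that beta, and details may still
change before release):

/* Minimal SWMR sketch, assuming the HDF5 1.10 beta API. The writer must
 * use the new file format and explicitly switch into SWMR mode; readers
 * must also be linked against the SWMR-capable library. */
#include "hdf5.h"

hid_t open_swmr_writer(const char *path)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* SWMR requires the new (1.10) file format */
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create all datasets here, before enabling SWMR ... */
    H5Fstart_swmr_write(file);   /* from now on, readers may attach */
    H5Pclose(fapl);
    return file;
}

hid_t open_swmr_reader(const char *path)
{
    return H5Fopen(path, H5F_ACC_RDONLY | H5F_ACC_SWMR_READ, H5P_DEFAULT);
}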
best,
David
Software engineer at SLAC
________________________________________
From: Hdf-forum [hdf-forum-bounces@lists.hdfgroup.org] on behalf of
Francesc Alted [faltet@gmail.com]
Sent: Thursday, August 6, 2015 9:19 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Seeking advice on HDF5 use case
Hi Peter,
2015-08-06 16:46 GMT+02:00 Petr KLAPKA <petr.klapka@valeo.com>:
Good morning!
My name is Petr Klapka. My colleagues and I are in the process of
evaluating HDF5 as a potential file format for a data acquisition tool.
I have been working through the HDF5 tutorials and overcoming the API
learning curve. I was hoping you could offer some advice on the
suitability of HDF5 for our intended purpose and perhaps save me from
misusing the format or API.
The data being acquired are "samples" from four devices. Every ~50 ms each
device provides a sample. The sample is an array of structs. The total
size of the array varies but will average around 8 kilobytes (about 160 KB
per second per device).
The data will need to be recorded over a period of about an hour, meaning
an uncompressed file size of around 2.3 gigabytes.
I will need to "play back" these samples, as well as jump around in the
file, seeking on sample metadata and time.
My questions to you are:
* Is HDF5 intended for data sets of this size and throughput given a
high performance Windows workstation?
Indeed HDF5 is a very good option for what you are trying to do.
* What is the "correct" usage pattern for this scenario?
* Is it to use a "Group" for each device, and create a "Dataset"
for each sample? This would result in thousands of datasets in the file
per group, but I fully understand how to navigate this structure.
No, creating too many datasets will slow down your queries a lot later on.
* Or should there only be four "Datasets" that are extensible, and
each sensor "sample" be appended into the dataset?
IMO, this is the way to go. You can append each array of structs to a
dataset that is created initially empty.
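In C, that pattern looks roughly like the sketch below; the sample_t
struct, dataset name and chunk size are placeholders, and memtype would be
a compound datatype built with H5Tcreate/H5Tinsert to match your real
record:

#include "hdf5.h"

typedef struct { double t; float value; int flags; } sample_t;  /* assumed layout */

hid_t create_device_dataset(hid_t file, const char *name, hid_t memtype)
{
    hsize_t dims[1]    = {0};               /* start empty             */
    hsize_t maxdims[1] = {H5S_UNLIMITED};   /* allow unlimited growth  */
    hsize_t chunk[1]   = {4096};            /* tune to your ~8 KB samples */

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset = H5Dcreate2(file, name, memtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

void append_sample(hid_t dset, hid_t memtype, const sample_t *buf, hsize_t n)
{
    hid_t space = H5Dget_space(dset);
    hsize_t old[1];
    H5Sget_simple_extent_dims(space, old, NULL);
    H5Sclose(space);

    hsize_t newdims[1] = {old[0] + n};
    H5Dset_extent(dset, newdims);           /* grow by n records       */

    hid_t fspace = H5Dget_space(dset);
    hsize_t start[1] = {old[0]}, count[1] = {n};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, memtype, mspace, fspace, H5P_DEFAULT, buf);
    H5Sclose(mspace);
    H5Sclose(fspace);
}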
* If this is the case, can the dataset itself be searched for
specific samples by time and metadata?
If your time samples are equally binned, you could use dimension scales
for that. In general, though, HDF5 does not let you query on non-uniform
time series or other fields; you would have to do a full scan instead.
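For the equally-binned case, attaching a time dimension scale would look
roughly like this (a sketch using the H5DS high-level API; the dataset
path is a placeholder, and in practice the time dataset would be
chunked/extensible just like the sample dataset):

#include "hdf5.h"
#include "hdf5_hl.h"

/* Attach a "time" dimension scale so row index i of the sample dataset
 * can be mapped back to an acquisition time. */
void attach_time_scale(hid_t file, hid_t sample_dset,
                       const double *times, hsize_t nrows)
{
    hsize_t dims[1] = {nrows};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t tset = H5Dcreate2(file, "/device0/time", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(tset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, times);

    H5DSset_scale(tset, "time");             /* mark it as a scale      */
    H5DSattach_scale(sample_dset, tset, 0);  /* attach to dimension 0   */

    H5Sclose(space);
    H5Dclose(tset);
}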
If you want to avoid the full scan for table queries, you will need to use
3rd party apps on top of HDF5. For example, the indexing capabilities in
PyTables can help:
http://www.pytables.org/usersguide/optimization.html#indexed-searches
Also, you may want to use either Pandas or TsTables:
http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#hdf5-pytables
http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/
However, all of the above are Python packages, so I am not sure whether
they would fit your scenario.
* Or is this use case appropriate for the Table API?
The Table API is perfectly compatible with the above suggestion of using a
large dataset for storing the time series (in fact, this is the API that
PyTables uses behind the scenes).
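If you go the C route, the high-level H5TB calls look roughly like the
sketch below (the record_t layout, dataset name and chunk size are
placeholders you would replace with your actual record):

#include "hdf5.h"
#include "hdf5_hl.h"
#include <stddef.h>

typedef struct { double t; float value; int flags; } record_t;  /* assumed layout */

void make_and_append(hid_t file, const record_t *recs, hsize_t n)
{
    const char  *field_names[3]  = {"t", "value", "flags"};
    const size_t field_offset[3] = {offsetof(record_t, t),
                                    offsetof(record_t, value),
                                    offsetof(record_t, flags)};
    const size_t field_sizes[3]  = {sizeof(double), sizeof(float), sizeof(int)};
    const hid_t  field_types[3]  = {H5T_NATIVE_DOUBLE, H5T_NATIVE_FLOAT,
                                    H5T_NATIVE_INT};

    /* create an empty, chunked, extensible table for one device */
    H5TBmake_table("device 0 samples", file, "/device0", 3, 0,
                   sizeof(record_t), field_names, field_offset, field_types,
                   4096 /* chunk */, NULL, 0 /* no compression */, NULL);

    /* append one acquired sample (an array of records) */
    H5TBappend_records(file, "/device0", n, sizeof(record_t),
                       field_offset, field_sizes, recs);
}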
I will begin by prototyping the first scenario, since it is the most
straightforward to understand and implement. Please let me know your
suggestions. Many thanks!
Hope this helps,
--
Francesc Alted
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5