HDF suitability for packetized data

Hello everyone,

I am new to HDF and am trying to understand whether it might be a suitable file format for my application. The data I want to store is usually written by the collecting instrument to plain binary files of concatenated packets (think C structures), each of which contains a header with a time stamp, packet format, packet identifier, and packet size, followed by the data itself (arrays) and associated metadata. There are tens of packet types that may arrive in any order, and they are usually written to the file sequentially. Packets contain from 10 to 100 fields, some of which may be arrays of various sizes.
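
For concreteness, a header along these lines is what I mean (a made-up C sketch; the field names and widths are illustrative, not any particular vendor's layout):

#include <stdint.h>

/* Hypothetical packet header; real vendor formats differ in field
   order, widths, and endianness. */
typedef struct {
    double   timestamp;  /* acquisition time, e.g. seconds since epoch */
    uint16_t format;     /* packet format/version */
    uint16_t id;         /* packet type identifier (one of tens of types) */
    uint32_t size;       /* total packet size in bytes, header included */
} packet_header_t;
/* ...followed by (size - sizeof(packet_header_t)) bytes of payload:
   10-100 fields, some of them variable-length arrays. */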

This format allows one to index a file relatively quickly by passing through it and parsing only these headers. One can then use the index to pull subsets of the data in a non-linear fashion, sometimes simultaneously in multiple threads, for quite fast reading. The problem is that every instrument manufacturer has their own method of encoding packets, and a single format is needed for archival purposes.
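
The indexing pass I have in mind is essentially this (a sketch reusing the hypothetical header above; no error handling, and it assumes the file's byte order matches the host's):

#include <stdint.h>
#include <stdio.h>

typedef struct {             /* same hypothetical header as above */
    double   timestamp;
    uint16_t format;
    uint16_t id;
    uint32_t size;
} packet_header_t;

typedef struct {             /* one index entry per packet */
    double   timestamp;
    uint16_t id;
    long     offset;         /* byte offset of the packet in the file */
} index_entry_t;

/* Scan the file, reading only headers and seeking past payloads. */
long build_index(FILE *fp, index_entry_t *idx, long max_entries)
{
    packet_header_t hdr;
    long n = 0;
    while (n < max_entries && fread(&hdr, sizeof hdr, 1, fp) == 1) {
        idx[n].timestamp = hdr.timestamp;
        idx[n].id        = hdr.id;
        idx[n].offset    = ftell(fp) - (long)sizeof hdr;
        n++;
        /* skip the payload to reach the next header */
        if (fseek(fp, (long)hdr.size - (long)sizeof hdr, SEEK_CUR) != 0)
            break;
    }
    return n;   /* number of packets indexed */
}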

My question is: how might a similar model be implemented in HDF5 so that the same kind of indexing and parallel data retrieval is possible? What I want to avoid is having to read through a file sequentially to get to the fields I need to extract.

It seems like HDF5 should handle this kind of thing well, but because I am inexperienced, and because most folks using it seem to store relatively small numbers of very large arrays (imagery in many cases) rather than large numbers of small records with modest fields and arrays, it is not clear to me how such an implementation might perform. So I guess I'm also asking: what is the relative penalty for writing lots of small sets of data?

I hope this makes sense.

Thanks in advance,

Val

------------------------------------------------------
Val Schmidt
CCOM/JHC
University of New Hampshire
Chase Ocean Engineering Lab
24 Colovos Road
Durham, NH 03824
e: vschmidt [AT] ccom.unh.edu
m: 614.286.3726

···

I'm doing something similar to what you are looking at. I have data coming in from multiple instruments, which goes through processing and results in one or several C# structures/arrays. In my case, each instrument type has a structure containing Packet Tables with associated time axes/scales. The packet table structure mimics the instrument data structures.
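
Roughly, in HDF5's high-level packet table API (H5PT), that looks like the following - a C sketch rather than our actual C#, with an invented record type:

#include "hdf5.h"
#include "hdf5_hl.h"

/* Record type mirroring a (made-up) instrument packet. */
typedef struct {
    double time;         /* time stamp for the record */
    float  depth;
    float  temperature;
} nav_record_t;

int main(void)
{
    hid_t file = H5Fcreate("instrument.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* Compound datatype matching the C struct, member by member. */
    hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(nav_record_t));
    H5Tinsert(rtype, "time", HOFFSET(nav_record_t, time), H5T_NATIVE_DOUBLE);
    H5Tinsert(rtype, "depth", HOFFSET(nav_record_t, depth), H5T_NATIVE_FLOAT);
    H5Tinsert(rtype, "temperature", HOFFSET(nav_record_t, temperature),
              H5T_NATIVE_FLOAT);

    /* Append-only packet table, chunked 512 records at a time,
       no compression (-1). */
    hid_t pt = H5PTcreate_fl(file, "nav_packets", rtype, 512, -1);

    nav_record_t rec = { 1297431600.0, 42.0f, 8.5f };
    H5PTappend(pt, 1, &rec);   /* one record per call, or many at once */

    H5PTclose(pt);
    H5Tclose(rtype);
    H5Fclose(file);
    return 0;
}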

Metadata is held in Attributes and other Packet Tables. I've created a standard across the program, with specifics defined for each instrument.

I end up storing each instrument's data in its own file. In most cases a single thread processes and stores the data, so I don't have to worry about synchronization (as much).

I believe you'll want to store each data type in its own dataset or file, both to make searching by data type possible and to sidestep issues with varying data lengths. How are you expecting to search?

In my case, we allow users to 'play back' the data. I keep the time scale as a separate dataset so I can do random-access lookups without having to load large data records to find a specific time.
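
The lookup is along these lines (a C sketch against the HDF5 1.8 APIs; the dataset names are just examples, and the time axis is assumed to be sorted):

#include <stdlib.h>
#include "hdf5.h"
#include "hdf5_hl.h"

/* Binary-search a sorted time axis for the first index >= target. */
static hsize_t find_time(const double *times, hsize_t n, double target)
{
    hsize_t lo = 0, hi = n;
    while (lo < hi) {
        hsize_t mid = lo + (hi - lo) / 2;
        if (times[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;   /* a real version would check lo < n */
}

/* Read only the record nearest the requested time. */
void read_record_at(hid_t file, double target, void *record_out)
{
    hsize_t dims[1];
    H5LTget_dataset_info(file, "/nav_time", dims, NULL, NULL);

    double *times = malloc(dims[0] * sizeof(double));
    H5LTread_dataset_double(file, "/nav_time", times);  /* times only */

    hsize_t i = find_time(times, dims[0], target);

    hid_t pt = H5PTopen(file, "/nav_packets");
    H5PTread_packets(pt, i, 1, record_out);  /* random access, one record */
    H5PTclose(pt);
    free(times);
}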

Scott

···


Your question is a good one.

I would need to be able to pull a full record (or set of records) within a set of time bounds.
I would need to be able to pull some field from all records for all times - as a time series.
I might need to be able to pull all the fields within some field range for all times.

I'm thinking of something similar to what you have done (I think) - that is, to self-index the file. The index would be in its own dataset with an array of time records and perhaps a few other fields and relative links (I forget what HDF5 calls them) to the actual data records.
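
I think 'object references' (the H5R interface) may be the name I'm reaching for. Something like this, with made-up dataset paths, is what I'm picturing:

#include "hdf5.h"

/* One index entry: a time stamp plus a reference to the packet's dataset.
   When written out, the ref member maps to the H5T_STD_REF_OBJ datatype. */
typedef struct {
    double     time;
    hobj_ref_t ref;   /* HDF5 object reference to the data record */
} index_entry_t;

/* Build one entry pointing at a (hypothetical) per-packet dataset. */
void make_entry(hid_t file, double t, const char *path, index_entry_t *e)
{
    e->time = t;
    H5Rcreate(&e->ref, file, path, H5R_OBJECT, -1);
}

/* A reader can later follow the reference without knowing the path. */
hid_t open_referenced(hid_t file, const index_entry_t *e)
{
    return H5Rdereference(file, H5R_OBJECT, &e->ref);  /* dataset id */
}

/* e.g. make_entry(file, 1297431600.0, "/packets/ping_000123", &entry); */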

-Val

···


The first search is pretty straightforward. My link is pretty simple: there's a 1:1 correspondence between the line numbers in the time scale and the dataset.

The two other searches have to be brute-forced from within the Packet Table interface (H5PT) by iterating over each line to pull just the individual field(s). There may be a better way through the dataset interface (H5D). I've stuck with the PT interface because I generally grab the whole dataset, and it simplifies the process of adding new data.
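
The better H5D way I have in mind, I believe, is a partial read of the compound dataset: hand H5Dread a memory datatype containing only the member you want, and the library extracts just that field from every record. A sketch (the field and dataset names are invented):

#include <stdlib.h>
#include "hdf5.h"

/* Read a single member ("temperature") from every record of a
   compound dataset, without loading the other fields. */
double *read_field(hid_t file, const char *dset_name, hsize_t *n_out)
{
    hid_t dset  = H5Dopen2(file, dset_name, H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);
    hsize_t n;
    H5Sget_simple_extent_dims(space, &n, NULL);

    /* Memory type with just the one member; HDF5 matches it by name. */
    hid_t memtype = H5Tcreate(H5T_COMPOUND, sizeof(double));
    H5Tinsert(memtype, "temperature", 0, H5T_NATIVE_DOUBLE);

    double *buf = malloc(n * sizeof(double));
    H5Dread(dset, memtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Tclose(memtype);
    H5Sclose(space);
    H5Dclose(dset);
    *n_out = n;
    return buf;   /* caller frees; a time series of one field */
}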

Scott

···



We're not completely there yet, but the 1.10.0 release will have improvements to how datasets are stored that will speed up the ingest/write side of things. We also have some projects in the works to add indexing to HDF5 files, but they are very early still and probably won't be available in the 1.10.0 release.

Quincey

···


Hm.
This brings up a related question. I like the hierarchical structure of HDF files and the file-system-like organization it brings. But it raises the question: can you do very fast queries using object names? For example, can you use wildcards the way you might in a file system to pull data (e.g., /root/group/packet-*)?
-Val

···

On Feb 11, 2011, at 11:54 AM, Mitchell, Scott - IS wrote:

The first search is pretty straight forward. My link is pretty simple, there's a 1:1 correspondence between the line numbers in the time scale & the dataset.

The two other searches have to be brute forced from within the Packet Table interface (H5PT) by iterating each line to just pull the individual field(s). There may be a better way from the dataset (H5D). I've stuck with the PT interface because I generally grab the whole dataset and it simplifies the process of adding new data.

Scott

-----Original Message-----
From: hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-bounces@hdfgroup.org]
On Behalf Of Val Schmidt
Sent: Friday, February 11, 2011 11:28 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] hdf suitability for packetized data

Your question is a good one.

I would need to be able to pull a full record (or set of records) within a set
of time bounds.
I would need to be able to pull some field from all records for all times - as
a time series.
I might need to be able to pull all the fields within some field range for all
times.

I'm thinking of something similar to what you have done (I think) - that is,
to self-index the file. The index would be in it's own dataset with an array
of time records and perhaps a few other fields and relative links (I forget
what HDF5 calls them) to the actual data records.

-Val

On Feb 11, 2011, at 10:56 AM, Mitchell, Scott - IS wrote:

I'm doing something similar to what you are looking at. I have data coming

in from multiple instruments which go through processing and result in one or
several C# structures/arrays. In my example each instrument type has a
structure containing Packet Tables with associated time axes/scales. The
packet table structure mimics the instrument data structures.

Metadata is held in Attributes and other Packet Tables. I've created a

standard across the program, with specifics defined for each instrument.

I end up storing each individual instrument's data in its own file. In most

cases, a single thread processes and stores data, so I don't have to worry
about synchronization (as much).

I believe you'll want to store each data type in its own dataset or file.

For the ability to search by data type and data length issues. How are you
expecting to search?

In my case, we allow users to 'play back' the data. I have the time scale as

a separate dataset so I can do random access lookups without having to load
large data records to find a specific time.

Scott

-----Original Message-----
From: hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-

bounces@hdfgroup.org]

On Behalf Of Val Schmidt
Sent: Thursday, February 10, 2011 6:10 PM
To: hdf-forum@hdfgroup.org
Subject: [Hdf-forum] hdf suitability for packetized data

Hello everyone,

I am new to HDF and am trying to understand whether or not it might be a
suitable file format for my application. The data I'm interested to store

is

usually written by the collecting instrument to basic binary files of
concatenated packets (think c structures), each of which contains a header
with a time stamp, packet format, packet identifier, and packet size

followed

by the data itself (arrays) and associated metadata. There are 10's of

types

of packets that may come in any order and they are usually written to the

file

sequentially. Packets contain from 10-100 fields, some of which may be

arrays

of data of various sizes.

This format allows one to relatively quickly index a file by passing

through

the file and parsing only these headers. Then one can use the index to pull
subsets of the data in a non-linear fashion, sometimes simultaneously in
multiple threads for quite fast reading. The problem is that every

instrument

manufacturer has their own method of encoding packets and a single format

is

needed for archival purposes.

My question to you is how might a similar model be implemented in HDF5 such
that the same kind of indexing and parallel data retrieval is possible?

What

is to be avoided is the need to read through a file sequentially to get to

the

fields to extract.

It seems like HDF5 should handle this kind of thing well, but because I am
inexperienced and because most folks using it seem to be storing relatively
small numbers of very large arrays (imagery in many cases), rather than
relatively large numbers of smaller numbers of fields and smaller arrays,

it

is not clear to me how such an implementation might perform. So I guess I'm
also asking, what is the relative penalty for writing lots of small sets of
data?

I hope this makes sense.

Thanks in advance,

Val
------------------------------------------------------
Val Schmidt
CCOM/JHC
University of New Hampshire
Chase Ocean Engineering Lab
24 Colovos Road
Durham, NH 03824
e: vschmidt [AT] ccom.unh.edu
m: 614.286.3726

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

This e-mail and any files transmitted with it may be proprietary and are

intended solely for the use of the individual or entity to whom they are
addressed. If you have received this e-mail in error please notify the sender.

Please note that any views or opinions presented in this e-mail are solely

those of the author and do not necessarily represent those of ITT Corporation.
The recipient should check this e-mail and any attachments for the presence of
viruses. ITT accepts no liability for any damage caused by any virus
transmitted by this e-mail.

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

------------------------------------------------------
Val Schmidt
CCOM/JHC
University of New Hampshire
Chase Ocean Engineering Lab
24 Colovos Road
Durham, NH 03824
e: vschmidt [AT] ccom.unh.edu
m: 614.286.3726

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

------------------------------------------------------
Val Schmidt
CCOM/JHC
University of New Hampshire
Chase Ocean Engineering Lab
24 Colovos Road
Durham, NH 03824
e: vschmidt [AT] ccom.unh.edu
m: 614.286.3726

Hi Val,


We don't have wildcards per se, but you should be able to use link iteration (H5Literate or H5Lvisit) to achieve the same result.
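
For example (a quick sketch using POSIX fnmatch for the wildcard matching; the group path is taken from your example):

#include <fnmatch.h>
#include <stdio.h>
#include "hdf5.h"

/* Callback: called once per link; op_data carries the shell-style pattern. */
static herr_t match_name(hid_t group, const char *name,
                         const H5L_info_t *info, void *op_data)
{
    if (fnmatch((const char *)op_data, name, 0) == 0)
        printf("matched: %s\n", name);
    return 0;   /* 0 = keep iterating; >0 stops early */
}

/* Visit every link in /root/group and report names matching "packet-*". */
void find_packets(hid_t file)
{
    hid_t grp = H5Gopen2(file, "/root/group", H5P_DEFAULT);
    hsize_t idx = 0;
    H5Literate(grp, H5_INDEX_NAME, H5_ITER_NATIVE, &idx,
               match_name, (void *)"packet-*");
    H5Gclose(grp);
}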

Quincey
