modelling tick data

Hi,

I'm trying to figure out how to best use hdf5 for my data. I've been experimenting with various options but there seem to be many, many different ways to model things and no relevant examples that I have come across.

Below I describe the data and its primary use as well as some questions about how I might most effectively model it within hdf5. I'm using the C interface and would, to the degree possible, like to use the HL interfaces. Ultimately, I will also need to access this data via Java in some cases, and I believe that my best bet is to write the storage and query code in C and then use SWIG/JNI to access it from Java. (This is based on prototyping I've done and my assessment of the current Java hdf5 interface.) Thus, using pytables doesn't seem applicable for my circumstance.

I'll appreciate any responses, insights or pointers you might provide. Thanks and best wishes for the holidays,

     Tito.

···

--

A description of the data and its use

The data is all timestamped financial streams of "tick" data. Each record is small (a few hundred bytes at the most), but there are many - in a day you may see many hundred million to a few billion. Each record is naturally partitioned by instrument (e.g., "microsoft", "ibm", "dec crude", etc.). There are fewer than 30K instruments in the universe I might care about.

I (more or less) don't care how long it takes to construct the h5 files/structures, as it will be performed offline, and the only critical query I care about is something like:

"Get ticks for instruments {i1...in} from time t1 to time t2 ordered by time, instr".

That is, I need to be able to "replay" a subset of the instruments within the data store over some period of time. But I really care that this be as fast as possible.

Questions

0. Am I barking up the wrong tree? Is HDF5 an appropriate technology for the use I've described?

1. Given the size/volume of the data, my thought is to partition h5 files by day. Uncompressed, the files will be on the order of ~25G. Does this sound reasonable? What are the key factors impacting this decision from an hdf5 perspective?

Two alternative models come immediately to mind: one big table (OBT) per day ordered by instrument and then time, or one table per instrument (OTPI) ordered by time. My current inclination is OTPI, as it seems more manageable, assuming the overhead of so many tables isn't an issue.

2a. Are there other, better models you suggest I investigate?

2b. With the OBT, I'd need to be able to "index into" the table to identify the beginning of each instrument's section (at least). How would you recommend doing this? It seems possible to do this with references or perhaps a separate table with numerical indices into the main table. Any pros/cons/alternatives to these approaches?

2c. With the OTPI, I'd need to have many tables (at most ~30K) per file. Is this an issue?

2d. For both models, I'd need to be able to merge sorted sets of h5 data into one sorted set as quickly as possible. Is there any hdf5 support for doing such a thing or external libraries created for this purpose?

3. What impact on retrieval/querying should I expect to see with varying levels of compression?

4. Any suggestions on chunksizes for this application?

Many thanks for any insights you might provide!

I've been writing a Microsoft Visual C++ Automated Trading Program. I've
used HDF5 for the data storage component. I use chunked storage to
facilitate file compression/decompression. I've used the built-in HDF5
ability to break one large file into separate chunks. I use the HDF5
built-in directory to segregate data by symbol and day. By using the
built-in ability to link files, 'virtual directories' can be built up.
I've been collecting quote and trade data.

Record structures created in C++ can easily be mapped into HDF5. I use
the boost::date library for high-resolution datetime stamps.
Using some customized STL concept code, I can use b-tree searches on my data
for selecting datetime ranges. The ranges get read into memory structures
for further processing and organization. I store instruments by day.
Higher level code needs to worry about aggregating multiple days, if that is
needed.
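
In HDF5 C API terms (which is what Tito is targeting), that kind of mapping might look roughly like the sketch below. The record layout, field names and file/dataset names are made up for illustration and are not the actual structures used here; the point is just that a C struct becomes a compound type, and a chunked layout is what makes the compression filters possible.

    #include <hdf5.h>
    #include <stdint.h>

    /* Hypothetical tick record; a real one would mirror the feed's fields. */
    typedef struct {
        int64_t  ts;     /* timestamp, e.g. microseconds since the epoch */
        double   price;
        uint32_t size;
        char     side;   /* 'B' or 'S' */
    } Tick;

    int main(void)
    {
        hid_t file = H5Fcreate("MSFT-20081223.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* Describe the C struct as an HDF5 compound type, member by member. */
        hid_t ttype = H5Tcreate(H5T_COMPOUND, sizeof(Tick));
        H5Tinsert(ttype, "ts",    HOFFSET(Tick, ts),    H5T_NATIVE_INT64);
        H5Tinsert(ttype, "price", HOFFSET(Tick, price), H5T_NATIVE_DOUBLE);
        H5Tinsert(ttype, "size",  HOFFSET(Tick, size),  H5T_NATIVE_UINT32);
        H5Tinsert(ttype, "side",  HOFFSET(Tick, side),  H5T_NATIVE_CHAR);

        /* 1-D, initially empty, extendible dataspace for appending records. */
        hsize_t dims = 0, maxdims = H5S_UNLIMITED;
        hid_t space = H5Screate_simple(1, &dims, &maxdims);

        /* Chunking is required for extendible and compressed datasets;
           shuffle + deflate helps when successive values differ only in
           the low-order bytes. */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        hsize_t chunk = 16384;                 /* records per chunk */
        H5Pset_chunk(dcpl, 1, &chunk);
        H5Pset_shuffle(dcpl);
        H5Pset_deflate(dcpl, 5);

        hid_t dset = H5Dcreate2(file, "ticks", ttype, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* ... H5Dset_extent() + H5Dwrite() to append blocks of Tick ... */

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
        H5Tclose(ttype); H5Fclose(file);
        return 0;
    }

The "virtual directory" linking mentioned above corresponds to HDF5 groups plus external links (H5Lcreate_external()), which make an object stored in another file appear under a local path.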

My scale of data collection is nowhere near as extensive as yours is or
might be. But I think that with appropriate tuning and clever
programming, you can get what you want. Just make sure that whatever you
do is compiled. When one gets several hundred thousand quotes/trades for
an instrument per day, the sheer volume of data takes a while to I/O. If
compression is used (there are a few clever concepts HDF5 can use for
when sequential values vary only slightly in the last few bytes), that
requires additional horsepower.

If you are interested in details, let me know.

Ray

···

Hi Tito,

On Tuesday, 23 December 2008, Tito Ingargiola wrote:

> Hi,
>
> I'm trying to figure out how to best use hdf5 for my data. I've been
> experimenting with various options but there seem to be many, many
> different ways to model things and no relevant examples that I have
> come across.
>
> Below I describe the data and its primary use as well as some
> questions about how I might most effectively model it within hdf5.
> I'm using the C interface and, to the degree possible, would like to
> use the HL interfaces as much as possible. Ultimately, I will also
> need to access this data via Java in some cases and believe that my
> best bet is to write the storage and query code in C and then use
> SWIG/JNI to access this via Java. (This is based on prototyping I've
> done and my assessment of the current Java hdf5 interface.) Thus,
> using pytables doesn't seem applicable for my circumstance.

Don't discard PyTables so soon ;) You could use pydap [1] to serve
PyTables files through the Data Access Protocol [2] and then use one of
the DAP adapters [3] for your preferred language on the client side.

[1] http://pydap.org/
[2] http://opendap.org/
[3] http://opendap.org/download/index.html

In order to adapt pydap better to your needs, you could even modify the
PyTables plugin for pydap (it is very easy to understand) and tailor it
to your needs.

Also, in addition to the (excellent) advice that Ray has already given
you, see my comments interspersed in your message.

> I'll appreciate any responses, insights or pointers you might
> provide. Thanks and best wishes for the holidays,
>
>      Tito.
>
> --
>
> A description of the data and its use
>
> The data is all timestamped financial streams of "tick" data. Each
> record is small (a few hundred bytes at the most), but there are many
> - in a day you may see many hundred million to a few billion. Each
> record is naturally partitioned by instrument (eg, "microsoft",
> "ibm", "dec crude", etc). There are less than 30K instruments in the
> universe I might care about.
>
> I (more or less) don't care how long it takes to construct the h5
> files/structures as it will be performed offline and the only
> critical query I care about is something like:
>
> "Get ticks for instruments {i1...in} from time t1 to time t2 ordered
> by time, instr".
>
> That is, I need to be able to "replay" a subset of the instruments
> within the data store over some period of time. But I really care
> that this be as fast as possible.
>
> Questions
>
> 0. Am I barking up the wrong tree? Is HDF5 an appropriate
> technology for the use I've described?

In my experience, HDF5 is perfectly appropriate to cope with this. It
is, as with many other things, just a matter of properly directing it to
do what you want ;) In particular, your volume of data is going to be
very large, so you should be very careful when choosing the different
parameters for your application. Remember that experimenting is your
best friend before putting code into production.

> 1. Given the size/volume of the data, my thought is to partition h5
> files by day. Uncompressed, the files will be on the order of ~25G.
> Does this sound reasonable? What are the key factors impacting this
> decision from an hdf5 perspective?

25GB is a completely reasonable figure for a single file. Moreover, by
using compression you can reduce this even more, so I see no problem
there. However, how data is organised inside the file becomes very
important for handling it efficiently (see below).

> Two alternative models come immediately to mind: one big table (OBT)
> per day ordered by instrument and then time, or one table per
> instrument (OTPI) ordered by time. My current inclination is OTPI as
> it seems more manageable assuming the overhead of so many tables
> isn't an issue.
>
> 2a. Are there other, better models you suggest I investigate?
>
> 2b. With the OBT, I'd need to be able to "index into" the table to
> identify the beginning of each instrument's section (at least). How
> would you recommend doing this? It seems possible to do this with
> references or perhaps a separate table with numerical indices into
> the main table. Any pros/cons/alternatives to these approaches?
>
> 2c. With the OTPI, I'd need to have many tables (at most ~30K) per
> file. Is this an issue?
>
> 2d. For both models, I'd need to be able to merge sorted sets of h5
> data into one sorted set as quickly as possible. Is there any hdf5
> support for doing such a thing or external libraries created for this
> purpose?

Indeed, both OTPI and OBT have their advantages and drawbacks. OTPI,
for me, has the disadvantage of requiring a respectable number of tables
for your case (having 30000 datasets in a single file is nothing to
sneeze at), as well as requiring somewhat more complicated query code.
The OBT approach is probably simpler and better in that HDF5 has to deal
with less metadata (the 30000 datasets are avoided), so it is probably
faster, but you may need additional logic in your programs to perform
fast queries (indexing will help a lot indeed). I'd recommend doing your
own experiments here.

Regarding merging sorted datasets, you can always implement a typical
merge sort for your needs. Otherwise, you may want to use the sorting
capabilities of PyTables Pro that can handle arbitrarily large tables.
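
For what it's worth, the "typical merge sort" suggested here usually takes the form of a k-way merge driven by a small min-heap keyed on the timestamp. A bare-bones, purely illustrative sketch follows (plain C, no HDF5 calls): each input stream is just an in-memory array of timestamps, but in a real program each stream would be a cursor that refills its buffer from H5Dread whenever it runs dry.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* One sorted input stream; in practice backed by an HDF5 dataset. */
    typedef struct {
        const int64_t *ts;   /* sorted timestamps */
        size_t         n;    /* number of records */
        size_t         pos;  /* next record to emit */
        int            id;   /* instrument / stream id */
    } Stream;

    /* Restore the min-heap property at index i (heap of Stream pointers,
       ordered by each stream's next timestamp). */
    static void sift_down(Stream **h, size_t n, size_t i)
    {
        for (;;) {
            size_t l = 2 * i + 1, r = l + 1, m = i;
            if (l < n && h[l]->ts[h[l]->pos] < h[m]->ts[h[m]->pos]) m = l;
            if (r < n && h[r]->ts[h[r]->pos] < h[m]->ts[h[m]->pos]) m = r;
            if (m == i) return;
            Stream *tmp = h[i]; h[i] = h[m]; h[m] = tmp;
            i = m;
        }
    }

    /* Emit all records from the k streams in global timestamp order. */
    static void kway_merge(Stream **heap, size_t k)
    {
        for (size_t i = k; i-- > 0; )          /* heapify */
            sift_down(heap, k, i);

        while (k > 0) {
            Stream *s = heap[0];               /* smallest next timestamp */
            printf("stream %d  ts %lld\n", s->id, (long long)s->ts[s->pos]);
            s->pos++;
            if (s->pos == s->n)                /* stream exhausted */
                heap[0] = heap[--k];
            sift_down(heap, k, 0);
        }
    }

    int main(void)
    {
        int64_t a[] = {1, 4, 9}, b[] = {2, 3, 10}, c[] = {5, 6, 7};
        Stream s0 = {a, 3, 0, 0}, s1 = {b, 3, 0, 1}, s2 = {c, 3, 0, 2};
        Stream *heap[] = {&s0, &s1, &s2};
        kway_merge(heap, 3);   /* prints the nine ticks in time order */
        return 0;
    }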

> 3. What impact on retrieval/querying should I expect to see with
> varying levels of compression?

See:

http://www.pytables.org/docs/manual/ch05.html#searchOptim

for some experiments that I've recently done on this. They are meant
for PyTables, but you could find interesting hints for your case too.

> 4. Any suggestions on chunksizes for this application?

See:

http://www.pytables.org/docs/manual/ch05.html#chunksizeFineTune

for more experiments in that regard.

> Many thanks for any insights you might provide!

Hope that helps, and Merry Christmas!

···

--
Francesc Alted

Hi Ray & Francesc,

Thank you both very much for your informative responses! Although each of you champions a different approach, both provide helpful ideas - thank you. (Indeed, the fact that both approaches seem workable is informative on its own!) I also agree that experimentation here is king, but I want to balance reasonable experimentation with not reinventing the wheel when so many are already zipping along...

Francesc, I had already read these excellent docs you have written and am very impressed by the work you've done with pytables. It sounds like you're leery of the overhead of an OTPI approach, and I can see why. A couple of further questions for you about indexing into the OBT: while my data layout, column/field-wise, looks something like { contractID, dateTime, ... }, with an OBT approach I think I would have to add a field for performing a binary search within a contract, { index, contractID, dateTime, ... }, and would additionally need to have an external table indexing into the OBT, identifying where each new contract begins (and perhaps more). Does this sound right to you? Do you have any suggestions on how to "point into" the OBT?

Many thanks for your help and best wishes for a Merry Xmas and Happy Holidays,

    Tito.

···

Yes, that's a perfectly good way to 'index' data for the OBT approach,
IMO. In order to 'point into' the OBT you may want to use either just
start and end rows, or HDF5 references to table hyperslabs (ranges of
rows in this case) - whichever approach you find more comfortable for
your needs.

Of course, you will need to re-compute your indexes after every table
merge. Having an integrated indexing engine would make things a lot
easier, but the above approach is doable anyway.

At any rate, it would be really interesting to know which approach (OTPI
vs OBT) finally works best for you.

Cheers,

Francesc

···

Some clarifications to earlier messages, and a note on the content of the
above.

I made a comment that hdf5 files may need to be kept small for backup
and archival purposes. I also mentioned that hdf5 files can become
corrupt if a program crashes. One clarification on this point: if the
program has written to the file and crashes before closing the file,
then corruption may occur, as internal file structures may not have been
updated properly (even after having used the flush() API). I usually run
my known-good update programs, exit them, then run my experimental
scanners and such. If a program crashes after having only read from the
file, I find that no corruption occurs.

In some of my earlier notes, I was mentioning directories and files. I
meant to say HDF5 groups and datasets. That is to say, I've been using
One Big File holding One Dataset Per Instrument, and using groups and
links for organizing the data within the file. One dataset typically
stores data for one contract/instrument of a specific data type, for
instance, ticks for GOOG. Rather than use One Big Dataset for all ticks
for GOOG, I'll save the data into one dataset per day of ticks. It makes
browsing and grouping data easier with a group viewer. The ticks are
stored in each dataset in arrival order with the arrival datestamp on
each record. For those people with precise exchange timestamps, those
could be used instead. If I wish to use a subset of the data, I simply
perform a binary search on the record datetime stamp to find the record
numbers of the start and end records (basically file reads of one record
at a time, with log(n) search time). The C++ Standard Template Library
has a search algorithm already defined; one just needs to provide a
customized STL random-access iterator to access the data, which is the
mechanism I used. Once the start and end record numbers are determined,
the HDF5 block read APIs can be used to read data into memory (which is
what I do), or the data can be read incrementally on demand, which is
what would need to be done when merging (min-heap record sorting) on the
fly in memory-constrained situations.
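
The same one-record-at-a-time lookup can also be written directly against the HDF5 C API by selecting single-element hyperslabs. Reading only the timestamp works because H5Dread matches compound members by name, so a memory type containing just that field pulls it out of each on-disk record. The field name "ts" and the int64 timestamp encoding are assumptions for this sketch, not the layout actually used here.

    #include <hdf5.h>
    #include <stdint.h>

    /* Read only the "ts" member of record 'row' from a 1-D compound dataset. */
    static int64_t read_ts(hid_t dset, hid_t ts_memtype, hsize_t row)
    {
        int64_t ts;
        hsize_t count = 1;
        hid_t fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &row, NULL, &count, NULL);
        hid_t mspace = H5Screate_simple(1, &count, NULL);
        H5Dread(dset, ts_memtype, mspace, fspace, H5P_DEFAULT, &ts);
        H5Sclose(mspace); H5Sclose(fspace);
        return ts;
    }

    /* First row whose timestamp is >= t0, found in O(log n) single-record
       reads (a lower-bound binary search over the sorted dataset). */
    hsize_t find_first_at_or_after(hid_t dset, int64_t t0)
    {
        /* Memory type holding just the timestamp field; HDF5 extracts it
           from the on-disk compound records by field name. */
        hid_t ts_memtype = H5Tcreate(H5T_COMPOUND, sizeof(int64_t));
        H5Tinsert(ts_memtype, "ts", 0, H5T_NATIVE_INT64);

        hid_t space = H5Dget_space(dset);
        hsize_t n;
        H5Sget_simple_extent_dims(space, &n, NULL);
        H5Sclose(space);

        hsize_t lo = 0, hi = n;               /* search in [lo, hi) */
        while (lo < hi) {
            hsize_t mid = lo + (hi - lo) / 2;
            if (read_ts(dset, ts_memtype, mid) < t0)
                lo = mid + 1;
            else
                hi = mid;
        }
        H5Tclose(ts_memtype);
        return lo;                /* == n if every record precedes t0 */
    }

The end of a [t1, t2] range can be located the same way with a strict greater-than comparison, and the rows in between then read with one block (hyperslab) read.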

In the end, any number of approaches can be used (based upon group and
dataset design):
* use the intrinsic sorted keys of the data to perform in-place lookups and
searches
* maintain separate manually computed and stored indexes for indexing into
the datasets

···

Yeah, checking your updated files frequently is a good practice.
Moreover, IIRC someone from the HDF5 crew has said in this forum that
they are working on increasing the reliability of HDF5 files (journaling
and also being able to recover corrupt files). Hope they can deliver
these exciting developments sooner rather than later ;)

That's a good point about using the data's intrinsic sort order for
in-place lookups. However, although binary searches are very effective
when your lookups are made against storage with low latency, in the
(very common) case that you are using high-latency storage (for example,
regular spinning disks), this can slow down your lookups quite a lot.
Even if you are using reduced-latency storage (read: solid state disks),
the use of compression can easily become the bottleneck, because many
data chunks have to be decompressed before the binary search completes.

So, if your goal is to ensure really short lookups, you have many
options, but it mainly boils down to:

1. Use manually computed indexes specifically tailored for quickly
locating the interesting information.

2. Keep your datasets small, sorted and uncompressed, so that all the
data can be loaded in memory and binary searches can be performed
against such low-latency media.

3. Use efficient, automatic indexes on the interesting fields that can
minimize the cost of the searches.

Solution 1 is generally the fastest, but it requires work to adapt it
to the user's needs. Solution 2 is somewhat slower, but it can be
adapted to more scenarios with much less work; however, it only applies
when you can keep your datasets small. Finally, solution 3 is completely
general and, when implemented efficiently, can deliver more than decent
performance even when used on high-latency storage devices.
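
As a small illustration of option 2: when a per-instrument, per-day dataset is kept small, sorted and uncompressed, the whole timestamp column can be pulled into memory with a single H5Dread, and the log(n) probes then hit RAM rather than the disk. The field name "ts" and the int64 encoding are again hypothetical.

    #include <hdf5.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Load the "ts" column of a small compound dataset entirely into
       memory, then binary-search it there.  Returns the first row whose
       timestamp is >= t0. */
    hsize_t lower_bound_in_memory(hid_t dset, int64_t t0)
    {
        hid_t ts_memtype = H5Tcreate(H5T_COMPOUND, sizeof(int64_t));
        H5Tinsert(ts_memtype, "ts", 0, H5T_NATIVE_INT64);

        hid_t space = H5Dget_space(dset);
        hsize_t n;
        H5Sget_simple_extent_dims(space, &n, NULL);

        /* One read brings the whole (small) column into RAM. */
        int64_t *ts = malloc((size_t)n * sizeof(int64_t));
        H5Dread(dset, ts_memtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, ts);

        hsize_t lo = 0, hi = n;
        while (lo < hi) {
            hsize_t mid = lo + (hi - lo) / 2;
            if (ts[mid] < t0) lo = mid + 1; else hi = mid;
        }

        free(ts);
        H5Sclose(space);
        H5Tclose(ts_memtype);
        return lo;
    }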

OPSI, the integrated indexing engine for PyTables Pro, follows route 3
(as many relational databases do), with fairly good results. Anyone
interested in implementing binary searches themselves on very large
tables may want to read the OPSI article [1] to find new optimization
paths for their own use case (especially if they are using the HDF5
library as a foundation, of course).

[1] http://www.pytables.org/docs/OPSI-indexes.pdf

Cheers,

···

--
Francesc Alted

Hi Ray,

Thank you for responding. It sounds like you have adopted my "OTPI" approach, no? Do you have any idea at what point it will break down? That is, is there an effective limit to the number of datasets/tables within one file? How did you determine what chunksizes to use? Could you detail your merging/sorting a little bit further? In particular, you mention putting selections into memory, but in some cases I care about, this would definitely exhaust memory, so I need something that will perform a sorted merge as it's reading from hdf5.

Thanks for your help,

     Tito.

···

Hi Tito,

Yes, I've adopted the OTPI approach, with further refinements: I do
mine per day, and further break it down by quote, trade, and market
depth. HDF5 is designed for any depth of directory tree, so make use of
that for optimizing and cataloguing the data. The attached picture shows
a simple tree to minimize directory size at any depth. This shows my
daily bar repository. Other directories contain my basket trades using
quotes and trades. Any number of directories are possible.
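
For reference, deep group trees like that can be built in a single call by enabling intermediate-group creation on a link creation property list; the "/bar/G/GO/GOOG"-style path below only illustrates the idea and is not the exact layout used here.

    #include <hdf5.h>

    /* Create a deep group path such as "/bar/G/GO/GOOG" in one call. */
    hid_t make_symbol_group(hid_t file, const char *path)
    {
        hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
        H5Pset_create_intermediate_group(lcpl, 1);  /* build missing parents */

        hid_t grp = H5Gcreate2(file, path, lcpl, H5P_DEFAULT, H5P_DEFAULT);
        H5Pclose(lcpl);
        return grp;   /* caller closes with H5Gclose() */
    }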

Chunksizes would be something to test with. I simply pulled a number out of
the hat. Mine may not be optimal, but it works for the decompression until
I can get around to optimizing things.
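
A rough starting point, rather than anything authoritative: size chunks so that one chunk's worth of records lands somewhere between a few hundred kilobytes and a few megabytes - roughly comparable to HDF5's default 1 MB per-dataset chunk cache - and then benchmark from there.

    #include <stddef.h>

    /* Rule-of-thumb chunk size: aim for about 1 MiB of records per chunk,
       so a whole chunk fits in HDF5's default 1 MB raw-data chunk cache. */
    size_t suggest_chunk_rows(size_t record_size)
    {
        const size_t target_bytes = 1u << 20;
        size_t rows = target_bytes / record_size;
        return rows ? rows : 1;   /* e.g. 64-byte ticks -> 16384 rows */
    }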

For the simulation side of things, I wrote up something at
http://www.oneunified.net/blog/OpenSource/Programming/minheap.article. I
use in-memory arrays as I have enough memory available to perform the
multi-symbol merge. In actual fact, I don't do the merge beforehand; I
do a run-time min-heap merge. This could be readily expanded to pulling
data from the files on an as-needed basis when the data does not fit
within memory.

I make use of C++ templates to handle quotes and trades with the same code.
The code is duplicated for each class type, but I prefer speed optimization
over space optimization.

I've followed a reasonable object-oriented hierarchy. CQuote, CTrade,
CBar, and CMarketDepth are the basic data types, derived from a class
called CDatedDatum. These classes define how they will store themselves
in the HDF5 files.

There is then a set of time series classes: CQuotes, CTrades, CBars,
CMarketDepths. Iterators are provided in each for retrieving and
iterating through elements based upon an integer or datetime index. The
time series can be saved to or retrieved from an HDF5 table. Appending
and overwriting are also possible.

These operations rely on separate HDF5 Container, Accessor, and Iterator
templates I've written for facilitating saving, retrieving, and
iterating through on-disk time series.

My total dataset is only around 2GB. I've noticed that some HDF5 tools,
including HDFView, don't like working with files that have been split. I
recommend against using that option (i.e. data.000.hdf5, data.001.hdf5,
where each file is of a fixed maximum size). It might be better to
handle file splitting manually or use one massively big file. But a
massively big file may present archival/backup/copying problems. If your
program is not completely crash-proof, especially during development,
hdf5 files can become corrupted, and may or may not be fixable. I'd
recommend against a single massive file. If you are dealing with 30k
instruments, then perhaps a file per day might be appropriate.

Hope this helps.

···

Hi,

Francesc & Ray, I wanted to thank you for your detailed and
excellent insights into using hdf5 for my application.

> [...] it would be really interesting to know which approach (OTPI
> vs OBT) finally works best for you.

I've spent some
time over the holidays looking into both approaches and have blogged some thoughts on it here:

http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/

and will follow up with some more detailed results within the coming week.

Many thanks for your help!

     Tito.
