Hi Tito,
Yes, I've adopted the OTPI approach with further refinements: I do mine per
day, and further split it by quote, trade, and market depth. HDF5 supports
directory trees of arbitrary depth, so make use of them for organizing and
cataloguing the data. The attached picture shows a simple tree that keeps
directory size small at any depth. It shows my daily bar repository; other
directories contain my basket trades using quotes and trades. Any number of
directories is possible.
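To give a rough idea, here is a minimal sketch of building such a tree with
the HDF5 C API; the helper and the group names (bar/86400/MSFT) are
illustrative only and not my actual layout:

#include <hdf5.h>

// Open a group if it already exists, otherwise create it.
static hid_t OpenOrCreateGroup( hid_t loc, const char* name ) {
  if ( H5Lexists( loc, name, H5P_DEFAULT ) > 0 )
    return H5Gopen2( loc, name, H5P_DEFAULT );
  return H5Gcreate2( loc, name, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT );
}

int main() {
  hid_t file    = H5Fcreate( "bars.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT );
  hid_t gBar    = OpenOrCreateGroup( file, "bar" );     // data type
  hid_t gDaily  = OpenOrCreateGroup( gBar, "86400" );   // bar period in seconds
  hid_t gSymbol = OpenOrCreateGroup( gDaily, "MSFT" );  // one leaf per symbol
  // ... datasets for the symbol get created under gSymbol ...
  H5Gclose( gSymbol ); H5Gclose( gDaily ); H5Gclose( gBar ); H5Fclose( file );
  return 0;
}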
Chunk sizes would be something to test. I simply pulled a number out of a
hat; mine may not be optimal, but it works for the decompression until I can
get around to optimizing things.
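For what it's worth, the chunk size is a single setting on the dataset
creation property list, so it is easy to experiment with. A sketch follows;
the 1024 below is as arbitrary as my own number was, and 'tick_type' stands
in for whatever compound record type is being stored:

#include <hdf5.h>

// Create a chunked, extensible 1-D dataset for tick records.
// 'loc' is an open file or group, 'tick_type' the compound record type.
hid_t CreateTickDataset( hid_t loc, const char* name, hid_t tick_type ) {
  hsize_t dims[1]    = { 0 };                // start empty
  hsize_t maxdims[1] = { H5S_UNLIMITED };    // allow appending later
  hsize_t chunk[1]   = { 1024 };             // records per chunk -- benchmark this
  hid_t space = H5Screate_simple( 1, dims, maxdims );
  hid_t dcpl  = H5Pcreate( H5P_DATASET_CREATE );
  H5Pset_chunk( dcpl, 1, chunk );            // chunked layout is what enables compression
  hid_t dset  = H5Dcreate2( loc, name, tick_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT );
  H5Pclose( dcpl );
  H5Sclose( space );
  return dset;
}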
For the simulation side of things, I wrote up something at
http://www.oneunified.net/blog/OpenSource/Programming/minheap.article. I
use in-memory arrays, as I have enough memory available to perform the
multi-symbol merge. In actual fact, I don't do the merge beforehand; I do
a run-time min-heap merge. This could readily be expanded to pull data
from the files on an as-needed basis when the data does not fit in
memory.
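The core of the merge is small. Here is a self-contained sketch of the idea,
using a plain std::priority_queue over in-memory vectors rather than my
actual classes; the Tick struct and Replay function are placeholders, and
swapping the vectors for readers that pull blocks out of the HDF5 datasets
is where the on-disk version would come in:

#include <cstdio>
#include <queue>
#include <vector>

struct Tick {
  long long dt;      // datetime as a sortable integer
  double    price;
  int       symbol;  // index into a symbol table
};

// A cursor into one symbol's time-ordered stream.
struct Cursor {
  const std::vector<Tick>* series;
  std::size_t              ix;
  const Tick& Current() const { return (*series)[ix]; }
};

// Reverse comparison so std::priority_queue behaves as a min-heap on time.
struct Later {
  bool operator()( const Cursor& a, const Cursor& b ) const {
    return a.Current().dt > b.Current().dt;
  }
};

// Replay all streams merged into global time order.
void Replay( const std::vector<std::vector<Tick> >& streams ) {
  std::priority_queue<Cursor, std::vector<Cursor>, Later> heap;
  for ( std::size_t i = 0; i < streams.size(); ++i ) {
    if ( !streams[i].empty() ) { Cursor c = { &streams[i], 0 }; heap.push( c ); }
  }
  while ( !heap.empty() ) {
    Cursor c = heap.top(); heap.pop();
    const Tick& t = c.Current();
    std::printf( "%lld %d %f\n", t.dt, t.symbol, t.price );  // dispatch to strategy here
    if ( ++c.ix < c.series->size() ) heap.push( c );         // advance and re-insert
  }
}

int main() {
  std::vector<std::vector<Tick> > streams( 2 );
  Tick a = { 1, 10.0, 0 }, b = { 3, 10.5, 0 }, c = { 2, 20.0, 1 };
  streams[0].push_back( a ); streams[0].push_back( b );
  streams[1].push_back( c );
  Replay( streams );  // prints the ticks in global time order: 1, 2, 3
  return 0;
}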
I make use of C++ templates to handle quotes and trades with the same code.
The code is duplicated for each class type, but I prefer speed optimization
over space optimization.
I've followed a reasonable object-oriented hierarchy. CQuote, CTrade, CBar,
and CMarketDepth are the basic data types, derived from a class called
CDatedDatum. These classes define how they store themselves in the
HDF5 files.
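In rough outline (the real members are more involved, and the HDF5 plumbing
is omitted here), the hierarchy looks something like this:

#include <boost/date_time/posix_time/posix_time.hpp>

// Common base: everything carries a high-resolution timestamp.
class CDatedDatum {
public:
  boost::posix_time::ptime dt;
};

class CQuote : public CDatedDatum {
public:
  double bid, ask;
  int    bidSize, askSize;
  // plus a method describing the quote's HDF5 compound layout (omitted)
};

class CTrade : public CDatedDatum {
public:
  double price;
  int    size;
  // likewise for the trade's HDF5 compound layout (omitted)
};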
There is then a set of time series classes: CQuotes, CTrades, CBars, and
CMarketDepths. Iterators are provided in each for retrieving and iterating
through elements based upon an integer or datetime index. A time series
can be saved to or retrieved from an HDF5 table; appending and overwriting
are also possible.
These operations rely on separate HDF5 Container, Accessor, and Iterator
templates I've written to facilitate saving, retrieving, and iterating
through on-disk time series.
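As a sketch of the idea behind the container/accessor templates (not my
actual code): any datum type that can describe its own HDF5 compound layout
can be appended with the same routine. The DefineDataType call is assumed to
exist on the datum type, and the records are assumed to be flat, fixed-size
structs whose memory layout matches that compound type:

#include <hdf5.h>
#include <vector>

// Append a block of records to an extensible 1-D dataset.
template<typename TDatum>
void AppendToDataset( hid_t dset, const std::vector<TDatum>& records ) {
  if ( records.empty() ) return;
  hid_t memType = TDatum::DefineDataType();   // assumed: compound type for TDatum

  // Grow the dataset by the number of new records.
  hid_t   fileSpace = H5Dget_space( dset );
  hsize_t current[1];
  H5Sget_simple_extent_dims( fileSpace, current, NULL );
  H5Sclose( fileSpace );
  hsize_t extended[1] = { current[0] + records.size() };
  H5Dset_extent( dset, extended );

  // Select the newly added region and write into it.
  fileSpace = H5Dget_space( dset );
  hsize_t start[1] = { current[0] };
  hsize_t count[1] = { records.size() };
  H5Sselect_hyperslab( fileSpace, H5S_SELECT_SET, start, NULL, count, NULL );
  hid_t memSpace = H5Screate_simple( 1, count, NULL );
  H5Dwrite( dset, memType, memSpace, fileSpace, H5P_DEFAULT, &records[0] );

  H5Sclose( memSpace );
  H5Sclose( fileSpace );
  H5Tclose( memType );
}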
My total dataset is only around 2GB in size. I've noticed that some HDF5
tools, including HDFView, don't like working with files that have been split
(i.e. data.000.hdf5, data.001.hdf5, where each file has a fixed maximum
size), so I recommend against using that option. It might be better to
handle file splitting manually or to use one massively big file. But a
massively big file may present archival/backup/copying problems, and if your
program is not completely crash-proof, especially during development, HDF5
files can become corrupted and may or may not be fixable. So I'd also
recommend against a single massive file. If you are dealing with 30k
instruments, then perhaps a file per day might be appropriate.
Hope this helps.
_____
From: Tito Ingargiola [mailto:tito@puppetmastertrading.com]
Sent: Tuesday, December 23, 2008 15:50
To: Ray Burkholder; hdf-forum@hdfgroup.org
Subject: Re: [hdf-forum] modelling tick data
Hi Ray,
Thank you for responding. It sounds like you have adopted my "OTPI"
approach, no? Do you have any idea at what point it will break down? That
is, is there an effective limit to the number of datasets/tables within one
file? How did you determine what chunk sizes to use? Could you detail your
merging/sorting a little bit further? In particular, you mention putting
selections into memory, but in some cases I care about, this would definitely
exhaust memory, so I need something that will perform a sorted merge as it's
reading from hdf5.
Thanks for your help,
Tito.
_____
From: Ray Burkholder <ray@oneunified.net>
To: Tito Ingargiola <tito@puppetmastertrading.com>; hdf-forum@hdfgroup.org
Sent: Tuesday, December 23, 2008 1:15:26 PM
Subject: RE: [hdf-forum] modelling tick data
I've been writing a Microsoft Visual C++ automated trading program and have
used HDF5 for the data storage component. I use chunked storage to
facilitate file compression/decompression, and I've used the built-in
HDF5 ability to break one large file into separate chunks. I use the HDF5
built-in directory structure to segregate data by symbol and day. By using
the built-in ability to link files, 'virtual directories' can be built up.
I've been collecting quote and trade data.
By creating the record structures in C++, I can easily map them into
HDF5. I use the Boost date/time library for high-resolution datetime stamps.
Using some customized STL concept code, I can use b-tree searches on my data
for selecting datetime ranges. The ranges get read into memory structures
for further processing and organization. I store instruments by day.
Higher-level code needs to worry about aggregating multiple days, if that is
needed.
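As an illustration of the mapping (simplified, not my actual record layout;
the timestamp here is stored as a 64-bit count of microseconds, with the
conversion from a Boost ptime left to the caller):

#include <hdf5.h>

// A flat trade record and the matching HDF5 compound datatype.
struct TradeRecord {
  long long dt;      // microseconds since an epoch
  double    price;
  int       size;
};

hid_t DefineTradeRecordType() {
  hid_t t = H5Tcreate( H5T_COMPOUND, sizeof( TradeRecord ) );
  H5Tinsert( t, "dt",    HOFFSET( TradeRecord, dt ),    H5T_NATIVE_LLONG );
  H5Tinsert( t, "price", HOFFSET( TradeRecord, price ), H5T_NATIVE_DOUBLE );
  H5Tinsert( t, "size",  HOFFSET( TradeRecord, size ),  H5T_NATIVE_INT );
  return t;
}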
My scale of data collection is nowhere near as extensive as yours
is/might be. But I think that with appropriate tuning and clever
programming, you can get what you want. Just make sure that whatever you
do is compiled. When one gets several hundred thousand quotes/trades for
an instrument per day, the sheer volume of data takes a while to I/O. If
compression is used (there are a few clever concepts HDF5 can use for cases
where sequential values vary only slightly in the last few bytes), that
requires additional horsepower.
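For reference, those filters are just extra calls on the dataset creation
property list; the shuffle filter is the one aimed at values that differ
mostly in their low-order bytes, and the deflate level is another knob to
tune (the numbers below are placeholders):

#include <hdf5.h>

// A dataset creation property list with shuffle and deflate filters enabled.
hid_t MakeCompressedDcpl() {
  hid_t   dcpl     = H5Pcreate( H5P_DATASET_CREATE );
  hsize_t chunk[1] = { 1024 };
  H5Pset_chunk( dcpl, 1, chunk );  // filters apply only to chunked datasets
  H5Pset_shuffle( dcpl );          // group corresponding bytes of successive values
  H5Pset_deflate( dcpl, 5 );       // zlib, level 1-9
  return dcpl;                     // pass to H5Dcreate2; H5Pclose when done
}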
If you are interested in details, let me know.
Ray
From: Tito Ingargiola [mailto:tito@puppetmastertrading.com]
Sent: Tuesday, December 23, 2008 13:58
To: hdf-forum@hdfgroup.org
Subject: [hdf-forum] modelling tick data
Hi,
I'm trying to figure out how best to use hdf5 for my data. I've been
experimenting with various options, but there seem to be many, many different
ways to model things and no relevant examples that I have come across.
Below I describe the data and its primary use, as well as some questions
about how I might most effectively model it within hdf5. I'm using the C
interface and would like to use the HL interfaces as much as possible.
Ultimately, I will also need to access this data via Java in some cases, and
I believe that my best bet is to write the storage and query code in C and
then use SWIG/JNI to access this via Java. (This is based on prototyping
I've done and my assessment of the current Java hdf5 interface.) Thus,
using pytables doesn't seem applicable for my circumstance.
I'll appreciate any responses, insights or pointers you might provide.
Thanks and best wishes for the holidays,
Tito.
--
A description of the data and its use
The data is all timestamped financial streams of "tick" data. Each record
is small (a few hundred bytes at the most), but there are many - in a day
you may see many hundred million to a few billion. Each record is naturally
partitioned by instrument (e.g., "microsoft", "ibm", "dec crude", etc.).
There are fewer than 30K instruments in the universe I might care about.
I (more or less) don't care how long it takes to construct the h5
files/structures as it will be performed offline and the only critical query
I care about is something like:
"Get ticks for instruments {i1...in} from time t1 to time t2 ordered by
time, instr".
That is, I need to be able to "replay" a subset of the instruments within
the data store over some period of time. But I really care that this be as
fast as possible.
Questions
0. Am I barking up the wrong tree? Is HDF5 an appropriate technology for
the use I've described?
1. Given the size/volume of the data, my thought is to partition h5 files by
day. Uncompressed, the files will be on the order of ~25G. Does this sound
reasonable? What are the key factors impacting this decision from an hdf5
perspective?
Two alternative models come immediately to mind: one big table (OBT) per day
ordered by instrument and then time, or one table per instrument (OTPI)
ordered by time. My current inclination is OTPI as it seems more manageable
assuming the overhead of so many tables isn't an issue.
2a. Are there other, better models you suggest I investigate?
2b. With the OBT, I'd need to be able to "index into" the table to identify
the beginning of each instrument's section (at least). How would you
recommend doing this? It seems possible to do this with references or
perhaps a separate table with numerical indices into the main table. Any
pros/cons/alternatives to these approaches?
2c. With the OTPI, I'd need to have many tables (at most ~30K) per file.
Is this an issue?
2d. For both models, I'd need to be able to merge sorted sets of h5 data
into one sorted set as quickly as possible. Is there any hdf5 support for
doing such a thing or external libraries created for this purpose?
3. What impact on retrieval/querying should I expect to see with varying
levels of compression?
4. Any suggestions on chunksizes for this application?
Many thanks for any insights you might provide!