“The best” of course depends very strongly on your use cases.
We’ve been using HDF5 for many years to store and access intraday market data; it is very effective when the number of ticks (i.e. trades, quotes, or book updates) per dataset is sufficiently high; otherwise you will drown in the overhead. This is easily achieved with the usual equity and futures data, but requires care with options data, as the majority of individual options series have very little data (something like an average of 10 trades/series/day for consolidated OPRA, among those series with at least one trade on a given day).
In our experience (and for our particular usage), it is best to aggregate into a single HDF5 file all the series for a given date and base ticker (or, almost equivalently, underlying symbol), but store them ordered by series and then time. You can then store (in attributes) a map from individual series ticker (OSI or whatever), or equivalently (expiration, putcall, strike), to an offset and extent within the single set of columns. Note you wouldn’t want to store the metadata about each series (symbol, expiration, putcall, strike) as data columns, but rather as attributes indexed in the same way as the map.
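A minimal sketch of the write side in Python with h5py, since that's the side of the fence you're on. The file name, column names, attribute names, and the tick records are all illustrative assumptions, not what we actually use:

```python
import numpy as np
import h5py

# Hypothetical tick records for one (date, base ticker) pair:
# (series key, time, price, size). In practice the series key would be
# the OSI symbol or a (expiration, putcall, strike) encoding.
ticks = [
    ("SPY141220C00200000", 34200.0, 1.25, 10),
    ("SPY141220C00200000", 34201.5, 1.30, 5),
    ("SPY141220P00195000", 34200.7, 2.10, 2),
]
ticks.sort(key=lambda t: (t[0], t[1]))  # order by series, then time

with h5py.File("SPY_2014-10-24.h5", "w") as f:
    # One flat set of columns covering every series for this date/base.
    f.create_dataset("time", data=np.array([t[1] for t in ticks]))
    f.create_dataset("price", data=np.array([t[2] for t in ticks]))
    f.create_dataset("size", data=np.array([t[3] for t in ticks], dtype=np.int64))

    # Per-series metadata and the (offset, extent) map live in attributes,
    # not in the data columns.
    keys = [t[0] for t in ticks]
    series = sorted(set(keys))
    offsets = [keys.index(s) for s in series]
    extents = [keys.count(s) for s in series]
    f.attrs["series"] = np.array(series, dtype=h5py.string_dtype())
    f.attrs["offset"] = np.array(offsets, dtype=np.int64)
    f.attrs["extent"] = np.array(extents, dtype=np.int64)
```

Because the rows are pre-sorted by series, each series occupies one contiguous run in every column, which is what makes the offset/extent map sufficient.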
This sort of layout, combined with the necessary reference data external to HDF5 to help you find the right HDF5 file for a given date and base symbol (or series symbol), enables you to retrieve the data for all series for the base symbol (along with the necessary indexing/attribute metadata) very efficiently. With a little bit of API work, it also lets you pull the data out on a series-by-series basis, by constructing the appropriate hyperslabs from the offset/extent attributes mentioned above, when you are only after a small number of individual series.
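The series-by-series read path might look like this in Python/h5py, again assuming the illustrative layout above (the column and attribute names are assumptions). The point is that slicing an h5py dataset reads only the selected hyperslab from disk, not the whole column:

```python
import h5py

def read_series(path, series_key, columns=("time", "price", "size")):
    """Read the rows for one series out of a per-(date, base) file,
    using the offset/extent attributes as a hyperslab selection."""
    with h5py.File(path, "r") as f:
        # Attribute strings may come back as bytes depending on h5py version.
        series = [s if isinstance(s, str) else s.decode() for s in f.attrs["series"]]
        i = series.index(series_key)
        off = int(f.attrs["offset"][i])
        ext = int(f.attrs["extent"][i])
        # Each slice is a contiguous hyperslab read; only ext rows come off disk.
        return {col: f[col][off:off + ext] for col in columns}
```

For the time-series-across-dates query, you would call this once per date file for the series in question, which stays cheap because each call touches only that series' contiguous run.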
In our work benchmarking this approach (accessing and processing the data from Java), we were able to get within 2-4x of KDB's speed while performing the typical sort of binning and joining with underlying data, and achieved essentially the same level of storage efficiency.
I cannot speak to the Python side of your question. But this approach works well for us from C/C++/Java/Julia, and spares those who cannot stomach Q from having to use KDB!
stephen
···
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Dan E
Sent: Sunday, October 26, 2014 12:54 PM
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] Best structure for an Hdf5 file to store historical data and how to access quickly from python
I have equity options historical data, now in CSV files - one file for each day - that I'd like to store in one or multiple h5 files for efficient resource usage and fast access. One row of data is about a dozen columns (date, symbol name, options characteristics (maturity, exercise price, etc.), price (open/high/low/close), volume, plus some other info). One day is about 700,000 rows across 4,000 different symbols.
Now here's the thing: I'd like to be able to quickly retrieve the data in "both directions", meaning:
- request, for a specific day, all options on a particular symbol or list of symbols
- request, for a specific option (or option list), the time series of historical prices between two dates.
What would be the ideal structure of the data? One vs multiple files? Group hierarchy? Python modules you recommend for accessing the data? (ideally pandas)
I tried a first implementation creating, for each day, a pandas DataFrame indexed by symbol name and then storing it in a separate h5 file under one root group 'OPTIONS'. Then, to access the data on a specific day for a specific symbol list, I would do:
h5 = pd.HDFStore(filename)
opt_df = h5['OPTIONS'].ix[symbol_list]
I would then slice opt_df using masks to finally get the options I really want.
The problem though is:
- I usually need to access only a small fraction of the 4,000 symbols (typically a couple at a time), and with this method the whole daily file is loaded into a DataFrame, which causes significant overhead.
- Although it is reasonably fast if I just need data for a specific day, it becomes extremely painful if I want to access a specific contract over a period of time, as I would then have to load the file for each of the requested dates.
How would you implement this? Also keep in mind that I am constantly adding more data (every day).
Thanks a lot for your help