Best structure for an Hdf5 file to store historical data and how to access quickly from python


#1

I have equity options historical data, now in csv files - one file for each
day, that I'd like to store in one or multiple h5 files for efficient
resource usage and fast access. One row of data is about a dozen columns
(date, symbol name, options characteristics (maturiy, exercise price
etc..), price (open/high/low/close), volume + some other infos). One day is
about 700,000 rows accross 4,000 different symbols.

Now here's the thing, I'd like to be able to quickly retrieve the data in
"both directions", meaning: - request for a specific day, all options on a
particular symbol or list of symbols - request for a specific option (or
option list) the time series of historical prices between two dates.

What would be the ideal structure of the data? One vs multiple files? Group
hierarchy? Python modules you recommend for accessing the data? (ideally
pandas)

I tried a first implementation creating for each day a Pandas DataFrame
indexed by symbol name and then storing it under a separate h5 file under
one root group 'OPTIONS' Then to access the data on a specific day for a
specific symbol list, I would do :

h5 = pd.HDFStore(filename)

opt_df = h5['OPTIONS'].ix[symbol_list]

I would then slice opt_df using masks to finally get the options I really
want.

The problem though is: - I usually need only to access a small fraction of
the 4,000 symbols (typically a couple at a time) and with this method it
loads the whole daily file in a DataFrame which causes significant overhead
- Although it is reasonably fast if I just need data for a specific day, it
becomes extremely painful if I want to access a specific contract over a
period of time as I would then have to load the file for each of the
requested dates.

How would you implement this? Also keep in mind that I am constantly adding
more data (every day).

Thanks a lot for your help


#2

Hi,

Did you try to hash all strings into numerics, which you can use later for
indexing (symbols); name data blocks for example: 2014-01-01.mat ?

Adding chunking and setting the compression filter high covers performance
versus space complexity.

The data blocks should be layed out carefully: irregular series treated
differently from regular time series as indexing the later is a breeze.

For irregular time series you may use hdf5 custom data type with pocket
tables, this allows high speed serial access. For regular time series n
dimensional slabs of some data type fits better. As an example for candle
data: instrument:time:OHLC can do well with float type.

My set up is somewhat different from yours as it includes a parallel
filesystem + cluster +MPI+ phdf5 C api. The data size is much same.
Throughput/space is superb to SQL based approach.

Steve

···

On Oct 27, 2014 8:46 AM, "Dan E" <happydejd@gmail.com> wrote:

I have equity options historical data, now in csv files - one file for
each day, that I'd like to store in one or multiple h5 files for efficient
resource usage and fast access. One row of data is about a dozen columns
(date, symbol name, options characteristics (maturiy, exercise price
etc..), price (open/high/low/close), volume + some other infos). One day is
about 700,000 rows accross 4,000 different symbols.

Now here's the thing, I'd like to be able to quickly retrieve the data in
"both directions", meaning: - request for a specific day, all options on a
particular symbol or list of symbols - request for a specific option (or
option list) the time series of historical prices between two dates.

What would be the ideal structure of the data? One vs multiple files?
Group hierarchy? Python modules you recommend for accessing the data?
(ideally pandas)

I tried a first implementation creating for each day a Pandas DataFrame
indexed by symbol name and then storing it under a separate h5 file under
one root group 'OPTIONS' Then to access the data on a specific day for a
specific symbol list, I would do :

h5 = pd.HDFStore(filename)

opt_df = h5['OPTIONS'].ix[symbol_list]

I would then slice opt_df using masks to finally get the options I really
want.

The problem though is: - I usually need only to access a small fraction of
the 4,000 symbols (typically a couple at a time) and with this method it
loads the whole daily file in a DataFrame which causes significant overhead
- Although it is reasonably fast if I just need data for a specific day, it
becomes extremely painful if I want to access a specific contract over a
period of time as I would then have to load the file for each of the
requested dates.

How would you implement this? Also keep in mind that I am constantly
adding more data (every day).

Thanks a lot for your help

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org

http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


#3

“the best” of course depends very strongly on your use cases.

We’ve been using HDF5 for many years to store and access intraday market data; it is very effective when the number of ticks (i.e. trades, quotes, or book updates) per dataset is sufficiently high; otherwise you will drown in the overhead. This is easily achieved with the usual equity and futures data, but requires care with options data, as the majority of individual options series have very little data (something like an average of 10 trades/series/day for consolidated OPRA for those series with at least one trade on a given day).

In our experience (and applied to our particular usage), it is best to aggregate into a single HDF5 file all the series by date and base ticker (or almost equivalently the underlying symbol), but store ordered by the series and then time. You can then store (in attributes) a map from individual series ticker (osi or whatever) or equivalently (expiration, putcall, strike) to an offset and extent within the single set of columns. Note you wouldn’t want to store the metadata about each series (symbol, expiration, putcall, strike) as data columns, but rather as attributes indexed as with the map.

This sort of layout, combined with the necessary reference data external to HDF5 to help you find the right HDF5 for a given date and base symbol or series symbol, then enables you to retrieve the data for all series for the base symbol (along with the necessary indexing/attribute metadata) in a very efficient fashion; and with a little bit of API work, also allows you to pull the data out on a series-by-series basis by constructing the appropriate hyperslabs based on the offset/extent attributes mentioned above, when you are only after a small number of individual series.

In our work benchmarking this approach (accessing and processing the data from Java), we were able to get to within 2-4x slower than using KDB while performing the typical sort of binning and joining with underlying data operations, and achieve essentially the same level of storage efficiency.

I cannot speak to the Python side of your question. But this approach works well for us from C/C++/Java/Julia, and spares those who cannot stomach Q from having to use KDB!

stephen

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Dan E
Sent: Sunday, October 26, 2014 12:54 PM
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] Best structure for an Hdf5 file to store historical data and how to access quickly from python

I have equity options historical data, now in csv files - one file for each day, that I'd like to store in one or multiple h5 files for efficient resource usage and fast access. One row of data is about a dozen columns (date, symbol name, options characteristics (maturiy, exercise price etc..), price (open/high/low/close), volume + some other infos). One day is about 700,000 rows accross 4,000 different symbols.

Now here's the thing, I'd like to be able to quickly retrieve the data in "both directions", meaning: - request for a specific day, all options on a particular symbol or list of symbols - request for a specific option (or option list) the time series of historical prices between two dates.

What would be the ideal structure of the data? One vs multiple files? Group hierarchy? Python modules you recommend for accessing the data? (ideally pandas)

I tried a first implementation creating for each day a Pandas DataFrame indexed by symbol name and then storing it under a separate h5 file under one root group 'OPTIONS' Then to access the data on a specific day for a specific symbol list, I would do :

h5 = pd.HDFStore(filename)

opt_df = h5['OPTIONS'].ix[symbol_list]

I would then slice opt_df using masks to finally get the options I really want.

The problem though is: - I usually need only to access a small fraction of the 4,000 symbols (typically a couple at a time) and with this method it loads the whole daily file in a DataFrame which causes significant overhead - Although it is reasonably fast if I just need data for a specific day, it becomes extremely painful if I want to access a specific contract over a period of time as I would then have to load the file for each of the requested dates.

How would you implement this? Also keep in mind that I am constantly adding more data (every day).

Thanks a lot for your help

######################################################################
The information contained in this communication is confidential and
may contain information that is privileged or exempt from disclosure
under applicable law. If you are not a named addressee, please notify
the sender immediately and delete this email from your system.
If you have received this communication, and are not a named
recipient, you are hereby notified that any dissemination,
distribution or copying of this communication is strictly prohibited.
######################################################################


#4

26.10.2014 21:53, Dan E пишет:

I have equity options historical data, now in csv files - one file for
each day, that I'd like to store in one or multiple h5 files for
efficient resource usage and fast access. One row of data is about a
dozen columns (date, symbol name, options characteristics (maturiy,
exercise price etc..), price (open/high/low/close), volume + some other
infos). One day is about 700,000 rows accross 4,000 different symbols.

Hi Dan!

It seems that indexed SQLite database would do the job for you.

HDF5 is best for homogeneous array-like primitive-type data. However, you are going to store tabular data. In any case, I recommend you to test proposed HDF5 solutions vs SQLite solution.

Best wishes,
Andrey Paramonov

···

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


#5

This is not necessarily true. Many people store this type of data in PyTables, which is the backing store for Pandas and uses HDF5 internally.

I'm travelling today, but I'll provide a real response to this in a day or two.

Dana

···

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of
??? ???
Sent: Monday, October 27, 2014 8:58 AM
To: hdf-forum@lists.hdfgroup.org
Subject: Re: [Hdf-forum] Best structure for an Hdf5 file to store
historical data and how to access quickly from python

26.10.2014 21:53, Dan E пишет:
> I have equity options historical data, now in csv files - one file for
> each day, that I'd like to store in one or multiple h5 files for
> efficient resource usage and fast access. One row of data is about a
> dozen columns (date, symbol name, options characteristics (maturiy,
> exercise price etc..), price (open/high/low/close), volume + some
> other infos). One day is about 700,000 rows accross 4,000 different
symbols.

Hi Dan!

It seems that indexed SQLite database would do the job for you.

HDF5 is best for homogeneous array-like primitive-type data. However, you
are going to store tabular data. In any case, I recommend you to test
proposed HDF5 solutions vs SQLite solution.

Best wishes,
Andrey Paramonov

--
This message has been scanned for viruses and dangerous content by
MailScanner, and is believed to be clean.

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-
forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


#6

Hello,
I am facing a similar issue and was thinking about storing the data with one hdf5 file per option expiry (so with all tickers whose option expire on that date).
What storing method did you end with please?