Storing stock data tick by tick, best way?

Hi,
I’m translating tick-by-tick CSV stock data files to HDF5 to manage them efficiently and access the data rapidly.
I’m undecided whether to create:

  1. a single file for all stocks, with a daily table for each share
  2. one file per day, with data from all stocks divided into tables
  3. a file for each stock, with a daily table
My main problem is being able to access data quickly by selecting a time frame across different stocks, for example:

1 min for 10 days of 5 stocks
1 min for 5 years of 5 stocks

I have already planned a dedicated 1 TB SSD (expandable) for managing only the database. I do not intend to use data compression, because it would get in the way of finding the data quickly, and I am thinking of using UTC time.

Any suggestions?

Check out H5CPP.ca, an easy-to-use template library specifically designed for event-based datasets such as HFT data; it supports most linear-algebra libraries as well as std::vector<your_pod_struct>.

You could model the data as an extendable dataset of a cube or higher dimensions, and cross-cut it to get the desired slice. Armadillo supports cubes out of the box; h5cpp partial read/write can map higher-dimensional slices into vectors, matrices and cubes.

h5::create<double>(fd, "prices", h5::max_dims{nminutes_in_day, ninstruments, H5S_UNLIMITED}); // an extendable cube; "prices" is a placeholder dataset name
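To make that concrete, here is a minimal sketch of the extendable-cube layout and a cross-cut read. The 390-minute trading day, the 5 instruments, the names prices.h5 and /prices, and the one-chunk-per-day layout are all illustrative assumptions, and whether h5cpp reconciles the 2-D matrix against the 3-D slab exactly like this is worth verifying against its docs:

```cpp
#include <armadillo>   // include before h5cpp so the Armadillo adapters are picked up
#include <h5cpp/all>

int main() {
    h5::fd_t fd = h5::create("prices.h5", H5F_ACC_TRUNC);

    // extendable cube: minutes x instruments x trading days (day axis unlimited)
    h5::ds_t ds = h5::create<double>(fd, "/prices",
        h5::current_dims{390, 5, 1},
        h5::max_dims{390, 5, H5S_UNLIMITED},
        h5::chunk{390, 5, 1});                        // one chunk per day, no compression

    arma::mat day(390, 5, arma::fill::randu);         // fake 1-min bars for the sketch
    h5::write(ds, day, h5::offset{0, 0, 0}, h5::count{390, 5, 1});

    // cross-cut: pull one day for all instruments back out as a matrix
    auto slice = h5::read<arma::mat>(fd, "/prices",
        h5::offset{0, 0, 0}, h5::count{390, 5, 1});
}
```

Reading 10 days of 5 stocks is then just a wider h5::count along the day axis, landing in an arma::cube instead of a matrix.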

A single dataset may be a good representation for reads and writes, but it is harder to see how you would organize trading days, etc. Therefore I prefer keeping HFT irregular-time-series (IRTS) datasets in single daily streams, and similarly regular-time-series (RTS) datasets in single daily matrices. Then use a circular buffer to stitch them together, so you can roll along the days with a given set of days in view (see the sketch below).
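A minimal sketch of that circular-buffer idea, assuming one (minutes × instruments) matrix per trading day stored as its own daily dataset; the day_window type, the roll method, and the per-day dataset naming are my own illustration, not h5cpp API:

```cpp
#include <armadillo>
#include <h5cpp/all>
#include <deque>
#include <string>

// keep the last `depth` trading days in view; the oldest day falls off the back
struct day_window {
    std::size_t depth;
    std::deque<arma::mat> days;

    void roll(const h5::fd_t& fd, const std::string& day_dataset) {
        days.push_back(h5::read<arma::mat>(fd, day_dataset));  // load the new day
        if (days.size() > depth)
            days.pop_front();                                  // the circular-buffer step
    }
};
```

Rolling forward then costs one h5::read per new day, and the deque is the stitched view across the days currently in scope.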

Tables are non-homogeneous (compound) datasets: clumsy to use directly, and you cannot place them into a linear-algebra container. The data is better represented as homogeneous floats/doubles.

best wishes

steven


We use the Python library dask (installs easily with Anaconda Python). It lets you define datasets/arrays spanning memory and files, using e.g. dask.array.concatenate. Then you can lay out the files however you want across timeframes and disk systems.
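For instance, a small sketch assuming one HDF5 file per trading day, each holding a (minutes × instruments) dataset named /prices; the file names, shapes and chunk sizes are illustrative:

```python
import h5py
import dask.array as da

# one file per trading day, each with a (390 minutes x 5 instruments) dataset
files = [h5py.File(f"ticks_day{i:03d}.h5", "r") for i in range(10)]
days = [da.from_array(f["/prices"], chunks=(390, 5)) for f in files]

prices = da.concatenate(days, axis=0)   # one logical array spanning all 10 days
window = prices[:, :5].compute()        # 1-min bars, 10 days, 5 stocks, in memory
```

Because dask is lazy, only the chunks the slice actually touches are read from disk, so time-frame selections stay fast even across years of daily files.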

Personally, on my desktop PC, I use the btrfs filesystem, which can span multiple drives. That way you can buy whatever disk(s) make sense for you and not worry too much about the physical devices.