Storing stock data tick by tick, best way?

Hi,
I’m translating tick-by-tick CSV stock data files to HDF5 to manage them efficiently and access the data rapidly.
I’m undecided whether to create:

  1. a single file for all stocks, with a daily table for each share
  2. one file per day, with data from all stocks divided into tables
  3. a file for each stock, with a daily table
My main problem is being able to access data quickly by selecting a time frame across different stocks, for example:

1 min for 10 days of 5 stocks
1 min for 5 years of 5 stocks

I have already planned a dedicated 1 TB SSD (expandable) for managing only the database. I do not intend to use data compression, because it would get in the way of finding the data quickly, and I am thinking of using UTC time.

Any suggestions?

Check out H5CPP.ca, an easy-to-use template library specifically designed for event-based datasets such as HFT data; it supports most linear-algebra libraries as well as std::vector<your_pod_struct>.

You could model the data as an extendable dataset of a cube or higher dimensions, and cross-cut it to get the desired slice. Armadillo supports cubes out of the box; h5cpp partial read/write can map higher-dimensional slices into vectors, matrices and cubes.

h5::create<double>(fd, "prices", h5::max_dims{nminutes_in_day, ninstruments, H5S_UNLIMITED}); // an extendable cube; "prices" is a placeholder dataset name
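To make that concrete, here is a minimal sketch of the extendable-cube layout and a cross-cut read. The 390-minute trading day, the 5 instruments, the names prices.h5 and /prices, and the one-chunk-per-day layout are all illustrative assumptions, and whether h5cpp reconciles the 2-D matrix against the 3-D slab exactly like this is worth verifying against its docs:

```cpp
#include <armadillo>   // include before h5cpp so the Armadillo adapters are picked up
#include <h5cpp/all>

int main() {
    h5::fd_t fd = h5::create("prices.h5", H5F_ACC_TRUNC);

    // extendable cube: minutes x instruments x trading days (day axis unlimited)
    h5::ds_t ds = h5::create<double>(fd, "/prices",
        h5::current_dims{390, 5, 1},
        h5::max_dims{390, 5, H5S_UNLIMITED},
        h5::chunk{390, 5, 1});                        // one chunk per day, no compression

    arma::mat day(390, 5, arma::fill::randu);         // fake 1-min bars for the sketch
    h5::write(ds, day, h5::offset{0, 0, 0}, h5::count{390, 5, 1});

    // cross-cut: pull one day for all instruments back out as a matrix
    auto slice = h5::read<arma::mat>(fd, "/prices",
        h5::offset{0, 0, 0}, h5::count{390, 5, 1});
}
```

Reading 10 days of 5 stocks is then just a wider h5::count along the day axis, landing in an arma::cube instead of a matrix.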

A single dataset may be a good representation for reads and writes, but it is harder to see how you would organize trading days, etc. Therefore I prefer keeping HFT irregular-time-series (IRTS) datasets in single daily streams, and similarly regular-time-series (RTS) datasets in single daily matrices. Then use a circular buffer to stitch them together, so you can roll along the days with a given set of days in view (see the sketch below).
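A minimal sketch of that circular-buffer idea, assuming one (minutes × instruments) matrix per trading day stored as its own daily dataset; the day_window type, the roll method, and the per-day dataset naming are my own illustration, not h5cpp API:

```cpp
#include <armadillo>
#include <h5cpp/all>
#include <deque>
#include <string>

// keep the last `depth` trading days in view; the oldest day falls off the back
struct day_window {
    std::size_t depth;
    std::deque<arma::mat> days;

    void roll(const h5::fd_t& fd, const std::string& day_dataset) {
        days.push_back(h5::read<arma::mat>(fd, day_dataset));  // load the new day
        if (days.size() > depth)
            days.pop_front();                                  // the circular-buffer step
    }
};
```

Rolling forward then costs one h5::read per new day, and the deque is the stitched view across the days currently in scope.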

Tables are non-homogeneous (compound) datasets: clumsy to use directly, and you cannot place them into a linear-algebra container. The data is better represented as homogeneous floats/doubles.

best wishes

steven


We use the Python library dask (installs easily with Anaconda Python). It lets you define datasets/arrays spanning memory and files, using e.g. dask.array.concatenate. Then you can lay out the files however you want across timeframes and disk systems.
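For instance, a small sketch assuming one HDF5 file per trading day, each holding a (minutes × instruments) dataset named /prices; the file names, shapes and chunk sizes are illustrative:

```python
import h5py
import dask.array as da

# one file per trading day, each with a (390 minutes x 5 instruments) dataset
files = [h5py.File(f"ticks_day{i:03d}.h5", "r") for i in range(10)]
days = [da.from_array(f["/prices"], chunks=(390, 5)) for f in files]

prices = da.concatenate(days, axis=0)   # one logical array spanning all 10 days
window = prices[:, :5].compute()        # 1-min bars, 10 days, 5 stocks, in memory
```

Because dask is lazy, only the chunks the slice actually touches are read from disk, so time-frame selections stay fast even across years of daily files.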

Personally, on my desktop PC, I use the btrfs filesystem, which can span multiple drives. That way you can buy whatever disk(s) make sense for you and not worry too much about the physical devices.