Query timeseries data

schachtobi · May 2, 2019, 2:20pm

I like to store data that has the following structure

| time UINT64 | ch1 float | ch2 int | ch3 uint | ch4 bool |
| 100         | 22.3      |     -44 |  55      |  true    |
| 120         | 22.3      |     -44 |  55      |  true    |
| 125         | 22.3      |     -44 |  55      |  true    |
| 130         | 22.3      |     -44 |  55      |  true    |
| ...         | 22.3      |     -44 |  55      |  true    |
|50000000     | 22.3      |     -44 |  55      |  true    |

The dataset has only one column with a compund datatype (time being the first component).

Now I would like to be able to do a selection of rows that are in the range time[5000,6000). I can think of several ways to acomplish this but I am not sure if this is the prefered way:

I could create another dataset that only stores the timestamps at the same indices as the data. Then when querying the range I first do a binary-search lookup in this time table to get the indices and then gather the data at these indices from the (compound) data dataset.
Since the values in the time columns are monotonic I could to a binary search on the table by selecting values to be compared first with H5Sselect_hyperslab to get the relevant row indices. This has the drawback that I need to read many chunks from disk so I expect this not to be very efficient. Is this assumption correct?
The pytable project uses indices to speedup these kind of searches using the opsi indexes. I read the paper here. Unfortunately I can not use these kind of indexes since I want to be able to consume the data from a c++ application. And as far as I know there is no c++ implementation of the opsi indexes available inside the hdf5lib. Is this correct? Another drawback is that as far as I know compound datatypes are not supported. Right?
Another approach would be to store the timestamp together with the row index in a relational database. This could be postgres or sqlite. When querying the data I would first lookup the row indices in the database and then after selecting the hyperslab get the data out of the hdf5 file.

As you can guess I am a novice in using HDF5 and I would be very happy if someone with more experience could guide me to the optimal solution for this usecase.

Questions (summary):

What solution should I pick? 1,2,3,4, other?
When 2: Is this approach performant?
When 3: Is there a usable version for C++ and compound datatypes? Are there usable alternatives?
When 4: Feasible?

gheber · May 2, 2019, 2:35pm

Is the “table” fixed or expanding? What are you trying to optimize?

steven · May 2, 2019, 3:00pm

For HDF5 C++ interface check out H5CPP, a modern tuned implementation that supports packet tables with h5::append operator.

For number 3.
H5CPP comes with compile time reflection through its LLVM based code transformation tool; successfully used in HFT environment, the generated HDF5 COMPOUND descriptor is guaranteed to be minimal by internal graph algorithm. You can find Gerd Heber’s slides on this webpage and scroll through these examples especially the compound dataype and packet table implementations.

As for indices: there is no perfect solution and due to time constraints I can’t go through the paper you mention; however you easily can build your own.
You are good with number ‘1. and 2.’ idea, as you have to maintain a sorted index for time series. I suggest you to keep it simple, and dense.

For HFT I prefer storing time stamp in the records in one single stream: bids,asks,trades together, in order of arrival. – some markets have additional datastream for out-of-order executions, so keep that in mind.
Then in separate datasets for each day I store RegularTimeSeries with a fixed interval – seconds or minutes.

best wishes: steven
the author of H5CPP

schachtobi · May 2, 2019, 5:38pm

The table should be appendable, so I will use the chunked storage format. I try to optimize both the insert rate and the query rate. The insert rate is roughly 100Mbit/s. The query time depends on the granularity. I plan also to store the data at lower sampling rates to get an overview-graph (for long time ranges e.g. 1Week) without the need to downsample live.

schachtobi · May 2, 2019, 5:52pm

Sounds good. I will definitely try H5CPP. The reflection suggestion is probably not needed since I store my configuration about the datatypes in a flatbuffers configuration.

Indices: So I think I first try implementing an index table for time first (1)
Thanks

steven · May 2, 2019, 6:31pm

The insert rate for H5CPP h5::append is near HDD speed, it uses the newest HDF5 CAPI direct chunk write calls. On my Lenovo x230 ~ 300-400MByte/sec. You can find the performance in profile/throughput directory.
Yes, it is simpler to have RTS at different time interval then resample the IRTS stream.

steven · May 2, 2019, 7:15pm

FYI: here is a possible layout that works across multiple stat platforms: Julia/Matlab/R/Python. RTS is organised into homogeneous container or matrices, IRTS is recorded as HDF5 Compound Type. Trading days are encoded into directories, which easily iterated in stat packages, and used for modelling. IRTS is for backtesting.
Indices are plain ut8 vector data types. You could build similar indices for time stamps if you think that is a good idea. However because of the high throughput that exceeds posix IO calls often cheaper to load the dataset and do a binary search on the given IRTS stream.

hope it helps,
steve

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

Query timeseries data