Chunking for Simulation Time-Series Data

Hi,

I’m working with a simulation that uses an old version of HDF5 (1.8.21) and writes time-series data with the following chunk layout:

[1 timestep] x [# variables] x [# items]
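
For concreteness, here is a minimal sketch of that write pattern; the dataset name, sizes, and datatype are made up, but the chunking and the per-timestep extend-and-write match what the simulation does:

```c
#include "hdf5.h"

#define NVARS  8    /* # variables (placeholder) */
#define NITEMS 100  /* # items (placeholder)     */

int main(void)
{
    hsize_t dims[3]    = {0, NVARS, NITEMS};             /* start empty      */
    hsize_t maxdims[3] = {H5S_UNLIMITED, NVARS, NITEMS}; /* grow over time   */
    hsize_t chunk[3]   = {1, NVARS, NITEMS};             /* 1 step per chunk */

    hid_t file  = H5Fcreate("run.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    hid_t dset  = H5Dcreate2(file, "timeseries", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    double step[NVARS][NITEMS] = {{0}};   /* one timestep's worth of data */
    for (hsize_t t = 0; t < 10; t++) {    /* one extend + write per step  */
        hsize_t newdims[3] = {t + 1, NVARS, NITEMS};
        H5Dset_extent(dset, newdims);

        hid_t   fspace   = H5Dget_space(dset);
        hsize_t start[3] = {t, 0, 0};
        hsize_t count[3] = {1, NVARS, NITEMS};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        hid_t mspace = H5Screate_simple(3, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, step);
        H5Sclose(mspace);
        H5Sclose(fspace);
    }

    H5Pclose(dcpl);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```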

This is about the worst possible chunk layout for time-series access: reading the full history of even a single variable has to touch one chunk per timestep, so there are simply too many reads when a run spans hundreds of thousands of timesteps.
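
And here is a sketch of the kind of read that hurts: one variable's history for one item. Even as a single H5Dread call, the selection crosses one chunk per timestep, so the library has to locate and fetch every one of them (names and the layout are the placeholders from the sketch above):

```c
#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("run.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "timeseries", H5P_DEFAULT);

    hid_t   fspace = H5Dget_space(dset);
    hsize_t dims[3];
    H5Sget_simple_extent_dims(fspace, dims, NULL);

    /* Select variable 0, item 0, across every timestep. */
    hsize_t start[3] = {0, 0, 0};
    hsize_t count[3] = {dims[0], 1, 1};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t   mspace  = H5Screate_simple(1, &dims[0], NULL);
    double *history = malloc(dims[0] * sizeof *history);

    /* With chunk shape {1, NVARS, NITEMS}, this single call still has to
     * visit dims[0] separate chunks just to return dims[0] values.       */
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, history);

    free(history);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```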

However, I’m not sure what the alternatives are, since I need real-time access to the data while the simulation is in progress. So I don’t think buffering thousands of timesteps in memory before writing will work: how long the simulation takes to produce each timestep is non-deterministic, so a large buffer could leave readers waiting arbitrarily long for new data.

I’ve repacked the data once the simulation is complete, but this is obviously not a valid solution for a consumer application, especially considering how slow repacking is.
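
For reference, this is roughly what I’ve been doing after the fact, rewriting the file with many timesteps per chunk; the dataset name and chunk dimensions are just placeholders:

```
h5repack -l timeseries:CHUNK=1024x8x100 run.h5 run_repacked.h5
```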

How does one actually chunk time-series data optimally when it grows dynamically like this?

Thanks.

Hi, @karim.nogas!

This is an interesting problem, but I wish I knew more details about the characteristics of your data and code.

First, do # variables and # items vary from one timestep to the next? For example, do some sensors (variables) generate data at a given step while others don’t?

Second, would you please share the actual HDF5 code that the simulation uses, beyond the sketches above?

Third, when you say “too many reads”, do you mean the number of HDF5 API read calls?

Finally, may I ask you a favor? Would you please try the latest HDF5 (1.14.4.2) and see if the result is the same?

If it is, would you please try another solution like QuestDB (questdb.io) and see if it can be used in your software stack along with your HDF5 solution? That is, use QuestDB as an intermediate store for real-time access during the run, and archive the data from QuestDB into HDF5 later.
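
To make that idea concrete, here is a rough sketch of streaming one sample into QuestDB over its InfluxDB line protocol (ILP) listener. QuestDB accepts ILP on TCP port 9009 by default; the table, column names, and value below are only placeholders:

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(9009);            /* QuestDB's default ILP port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    /* One ILP line per sample; QuestDB appends the row on arrival and uses
     * its own clock for the timestamp since none is given, so readers can
     * query the data while the simulation is still running.               */
    char line[256];
    int  n = snprintf(line, sizeof line,
                      "sim_ts,variable=pressure,item=42 value=101.325\n");
    send(fd, line, (size_t)n, 0);

    close(fd);
    return 0;
}
```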

Thank you so much for posting an interesting problem!