I was wondering if it’s possible (or reasonable) to create a partial chunk at the beginning of a dataset. For example, I’ve got a dataset with a single time dimension. I want to keep all of the chunk periods the same, but I’m periodically prepending more data to the beginning of the dataset (and potentially changing values in other chunks).
What advice would you give for such a use-case?
Hi,
Thanks for your question!
I’m not quite sure what you mean by “partial chunk” … in HDF5, if you have a chunked dataset, new chunks get allocated as you write to it. Depending on the relative sizes of the dataset and chunk dimensions, you may have “edge chunks” at the end. Say the chunk shape is (100, 100) and the dataset shape is (200, 150). You’d have at most 4 allocated chunks, and each of the two right-most ones would contain 100 × 50 × sizeof(datatype) bytes stored in the file (assuming no compression) that would never hold any data.
As a rule it’s a good idea to have the dataset extent divisible by the chunk extent to avoid this. In any case, partial chunks will always be at the end of the dataset, not the front.
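For concreteness, here is a minimal sketch of that example using the h5py bindings (my choice, since the thread doesn’t name a language; the file and dataset names are placeholders), comparing the logical data size with the allocated chunk storage:

```python
import numpy as np
import h5py

with h5py.File("edge_chunks.h5", "w") as f:
    # 200 x 150 dataset stored in 100 x 100 chunks -> 2 x 2 = 4 chunks;
    # the right-hand pair only ever holds 100 x 50 useful elements each.
    dset = f.create_dataset("demo", shape=(200, 150), chunks=(100, 100),
                            dtype="f8")
    dset[...] = np.random.random((200, 150))

    logical = 200 * 150 * dset.dtype.itemsize   # bytes of real data
    on_disk = dset.id.get_storage_size()        # bytes of allocated chunks
    print(f"logical: {logical}, allocated: {on_disk}, "
          f"wasted: {on_disk - logical}")       # expect 2 * 100*50*8 bytes wasted
```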
When you prepend data, do you mean you are adding data to the front of the dataset and shifting the other elements down? That will be an expensive operation. Can you just append to the back and logically reorder the sequence of elements (i.e., newly added elements always go to the right of previously written ones)?
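For what it’s worth, appending to the back of a 1-D chunked dataset only requires it to have been created with an unlimited max dimension. A minimal h5py sketch (again assuming h5py; the names are placeholders):

```python
import numpy as np
import h5py

with h5py.File("series.h5", "w") as f:
    # maxshape=(None,) makes the dataset extendable along its only axis.
    dset = f.create_dataset("temperature", shape=(0,), maxshape=(None,),
                            chunks=(1024,), dtype="f8")

    new_block = np.random.random(365)           # a batch that arrived "late"
    old_len = dset.shape[0]
    dset.resize((old_len + new_block.size,))    # grow the dataset at the back
    dset[old_len:] = new_block                  # write the new values
```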
I’ll try to make a simple example of my problem.
Let’s say I have water temperature data modelled over time (at a specific point in space, but let’s ignore the space for this example). I initially have data from 2021-01-01 to 2021-12-31. In HDF5, I create a one-dimensional dataset for the time data and a one-dimensional dataset for the temperature data, and associate the latter with the time label. In the process, I need to give both datasets a chunk size (and other parameters). As you mentioned, if the chunk size doesn’t divide evenly into the total length of the dataset, then there will be “edge chunks” at the end.
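In h5py (which I’ll assume here just for illustration; the names, units, and chunk size are placeholders), that initial setup might look roughly like this, with the time dataset attached as a dimension scale:

```python
import numpy as np
import h5py

n_days = 365  # 2021-01-01 .. 2021-12-31

with h5py.File("water_temp.h5", "w") as f:
    # Time axis stored as days since 2021-01-01; the chunk size is arbitrary.
    time = f.create_dataset("time", data=np.arange(n_days, dtype="i8"),
                            maxshape=(None,), chunks=(64,))
    time.attrs["units"] = "days since 2021-01-01"

    temp = f.create_dataset("temperature",
                            data=np.random.uniform(0, 30, n_days),
                            maxshape=(None,), chunks=(64,), dtype="f8")

    # Associate the temperature values with the time coordinate
    # (HDF5 dimension scales).
    time.make_scale("time")
    temp.dims[0].attach_scale(time)
```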
The question is, what happens when I try to add new data prior to 2021-01-01? To keep the dataset ordered in time, I’d need to prepend the data to the front of the dataset, but as you mentioned that’s probably a computationally expensive task because (I assume) HDF5 needs to rework all the chunks to realign them to the newly updated dataset. I guess if order doesn’t matter, your suggestion of appending the new data to the end makes sense, as only the edge chunks (and new chunks) would need to be updated rather than all chunks. I was just hoping to find some way to keep the data ordered in time without having to rework all of the existing chunks. Internally, I assume that each chunk is indexed by its logical position in the dataset, so that the HDF5 library knows where each chunk should go. It feels like it should be possible to add more chunks and tell HDF5 to put them in front of the other chunks, but I’m not knowledgeable enough in the low-level API to know. If it’s not possible, that’s fine; I just wanted to make sure I wasn’t missing something.
I’m sorry, but there’s no easy way to add data to both ends of a one-dimensional dataset. I think you’ll be forced to do a data move one way or the other.
It would be an interesting feature to allow negative dataset extents, i.e., say you start off with a dataset with dimensions 0:10, need to prepend 10 elements, and extend the dataset at the lower bound, so you wind up with dimensions -10:10.
I don’t recall this particular proposal coming up before, but it would seem useful. Likely it will require some effort to implement though.
Yes, using virtual datasets is an idea. Depending on how many updates you will be doing, there will be a certain amount of overhead, so verify the performance will work for you.
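As a sketch of that approach (using h5py’s virtual dataset API; the file names, dataset names, and block sizes are all made up for illustration), you could keep each delivered block in its own file and stitch them together in chronological order through a virtual dataset, rebuilding only the small VDS file when an earlier block arrives:

```python
import numpy as np
import h5py

# Two source files: the original 2021 data and a block that arrived later
# but covers earlier dates (365 and 31 elements are just example sizes).
n_2021, n_dec2020 = 365, 31

layout = h5py.VirtualLayout(shape=(n_dec2020 + n_2021,), dtype="f8")
layout[:n_dec2020] = h5py.VirtualSource("temp_2020-12.h5", "temperature",
                                        shape=(n_dec2020,))
layout[n_dec2020:] = h5py.VirtualSource("temp_2021.h5", "temperature",
                                        shape=(n_2021,))

with h5py.File("temperature_vds.h5", "w") as f:
    # The virtual dataset presents both blocks as one chronologically
    # ordered array without copying or moving any chunks.
    f.create_virtual_dataset("temperature", layout, fillvalue=np.nan)
```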
Another idea would be to define element 0 as the earliest possible time you would ever have. With chunked datasets, storage is only allocated for the chunks you actually write to. You’ll need to come up with a method to indicate where the “real” data starts; you could use fill values or an attribute giving the starting index.
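A rough h5py sketch of that idea (the earliest-possible start date, chunk size, and attribute name are my own assumptions for illustration):

```python
import numpy as np
import h5py

# Pretend 2000-01-01 is the earliest time we could ever need, and index the
# time axis as days since that date. 2021-01-01 is then day 7671.
max_days = 40000            # generous upper bound on the series length
start_2021 = 7671           # offset of the first real value

with h5py.File("water_temp_sparse.h5", "w") as f:
    temp = f.create_dataset("temperature", shape=(max_days,),
                            chunks=(1024,), dtype="f8",
                            fillvalue=np.nan)    # unwritten chunks read as NaN
    # Record where the real data currently begins.
    temp.attrs["first_valid_index"] = start_2021

    # Only the chunks touched by this write are allocated in the file.
    temp[start_2021:start_2021 + 365] = np.random.uniform(0, 30, 365)

    # Later, "prepending" December 2020 just writes to earlier indices and
    # updates the attribute; no existing chunks move.
    temp[start_2021 - 31:start_2021] = np.random.uniform(0, 30, 31)
    temp.attrs["first_valid_index"] = start_2021 - 31
```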
I like @jready’s idea of defining element 0 as the earliest possible time, taking advantage of HDF5’s allocate-on-write behaviour for chunks. This is effectively a form of sparse data storage. It is also the simplest strategy, because it uses only the basic HDF5 API and does not tinker with internals.
A related strategy would be to append all new time steps at the end of the physical dataset along an unlimited time dimension. Records would no longer be in chronological order, but the file stays as compact as possible with no empty records. Keep the date/time values in a separate small “index variable” alongside the data array, and use it to retrieve records in chronological order when reading the data back out later. This strategy also avoids advanced features and tinkering with internals, and it allows unlimited “prepending” with no risk of overflowing an initial estimate of the start time.
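A minimal h5py sketch of that strategy (the dataset names and the use of day numbers as timestamps are just illustrative assumptions):

```python
import numpy as np
import h5py

with h5py.File("appended.h5", "w") as f:
    # Unlimited 1-D datasets: values in arrival order, plus a time index.
    temp = f.create_dataset("temperature", shape=(0,), maxshape=(None,),
                            chunks=(1024,), dtype="f8")
    time = f.create_dataset("time", shape=(0,), maxshape=(None,),
                            chunks=(1024,), dtype="i8")

    def append(days, values):
        """Append a block of (time, value) pairs, regardless of chronology."""
        n = temp.shape[0]
        temp.resize((n + len(values),)); temp[n:] = values
        time.resize((n + len(days),));   time[n:] = days

    append(np.arange(365), np.random.uniform(0, 30, 365))     # 2021 data
    append(np.arange(-31, 0), np.random.uniform(0, 30, 31))   # "prepended" Dec 2020

    # Read back in chronological order using the time index.
    order = np.argsort(time[...])
    chronological = temp[...][order]
```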
Thank you all very much for your advice! I really appreciate the discussions! I do like to push software to its limits.
I do like the idea of initially creating a dataset that starts really early, at a known, consistent start, and only filling the chunks with data as needed. This will also make it easier to modify already-written chunks later, since the chunk layout is set in stone at dataset creation, which wouldn’t be the case if I simply appended more data to the end (out of chronological order).
I’m assuming that there isn’t too much extra disk space required to create a larger-than-immediately-necessary hash/B-tree chunk index for all those extra “empty” chunks.