We use HDF5 to store hundreds of very large one-dimensional datasets - like very high-frequency audio data where each dataset has its own frequency. These large datasets, whose final size is not known at the time of creation, are

- chunked and compressed
- split into multiple, free-standing HDF5 files at given time points
Recently the option has arisen for us to receive the input data in compressed, fixed-size chunks. This is perfect for the direct chunk write API, which gives us a very welcome and significant performance increase. However, as soon as we cross into the second file, the fixed-size chunks we receive are no longer necessarily aligned with the chunking of the datasets in that file. For example, at the end of the first file a dataset may only have space for, say, 100 more samples before the time for the next file is reached, but the chunks contain, say, 200 samples. We can unpack the chunk that spans the boundary and write the first 100 samples of that chunk to the first file. This gives us a valid edge chunk at the end of that first dataset. But the 100 samples that are left over are the wrong size for the first chunk in the next file.
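To make the boundary arithmetic concrete, here is a minimal sketch in plain Python using the hypothetical numbers from the example (200-sample chunks, room for only 100 more samples in the first file); the function name is ours, not part of any API:

```python
def split_boundary_chunk(chunk_len, space_left_in_file):
    """Split an incoming chunk of chunk_len samples at a file boundary.

    Returns (tail_len, head_len): tail_len samples complete the last
    (valid edge) chunk of file N; head_len samples are left over for
    file N+1, where they no longer match the nominal chunk size.
    """
    tail_len = min(chunk_len, space_left_in_file)
    head_len = chunk_len - tail_len
    return tail_len, head_len

# Example from the text: a 200-sample chunk arrives, but file N only
# has room for 100 more samples.
tail, head = split_boundary_chunk(200, 100)
# tail == 100: valid edge chunk at the end of file N
# head == 100: misaligned first "chunk" for file N+1
```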
We have considered repacking the first chunk for the second file with prepended padding, writing the number of padding samples to an attribute and then ignoring the padding whenever we read. But this seems clunky and error-prone and does not work well with tools like HDFView that can’t know that those initial samples are padding.
Is it possible in some way to create a valid edge chunk at the beginning of a data set? If not, would it be difficult to implement and is this a patch anyone might be interested in?
Or is there anything else we can do that does not involve unpacking all the input chunks and repacking them into slightly different chunks?
This is a little confusing. I think what you mean is that you receive compressed chunks with a fixed count of samples per chunk (per dataset). In other words, you mean nominal chunk size (in terms of elements) rather than chunk size in bytes, right?
If I understand you correctly, then:

1. The next chunk contains a mixture of samples from the current and next datasets.
2. The nominal chunk sizes of the current and next datasets are potentially different.
3. The nominal chunk size of the compressed chunk you’ve received is the nominal chunk size of the next dataset.
If that’s the case, why not create the datasets aligned to the nominal chunk size during acquisition, and then create a set of virtual datasets (VDS) with the corrected extents? A VDS is just a snippet of metadata that appears like a “real” dataset and declares where (in which physical datasets) the elements for that time point begin and end. Does that make sense?
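To illustrate, here is a minimal sketch using h5py (file and dataset names, sizes, and the 50-sample padding are all hypothetical): a virtual dataset presenting a cropped view of a physical dataset, hiding leading samples that don’t belong to its time range.

```python
import h5py
import numpy as np

# Hypothetical source file: a "signal" dataset of 300 samples whose
# first 50 samples are spill-over from the previous time range.
with h5py.File("part1.h5", "w") as f:
    f.create_dataset("signal", data=np.arange(300, dtype="f4"))

# The VDS is pure metadata: it maps a 250-sample virtual extent onto
# samples 50..299 of the physical dataset, cropping the spill-over.
layout = h5py.VirtualLayout(shape=(250,), dtype="f4")
src = h5py.VirtualSource("part1.h5", "signal", shape=(300,))
layout[0:250] = src[50:300]

with h5py.File("view.h5", "w") as f:
    f.create_virtual_dataset("signal", layout)
```

Tools like HDFView would then see the virtual "signal" as an ordinary 250-element dataset, with no padding convention to know about.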
Many thanks for your reply. Apologies for the confusing explanation.
Yes, that’s right.
Correct. I have a chunk - part of which belongs at the end of a dataset in the current file and part of which belongs at the beginning of an equivalent dataset (same name, type, nominal chunk size etc.) in the next file.
No, the nominal chunk size remains the same for all of these equivalent datasets across the set of files.
Yes, the actual number of samples per chunk received is constant and equal to the nominal chunk size in the corresponding dataset in each file.
The problem occurs at the boundary between files where there is generally a chunk that spans the two files. That is, the time that denotes the split between the two files is somewhere in the middle of that chunk. Our idea was to decompress that chunk and write part of it to the dataset in file N, and the remainder to the equivalent dataset in file N+1. But that remainder no longer corresponds to the nominal chunk size of the dataset. The subsequent chunks would have the right number of elements again - it’s just the first one in the new file, the one that “spills over” from the previous file, that has the wrong size.
Nonetheless, if I understand your suggestion correctly, I think the use of virtual datasets should work in this scenario, too. I would do a direct chunk write for the chunks I am receiving to a “real” dataset. Then for the chunk that spans the boundary between two files, I would write that chunk in its entirety to both files. This would add superfluous samples to the end of one file and the beginning of the next. I could then use a virtual dataset to effectively crop the superfluous samples. Have I understood correctly?
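For the writing side, a sketch of the duplicated-chunk idea with h5py’s direct chunk write (names and sizes hypothetical; the “received” compressed chunk is mimicked here with zlib, which produces the same stream format as HDF5’s gzip filter):

```python
import zlib
import h5py
import numpy as np

CHUNK = 200  # hypothetical nominal chunk size in samples

# Mimic an already-compressed, fixed-size incoming chunk.
samples = np.arange(CHUNK, dtype="f4")
compressed = zlib.compress(samples.tobytes())

# Write the same compressed chunk verbatim to the end of file N and the
# start of file N+1, without ever decompressing it.  (Fixed-shape
# datasets here for brevity; the real ones would be resizable.)
for fname, offset in [("fileN.h5", (400,)), ("fileN1.h5", (0,))]:
    with h5py.File(fname, "w") as f:
        dset = f.create_dataset("signal", shape=(600,), dtype="f4",
                                chunks=(CHUNK,), compression="gzip")
        # filter_mask=0: the chunk has passed the full filter pipeline
        dset.id.write_direct_chunk(offset, compressed, filter_mask=0)
```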
Got it. Yes, VDS should work in this case as well, including the trimming scenario you’ve described. You could write the chunk spanning the boundary between the files to both, but that’s not strictly necessary, because the source selections that make up a virtual dataset can come from datasets in different files. If the overhead of writing the chunk twice is negligible, I’d go with whichever option is logically simpler in your code. With a little duplication, the files are also more self-contained. Unless there’s a coherence problem - i.e., you plan on updating the elements in that halo - duplication might be the simpler option; with the non-duplication approach, a coherence problem can’t arise in the first place. Read performance should be similar in both cases, unless your chunks are gigantic and you end up with a pattern where you always read the full boundary chunk but use only a small fraction of its data.
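The non-duplicated variant can be sketched the same way (again with h5py and hypothetical names and sizes): the boundary chunk is stored only once, past the nominal boundary of the first file, and the second file’s virtual dataset stitches its first samples from there.

```python
import h5py
import numpy as np

# Hypothetical physical layout: fileA's "signal" runs 100 samples past
# the nominal file boundary; fileB starts at the next whole chunk.
with h5py.File("fileA.h5", "w") as f:
    f.create_dataset("signal", data=np.arange(300, dtype="f4"))
with h5py.File("fileB.h5", "w") as f:
    f.create_dataset("signal", data=np.arange(300, 500, dtype="f4"))

# File B's logical 300-sample extent: the first 100 samples come from
# the overrun at the end of fileA, the rest from fileB itself.
layout = h5py.VirtualLayout(shape=(300,), dtype="f4")
layout[0:100] = h5py.VirtualSource("fileA.h5", "signal", shape=(300,))[200:300]
layout[100:300] = h5py.VirtualSource("fileB.h5", "signal", shape=(200,))[0:200]

with h5py.File("fileB_view.h5", "w") as f:
    f.create_virtual_dataset("signal", layout)
```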
That’s extremely helpful. Thank you. My gut feeling is that duplication is probably easier for us, but we will evaluate both possibilities before committing. It’s great that virtual datasets provide an elegant solution to this problem. We really didn’t have them on our radar when considering our options.