We use HDF5 to store hundreds of very large one-dimensional data sets - for example very high frequency audio data, where each data set has its own frequency. These large data sets, whose final size is not known at the time of creation, are
- chunked and compressed
- split into multiple, free-standing HDF5 files at given time points
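For context, each data set is created roughly like the sketch below; the chunk size, deflate filter and names are placeholders, our real values and filter pipeline differ.

```c
#include "hdf5.h"

/* Placeholder chunk size; our real chunk size differs. */
#define CHUNK_SAMPLES 4096

/* Create one extendible, chunked, compressed 1-D data set. */
hid_t create_signal_dataset(hid_t file_id, const char *name)
{
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};   /* final size unknown at creation */
    hsize_t chunk[1]   = {CHUNK_SAMPLES};

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);                /* gzip here as a stand-in filter */

    hid_t dset = H5Dcreate2(file_id, name, H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```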
Recently the option has arisen for us to receive the input data in compressed, fixed-size chunks. This is perfect for the direct chunk write API, which gives us a very welcome and significant performance increase. However, as soon as we roll over into the second file, the fixed-size chunks we receive are no longer necessarily aligned with the chunking of the data sets in that file.

For example, at the end of the first file a data set may only have space for, say, 100 more samples before the time for the next file is reached, but the incoming chunks contain, say, 200 samples. We can unpack the chunk that spans the boundary and write the 100 samples that belong to the first file, which gives us a valid edge chunk at the end of that data set. But the 100 samples that are left over are the wrong size for the first chunk in the next file.
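The direct chunk write path we are using looks roughly like this sketch (assuming HDF5 >= 1.10.3 for H5Dwrite_chunk; names and the chunk size are again placeholders). The constraint that bites us is that the offset passed to H5Dwrite_chunk must lie on a chunk boundary of the data set.

```c
#include "hdf5.h"

#define CHUNK_SAMPLES 4096   /* same placeholder chunk size as above */

/* Append one already-compressed chunk via the direct chunk write API.
   comp_buf must have passed through the data set's filter pipeline. */
herr_t append_compressed_chunk(hid_t dset, hsize_t chunk_index,
                               const void *comp_buf, size_t comp_size)
{
    /* Offset in data set coordinates; must be a multiple of the chunk size,
       i.e. aligned with the data set's own chunking. */
    hsize_t offset[1]   = {chunk_index * CHUNK_SAMPLES};
    hsize_t new_size[1] = {offset[0] + CHUNK_SAMPLES};

    /* Grow the data set so the chunk falls inside its extent. */
    if (H5Dset_extent(dset, new_size) < 0)
        return -1;

    /* Filter mask 0: the buffer went through every filter in the pipeline. */
    return H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset, comp_size, comp_buf);
}
```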
We have considered repacking the first chunk of the second file with prepended padding, writing the number of padding samples to an attribute, and then ignoring the padding whenever we read. But this seems clunky and error-prone, and it does not work well with tools like HDFView, which cannot know that those initial samples are padding.
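A rough sketch of that workaround, with an attribute name of our own invention, would be something like:

```c
#include "hdf5.h"

/* Record how many leading samples of the data set are padding so that our
   readers can skip them. The attribute name is our own convention only. */
herr_t record_leading_padding(hid_t dset, unsigned padding_samples)
{
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t attr  = H5Acreate2(dset, "leading_padding_samples",
                             H5T_NATIVE_UINT, space, H5P_DEFAULT, H5P_DEFAULT);
    herr_t status = H5Awrite(attr, H5T_NATIVE_UINT, &padding_samples);
    H5Aclose(attr);
    H5Sclose(space);
    return status;
}
```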
Is it possible in some way to create a valid edge chunk at the beginning of a data set? If not, would it be difficult to implement and is this a patch anyone might be interested in?
Or is there anything else we can do that does not involve unpacking all the input chunks and repacking them into slightly different chunks?
Thanks in advance.