Dataset Not a Multiple of Chunk Size: How is it Handled?

drozdowski.chris · June 10, 2019, 3:16am

I am working on an application that writes a double type dataset with rank of 1 to a file. That dataset can contain an arbitrary number of values that can vary each time it is written.

I want to keep the cache size at no more than 1 MB (1024 x 1024 bytes). Since a double is 8 bytes, that means each chunk can hold up to 131,072 values ((1024 x 1024) / 8) before compression, unless there is some overhead I’m not taking into account.
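
As a quick check of the arithmetic above, here is a minimal sketch. It assumes 8-byte doubles and a 1 MB (1024 x 1024 bytes) budget; the function name is hypothetical and not part of any HDF5 API, and any per-chunk overhead is ignored.

```python
def max_elements_per_chunk(budget_bytes: int, element_size: int) -> int:
    """Largest whole number of elements that fits in the byte budget."""
    return budget_bytes // element_size


CACHE_BUDGET = 1024 * 1024  # 1 MB budget, as stated in the post
DOUBLE_SIZE = 8             # assumed size of a double in bytes

print(max_elements_per_chunk(CACHE_BUDGET, DOUBLE_SIZE))  # 131072
```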

Given that information, if the dataset actually contains more than 131,072 values but fewer than 262,144 values (not quite a full second chunk), how are the values that exceed the first chunk handled? Are they in a chunk? Are they compressed?

Thanks in advance for the input!

steven · June 10, 2019, 3:49am

Hi Chris,

When using H5CPP, the h5::append operator does the right thing. The block size is the number of elements, multiplied across all dimensions, times sizeof(datatype).

‘Unless there is some overhead I’m not taking into account’: keep in mind that some compression algorithms don’t guarantee that the ‘compressed block’ will be smaller than the input. Indeed, in the C API pipeline such a chunk is flagged as uncompressed and the compressed (but larger) block is discarded.
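
The "keep the compressed block only if it is smaller" idea can be illustrated with a minimal sketch using Python's standard zlib module. This is an illustration of the principle only, not the HDF5 filter pipeline itself; the function name and the returned flag are hypothetical.

```python
import zlib
from typing import Tuple


def filter_chunk(raw: bytes) -> Tuple[bool, bytes]:
    """Return (is_compressed, payload) for one chunk of raw bytes.

    The compressed form is kept only when it is strictly smaller than
    the input; otherwise the raw bytes are stored and flagged as
    uncompressed.
    """
    packed = zlib.compress(raw)
    if len(packed) < len(raw):
        return True, packed
    return False, raw


# Highly repetitive data compresses well, so the compressed form is kept.
is_compressed, payload = filter_chunk(bytes(4096))
print(is_compressed, len(payload) < 4096)

# A tiny input grows under zlib (header and checksum overhead), so the
# raw bytes are stored instead.
is_compressed, payload = filter_chunk(b"x")
print(is_compressed, payload)
```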

A chunk is a chunk… once it is compressed you will most likely end up with a variable-length block. If you implement your own pipeline and want to handle input greater than the chunk size, then you have to implement your own blocking. For example, H5CPP uses the same cache-aware blocking mechanism used in BLAS.
Keep in mind that the edges are finicky: you have to supply your own fill values on the fringes, which means you are losing bandwidth when chunks are not fully utilised.

Did it help?

gheber · June 10, 2019, 12:46pm

In your example, a full second chunk will be allocated for the elements that don’t fit in the first chunk. (This is a so-called edge chunk.) Remember that ‘chunk’ is a storage layout concept and not a dataset-logical level primitive. The dataset extent (“current dimensions”) is what the application sees and the library will “do the right thing” as far as storage management is concerned. If you are using compression, and compression is effective, the “excess elements” of the edge chunk won’t hurt you as far as space is concerned. See https://portal.hdfgroup.org/display/HDF5/H5P_SET_CHUNK_OPTS for how you can control the treatment of edge chunks as far as filtering (compression) is concerned.
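
To make the edge-chunk bookkeeping concrete, here is a minimal sketch for a rank-1 dataset. It is plain arithmetic, not a call into the HDF5 library; the function name and the example length of 200,000 elements are assumptions chosen for illustration.

```python
from typing import Tuple


def chunk_layout(n_elements: int, chunk_elements: int) -> Tuple[int, int, int]:
    """Return (n_chunks, used_in_last_chunk, unused_slots) for a 1-D dataset.

    Every chunk, including the edge chunk, is allocated at full size,
    so unused_slots counts the positions in the last chunk that lie
    beyond the dataset extent.
    """
    n_chunks = -(-n_elements // chunk_elements)  # ceiling division
    if n_chunks == 0:
        return 0, 0, 0
    used_in_last = n_elements - (n_chunks - 1) * chunk_elements
    unused_slots = n_chunks * chunk_elements - n_elements
    return n_chunks, used_in_last, unused_slots


# 200,000 doubles with 131,072-element chunks: one full chunk plus an
# edge chunk that is only partly covered by the dataset extent.
print(chunk_layout(200_000, 131_072))  # (2, 68928, 62144)
```

The application only ever sees the 200,000 elements of the dataset extent; the unused slots exist only in the storage layout.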

drozdowski.chris · June 10, 2019, 3:14pm

Thank you both Steven and Gerd,

It is clear to me now that the edge data goes into a chunk and that chunk is filled with specified or default fill values. It seems obvious when one reflects upon it. I suspect I was over-thinking it and got confused.
