HDF5 chunk size not a divisor of the volume size


I’m trying to find documentation on HDF5’s behavior when the specified chunk size is not a divisor of the volume size, but cannot find any mention of it. For example, what happens when the chunk size is specified as 11x11x11 on a volume of 20x20x20? What is the expected behavior here?


Hi Sam,
Does this document help?


Hi @bmribler , I did look into that documentation, and the only place that mentioned this is a single sentence: “the chunk size should be set to the selection size, or an integer divisor of it. This recommendation is subject to the guidelines in the pitfalls section; specifically, it should not be too small or too large.”

The language seems to suggest that chunk sizes are recommended to be an integer divisor of the full volume size, but does not require it. If so, how does HDF5 handle the case in my question, i.e., a chunk size of 11x11x11 on a volume of 20x20x20?

The current documentation is incomplete. Here is the default HDF5 behavior, which I determined in part by experimenting with HDF5 1.14.0.

For an uncompressed dataset, all chunks are physically stored at the same size on disk, even the so-called edge chunks that lie partially outside the logical extent of the array. Your 20x20x20 example would therefore be stored in 8 full-size 11x11x11 chunks, with some wasted disk space.

For a compressed dataset, all chunks are compressed on disk, including edge chunks. Each chunk’s stored size then depends on how well its data compresses, so the nominal chunk size no longer determines the on-disk size of any chunk, edge or otherwise.

HDF5 1.10 introduced a special option to disable compression for partial edge chunks, to improve performance when the same edge chunk might be overwritten repeatedly. Please see this archival document: Partial Edge Chunk Option
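As a hedged sketch of how that option is enabled (the chunk dimensions here are just the 11x11x11 example from this thread, and error checking is omitted for brevity):

```c
#include "hdf5.h"

/* Sketch: build a dataset creation property list that skips the filter
 * pipeline for partial edge chunks (available since HDF5 1.10). */
hid_t make_dcpl_skip_edge_filtering(void)
{
    hsize_t chunk_dims[3] = {11, 11, 11};

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk_dims);

    /* Do not apply filters (e.g. compression) to partial edge chunks. */
    H5Pset_chunk_opts(dcpl, H5D_CHUNK_DONT_FILTER_PARTIAL_CHUNKS);

    return dcpl;  /* caller passes this to H5Dcreate2() and closes it */
}
```

This is a configuration fragment only; the trade-off is faster repeated rewrites of edge chunks at the cost of storing them uncompressed.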

Also note that when pre-fill is not enabled, disk space for a chunk is not allocated until at least one value is written into that chunk. Take this into account when experimenting with storage strategies.


Hi @dave.allured , thanks for the explanation. I wasn’t aware of the distinction between “partial edge chunks” and regular chunks. Your explanation makes great sense.

As an HDF5 filter developer, I’m thinking about how to write a filter that applies to both normal and partial edge chunks. Currently, I query the chunk size from the dataset creation property list using H5Pget_chunk(), so on partial edge chunks my compressor would fail because the actual data has a smaller size.

One way to avoid the failure is to set the partial edge chunk option so the filter is not applied to them. Do you think it’s also possible to query the actual size of every chunk, partial or not, so that the compressor always receives the correct parameters?


The HDF5 library provides buffer size information every time a filter is called. Use this information, not the nominal chunk size from H5Pget_chunk(), for buffer management within your filter. See the BZIP2 plugin filter code for a good example of a buffer-management strategy.
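To illustrate where that size information arrives, here is a hedged sketch of a filter callback using the H5Z_func_t signature from H5Zpublic.h. The filter body is hypothetical; the point is that the `nbytes` argument tells you the actual size of the buffer HDF5 hands to the filter on each call, so no size should ever be derived from the dataset’s chunk dimensions:

```c
#include "hdf5.h"

/* Sketch of a filter callback (compression logic omitted).
 * `nbytes` is the number of valid bytes in the buffer passed in on
 * this particular call -- use it, not a size computed from
 * H5Pget_chunk(), when sizing input and output buffers. */
static size_t my_filter(unsigned int flags, size_t cd_nelmts,
                        const unsigned int cd_values[], size_t nbytes,
                        size_t *buf_size, void **buf)
{
    if (flags & H5Z_FLAG_REVERSE) {
        /* Decompression path: `nbytes` is the size of the stream
         * read from disk; decompress into a buffer you allocate,
         * swap it into *buf, and update *buf_size. */
    } else {
        /* Compression path: compress exactly the first `nbytes`
         * bytes of *buf. */
    }
    (void)cd_nelmts; (void)cd_values; (void)buf_size; (void)buf;
    return nbytes;  /* on success, return the number of valid output bytes;
                       return 0 to signal failure */
}
```

The callback is registered via an H5Z_class2_t structure passed to H5Zregister(); the BZIP2 plugin mentioned above shows the full pattern, including reallocating *buf when the output grows.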