Retroactive Cloud Optimization Using h5repack

Based on the lecture given by Aleksandar Jelenak, I believe there are four main things that can be done to reformat HDF5 files in order to speed up cloud access:

  1. Switch to a PAGE layout and increase the page size (to an optimal size)
  2. Increase the page cache size (to an optimal size)
  3. Increase the chunk size (to an optimal layout)
  4. Increase the chunk cache size (to an optimal size)

I would like to use the h5repack utility to perform these tasks for previously created H5 datasets, but I’m having a tough time determining the syntax and capabilities of h5repack.

Using a command with this format:

h5repack -S PAGE -G 9999 -l /subpath/to/dataset:CHUNK=9999x9999 input.h5 output.h5

I can accomplish tasks 1 and 3, but I can’t figure out how to increase the sizes of the caches (tasks 2 and 4). Can h5repack be used to perform tasks 2/4 and if so, what is the syntax?

Hi @ffwilliams2,

Thanks for watching my video. Your four-point list is a correct summary when one wants to get the most out of the current libhdf5 for accessing HDF5 files in cloud object stores.

Items #2 and #4 on the list are libhdf5 runtime settings and h5repack cannot be used for that. I’ll give examples for HDF5 API and h5py.

To set dataset chunk cache:
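
A minimal h5py sketch, assuming the `rdcc_nbytes`, `rdcc_nslots`, and `rdcc_w0` keyword arguments of `h5py.File`. The function name and the default values are illustrative only, not recommendations. In the C API the corresponding calls are `H5Pset_cache` on the file access property list (file-wide default) and `H5Pset_chunk_cache` on a dataset access property list (per dataset).

```python
import h5py

MiB = 1024**2


def open_with_chunk_cache(path, cache_bytes=64 * MiB, nslots=10007, w0=0.75):
    """Open an HDF5 file read-only with a larger default chunk cache.

    The settings belong to this open call only; nothing is written to the file.
    Each dataset opened through the returned file gets its own cache of this size.
    """
    return h5py.File(
        path,
        mode="r",
        rdcc_nbytes=cache_bytes,  # chunk cache size in bytes, per open dataset
        rdcc_nslots=nslots,       # number of hash table slots (illustrative prime)
        rdcc_w0=w0,               # preemption policy, between 0 and 1
    )
```

Because the cache is allocated per open dataset, total memory use grows with the number of datasets that are open at the same time.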

To set page buffer size:


Thank you for the information @ajelenak! I’d prefer to use h5py if setting buffer/cache sizes is not possible via h5repack.

Do I understand correctly that the rdcc_nbytes and page_buf_size properties are not inherent properties of the HDF5 file on disk, but are instead read configuration settings that are only specified when reading the data? Would this workflow be appropriate:

  1. Set the PAGE size and dataset chunking using h5repack:
h5repack -S PAGE -G 9999 -l /subpath/to/dataset:CHUNK=9999x9999 input.h5 output.h5
  2. When reading the HDF5 file, set the buffer/cache configuration settings:
import h5py

with h5py.File('output.h5', mode='r', page_buf_size=2**21, rdcc_nbytes=2**26) as file:
    dataset = file['subpath']['to']['dataset']
    desired_data = dataset[0:100, 0:100]

Yes, you are correct. The dataset chunk caches and the page buffer are in-memory caches that libhdf5 maintains at runtime; they are not stored in the HDF5 file.

I assume the -G 9999 is just a placeholder file page size value. Note that you specify file page size in bytes but new dataset chunks by their shape. This means the dataset chunk size in bytes will be a product of the total number of HDF5 dataset elements in the chunk and that dataset’s datatype size in bytes.
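
To make the arithmetic concrete, here is a small sketch that computes a chunk's size in bytes from its shape and element size. The function name and the example shape are hypothetical, chosen only for illustration.

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Return the uncompressed size of one chunk in bytes.

    chunk_shape: number of elements along each dimension, e.g. (1024, 1024)
    itemsize:    size of one element in bytes, e.g. 4 for a 32-bit float
    """
    return math.prod(chunk_shape) * itemsize


# A 1024 x 1024 chunk of 4-byte floats is 4 MiB.
size = chunk_nbytes((1024, 1024), 4)
print(size, size / 1024**2)  # 4194304 4.0
```

By the same arithmetic, a literal CHUNK=9999x9999 with 8-byte elements would be roughly 800 MB per chunk, far larger than a typical file page.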

Typically you want something like 8-16 MiB for a page size so it can hold 4-8 dataset chunks. The page buffer cache should be large enough to hold all the internal file metadata pages plus at least several data pages. It is best to decide on the new, larger chunk sizes (in bytes) first and then work out the page and cache sizes. One-size-fits-all may not be the best approach: tuning dataset chunk sizes and their caches per dataset may be needed to avoid large memory use.
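
The sizing order described above (chunk size first, then page size, then page buffer) can be sketched as follows. This is only a rough heuristic under stated assumptions: the function names, the default counts, and the rounding up to a power of two are my own choices for illustration, and the number of metadata pages has to be estimated for each file.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    return 1 << (n - 1).bit_length()


def suggest_sizes(chunk_nbytes, chunks_per_page=4, metadata_pages=4, data_pages=4):
    """Derive a file page size and a page buffer size from a chunk size.

    chunk_nbytes:    size of one dataset chunk in bytes (decide this first)
    chunks_per_page: how many chunks one page should hold (4-8 per the advice above)
    metadata_pages:  assumed number of pages of internal file metadata
    data_pages:      number of raw data pages to keep buffered
    """
    page_size = next_power_of_two(chunk_nbytes * chunks_per_page)
    page_buf_size = next_power_of_two(page_size * (metadata_pages + data_pages))
    return page_size, page_buf_size


# 2 MiB chunks, 4 chunks per page, 4 metadata pages plus 4 data pages
page_size, page_buf_size = suggest_sizes(2 * 1024**2)
print(page_size // 1024**2, page_buf_size // 1024**2)  # 8 64
```

The resulting page size would be passed to h5repack via -G when repacking, and the page buffer size would be supplied when the file is opened for reading.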