Explainer on outcome of "page_buf_size" value

The h5py.File argument page_buf_size is confusing me, and since there’s a lot of enthusiasm for cloud-optimization of HDF5 files, I thought asking for a brief public explainer would be useful.

The docs say the value “must be a power of two value and greater or equal than the file space page size when creating the file”. And yet … if I violate that guidance, I can see that page buffering is enabled.
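
To make the quoted rule concrete, here is a minimal sketch of a check for it. The helper name `follows_guidance` is hypothetical and not part of h5py; it only encodes the two conditions in the quoted sentence, and the page size of 2**23 bytes is used purely as an example value.

```python
def follows_guidance(page_buf_size: int, fs_page_size: int) -> bool:
    """Return True if page_buf_size is a power of two and >= fs_page_size.

    Hypothetical helper: it restates the documented rule and is not
    h5py's own validation logic.
    """
    is_power_of_two = page_buf_size > 0 and (page_buf_size & (page_buf_size - 1)) == 0
    return is_power_of_two and page_buf_size >= fs_page_size


PAGE = 2**23  # 8388608 bytes, an example file space page size

print(follows_guidance(2, PAGE))          # False: power of two, but smaller than one page
print(follows_guidance(PAGE, PAGE))       # True: exactly one page
print(follows_guidance(PAGE + 10, PAGE))  # False: large enough, but not a power of two
```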

Here’s the h5stat -S on a test file:

Filename: s3://nasa-cryo-scratch/itcarroll/cloud-PACE/PACE_OCI_L2_AOP/8388608/G4194150056-OB_CLOUD
File space management strategy: H5F_FSPACE_STRATEGY_PAGE
File space page size: 8388608 bytes
Summary of file space information:
  File metadata: 320158 bytes
  Raw data: 18571705 bytes
  Amount/Percent of tracked free space: 14658473 bytes/43.7%
  Unaccounted space: 4096 bytes
Total space: 33554432 bytes

Note the page size is 2**23. When I track the fsspec logs, I see this nicely reflected!

xarray.open_dataset(..., engine="h5netcdf", driver_kwds={"page_buf_size": 2}, open_kwargs={"cache_type": "none"})
DEBUG:fsspec:... read: 0 - 8 
DEBUG:fsspec:... read: 0 - 8 
DEBUG:fsspec:... read: 0 - 48 
DEBUG:fsspec:... read: 48 - 560 
DEBUG:fsspec:... read: 0 - 8388608 
DEBUG:fsspec:... read: 16777216 - 25165824 
DEBUG:fsspec:... read: 0 - 8388608
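
As a quick check on the log above, the read ranges can be tested for page alignment. This is a minimal sketch: `page_reads` is a hypothetical helper, the log text is copied from the lines above, and the regular expression assumes the `read: start - end` layout shown there.

```python
import re

LOG = """\
DEBUG:fsspec:... read: 0 - 8
DEBUG:fsspec:... read: 0 - 8
DEBUG:fsspec:... read: 0 - 48
DEBUG:fsspec:... read: 48 - 560
DEBUG:fsspec:... read: 0 - 8388608
DEBUG:fsspec:... read: 16777216 - 25165824
DEBUG:fsspec:... read: 0 - 8388608
"""

PAGE = 8388608  # file space page size reported by h5stat -S


def page_reads(log: str, page_size: int) -> list[tuple[int, int, bool]]:
    """Return (start, end, is_whole_page) for every 'read: start - end' entry."""
    result = []
    for match in re.finditer(r"read: (\d+) - (\d+)", log):
        start, end = int(match.group(1)), int(match.group(2))
        # A whole-page read starts on a page boundary and spans exactly one page.
        whole = start % page_size == 0 and end - start == page_size
        result.append((start, end, whole))
    return result


for start, end, whole in page_reads(LOG, PAGE):
    label = f"page {start // PAGE}" if whole else "small read"
    print(f"{start:>10} - {end:<10} {label}")
```

The four small reads at the start are not page-sized; the last three are whole pages with indices 0, 2, and 0.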

But did you notice that I’m setting page_buf_size=2? How is the page buffer set when you don’t follow the guidance? Since I generally don’t know the page size ahead of time, is it good enough to set the page_buf_size to 1?

Version info:

  • hdf5 2.1.0
  • h5py 3.16.0
  • xarray 2026.4.0
  • fsspec 2026.4.0
  • s3fs 2026.4.0

I am not sure that page buffering is actually enabled, because the log output shows the first file page being read twice (read: 0 - 8388608).

You can check what the page buffer size is with something like this:

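import xarray as xr

ds = xr.open_dataset(..., engine="h5netcdf",
                     driver_kwds={"page_buf_size": 2},
                     open_kwargs={"cache_type": "none"})
# Reach the underlying h5py.File through private attributes (these may change between versions)
h5py_file = ds._file_obj.ds._h5file
fapl = h5py_file.id.get_access_plist()
# get_page_buffer_size() returns a tuple; its first element is the buffer size in bytes
page_buf_size = fapl.get_page_buffer_size()[0]

print(f"Current Page Buffer Cache Size: {page_buf_size} bytes")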

The HDF5 library version used here has a default page buffer size of 64 MiB, so it would be interesting to verify whether that default also applies in the fsspec case. It definitely works for the ros3 driver (driver_kwds = {..., "driver": "ros3"}).

Actually, what may be happening is that the library chose the page buffer size equal to one file page. This would also explain why the first file page was read twice, because in between there was another file page that kicked the first page out of the buffer.
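
The eviction argument can be illustrated with a toy simulation. This is a sketch under a stated assumption: it models the page buffer as a least-recently-used cache, which is a simplification chosen for illustration and not a description of the library's actual eviction policy. The access pattern 0, 2, 0 corresponds to the whole-page reads at offsets 0, 16777216, and 0 in the log; `count_fetches` is a hypothetical name.

```python
from collections import OrderedDict


def count_fetches(page_accesses, capacity_pages):
    """Count how many pages must be fetched from storage.

    Assumes least-recently-used eviction (an illustrative simplification).
    """
    buffer = OrderedDict()
    fetches = 0
    for page in page_accesses:
        if page in buffer:
            buffer.move_to_end(page)  # cache hit: mark as most recently used
            continue
        fetches += 1                  # cache miss: fetch the page
        buffer[page] = True
        if len(buffer) > capacity_pages:
            buffer.popitem(last=False)  # evict the least recently used page
    return fetches


accesses = [0, 2, 0]  # page indices taken from the fsspec log
print(count_fetches(accesses, capacity_pages=1))  # 3: page 0 is evicted, then fetched again
print(count_fetches(accesses, capacity_pages=8))  # 2: a 64 MiB buffer holds eight 8 MiB pages
```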

Below are results of my simple tests to explore the library’s page buffer size logic when using the ROS3 driver. Your case is equivalent to the “1 page - 10 bytes” case in the table. The actual page buffer was still one file page although the setting (“Requested”) was less than that.

| File | Paged? | File page size (bytes) | Case | Requested (bytes) | Actual (bytes) | Notes |
|---|---|---|---|---|---|---|
| typical | no | 4,096 | Default | 67,108,864 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 1 page - 10 bytes | 8,388,598 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 1 page + 10 bytes | 8,388,618 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 2 pages - 10 bytes | 16,777,206 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 2 pages + 10 bytes | 16,777,226 | 0 | file not paged → buffering inactive |
| cloud optimized | yes | 8,388,608 | Default | 67,108,864 | 67,108,864 | page buffering active |
| cloud optimized | yes | 8,388,608 | 1 page - 10 bytes | 8,388,598 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 1 page + 10 bytes | 8,388,618 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 2 pages - 10 bytes | 16,777,206 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 2 pages + 10 bytes | 16,777,226 | 16,777,216 | page buffering active |
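
One rule that reproduces every row of the table above is: zero for an unpaged file, otherwise the request rounded down to a whole number of pages, with a minimum of one page. The sketch below encodes that rule. It is an empirical fit to these ten observations, not the library's documented algorithm, and `observed_buffer_size` is a hypothetical name.

```python
def observed_buffer_size(requested: int, fs_page_size: int, paged: bool) -> int:
    """Empirical model of the 'Actual' column in the table above.

    Assumption: round the request down to whole pages, but never below one page.
    This fits the measurements; it is not taken from the HDF5 source.
    """
    if not paged:
        return 0  # file not paged: buffering inactive
    whole_pages = max(1, requested // fs_page_size)
    return whole_pages * fs_page_size


PAGE = 8_388_608
for requested in (67_108_864, PAGE - 10, PAGE + 10, 2 * PAGE - 10, 2 * PAGE + 10):
    print(f"{requested:>12,} -> {observed_buffer_size(requested, PAGE, paged=True):>12,}")
```

If the same rule also holds for very small requests, page_buf_size=2 would give a buffer of exactly one page, which matches the behavior seen in the fsspec log. That is an extrapolation from the table rather than something measured here.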

Thank you, Aleksandar.

A tip for you and any interested party using a more recent xarray version (since 2026.4.0 at least). With ds an xarray.Dataset, the ds._file_obj attribute referenced above does not exist, but you can still get the _h5file object via ds._close.__self__.ds._root._h5file.
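
The two attribute paths mentioned in this thread can be combined in a small fallback helper. This is a sketch only: both paths go through private xarray and h5netcdf attributes that may change without notice, and `find_h5py_file` is a hypothetical name, not an xarray API.

```python
def find_h5py_file(ds):
    """Return the underlying h5py file object of an opened dataset.

    Tries the older private path first, then the newer one. Both are
    private attributes quoted in this thread, so treat this as fragile.
    """
    candidates = (
        lambda d: d._file_obj.ds._h5file,               # older xarray versions
        lambda d: d._close.__self__.ds._root._h5file,   # xarray 2026.4.0 (at least)
    )
    for get in candidates:
        try:
            return get(ds)
        except AttributeError:
            continue  # this path does not exist in the installed version
    raise AttributeError("no known private path to the h5py file object")


# Usage, with ds an xarray.Dataset opened through the h5netcdf engine:
#   fapl = find_h5py_file(ds).id.get_access_plist()
#   print(fapl.get_page_buffer_size()[0])
```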

I can confirm that with fsspec and the fileobj driver, the library chooses a page buffer size that is a multiple of the page size. It’s much the same behavior you show using the ROS3 driver. A key difference between the drivers’ approaches to paged files currently seems to be:

  • fileobj driver does not read using pages at all, unless a positive page_buf_size is set
  • ros3 driver reads using pages, and defaults to a 64 MiB page buffer

In going through this exercise, I’ve realized that the most important thing to highlight might be this: when given a path with the s3:// prefix or a file-like object, the xarray.open_dataset backend will not by default read paged files using pages (much less use a large page buffer). I think this is not a desirable status quo! If I track down an existing issue, or raise one (not sure where yet), I will post it here.