The h5py.File argument page_buf_size is confusing me, and since there’s a lot of enthusiasm for cloud-optimization of HDF5 files, I thought asking for a brief public explainer would be useful.
The docs say the value “must be a power of two value and greater or equal than the file space page size when creating the file”. And yet … if I violate that guidance, I can see that page buffering is enabled.
Here’s the h5stat -S on a test file:
Filename: s3://nasa-cryo-scratch/itcarroll/cloud-PACE/PACE_OCI_L2_AOP/8388608/G4194150056-OB_CLOUD
File space management strategy: H5F_FSPACE_STRATEGY_PAGE
File space page size: 8388608 bytes
Summary of file space information:
File metadata: 320158 bytes
Raw data: 18571705 bytes
Amount/Percent of tracked free space: 14658473 bytes/43.7%
Unaccounted space: 4096 bytes
Total space: 33554432 bytes
Note the page size is 2**23. When I track the fsspec logs, I see this nicely reflected!
xarray.open_dataset(..., engine="h5netcdf", driver_kwds={"page_buf_size": 2}, open_kwargs={"cache_type": "none"})
DEBUG:fsspec:... read: 0 - 8
DEBUG:fsspec:... read: 0 - 8
DEBUG:fsspec:... read: 0 - 48
DEBUG:fsspec:... read: 48 - 560
DEBUG:fsspec:... read: 0 - 8388608
DEBUG:fsspec:... read: 16777216 - 25165824
DEBUG:fsspec:... read: 0 - 8388608
But did you notice that I’m setting page_buf_size=2? How is the page buffer set when you don’t follow the guidance? Since I generally don’t know the page size ahead of time, is it good enough to set the page_buf_size to 1?
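To make the alignment in the log explicit, here is a small sketch that maps each logged byte range to a file page index. The page size comes from the h5stat output and the ranges are copied from the fsspec log; the `classify` helper is a made-up name for illustration only.

```python
PAGE_SIZE = 2**23  # 8,388,608 bytes, the file space page size reported by h5stat

# (start, end) byte ranges copied from the fsspec debug log
reads = [(0, 8), (0, 8), (0, 48), (48, 560),
         (0, 8388608), (16777216, 25165824), (0, 8388608)]


def classify(start, end, page_size=PAGE_SIZE):
    """Return the page index if the range is exactly one aligned page, else None."""
    if start % page_size == 0 and end - start == page_size:
        return start // page_size
    return None


pages = [classify(start, end) for start, end in reads]
for (start, end), page in zip(reads, pages):
    label = f"whole page {page}" if page is not None else "sub-page read"
    print(f"{start:>10} - {end:<10} {label}")
```

The last three reads are whole pages 0, 2, and 0 of the four pages in the 33,554,432-byte file; the first four are smaller than a page.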
Version info:
- hdf5 2.1.0
- h5py 3.16.0
- xarray 2026.4.0
- fsspec 2026.4.0
- s3fs 2026.4.0
I am not sure that page buffering is enabled, because the log output shows the first file page being read twice (read: 0 - 8388608).
You can check what the page buffer size is with something like this:
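import xarray as xr

ds = xr.open_dataset(..., engine="h5netcdf",
                     driver_kwds={"page_buf_size": 2},
                     open_kwargs={"cache_type": "none"})
# Reach the underlying h5py.File; this relies on private attributes
# that may differ between xarray/h5netcdf versions.
h5py_file = ds._file_obj.ds._h5file
fapl = h5py_file.id.get_access_plist()
# get_page_buffer_size() returns a tuple; the first element is the buffer size in bytes.
page_buf_size = fapl.get_page_buffer_size()[0]
print(f"Current Page Buffer Cache Size: {page_buf_size} bytes")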
The HDF5 library version used here comes with a default page buffer size of 64 MiB, so it would be interesting to verify whether that also works in the fsspec case. It definitely works for the ros3 driver (driver_kwds = {..., "driver": "ros3"}).
Actually, what may be happening is that the library chose a page buffer size equal to one file page. This would also explain why the first file page was read twice: in between, another file page was read and kicked the first page out of the buffer.
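The eviction argument can be illustrated with a toy simulation. This is only a sketch, not the library's implementation: it assumes least-recently-used eviction, and `count_fetches` is a hypothetical helper. The access pattern (pages 0, 2, 0) is taken from the whole-page reads in the log.

```python
from collections import OrderedDict


def count_fetches(page_accesses, capacity_pages):
    """Count backing-store fetches for an LRU buffer holding `capacity_pages` pages."""
    buffer = OrderedDict()  # page index -> placeholder, ordered by recency
    fetches = 0
    for page in page_accesses:
        if page in buffer:
            buffer.move_to_end(page)  # hit: mark as most recently used
            continue
        fetches += 1  # miss: the page must be read from storage
        buffer[page] = True
        if len(buffer) > capacity_pages:
            buffer.popitem(last=False)  # evict the least recently used page
    return fetches


accesses = [0, 2, 0]  # whole-page reads seen in the fsspec log
one_page = count_fetches(accesses, capacity_pages=1)
two_pages = count_fetches(accesses, capacity_pages=2)
print(one_page, two_pages)
```

With room for a single page, every access in this pattern is a miss, which matches the three page-sized reads in the log. With room for two pages, the second access to page 0 would be served from the buffer.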
Below are results of my simple tests to explore the library’s page buffer size logic when using the ROS3 driver. Your case is equivalent to the “1 page - 10 bytes” case in the table. The actual page buffer was still one file page although the setting (“Requested”) was less than that.
| File | Paged? | File page size (bytes) | Case | Requested (bytes) | Actual (bytes) | Notes |
|---|---|---|---|---|---|---|
| typical | no | 4,096 | Default | 67,108,864 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 1 page - 10 bytes | 8,388,598 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 1 page + 10 bytes | 8,388,618 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 2 pages - 10 bytes | 16,777,206 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 2 pages + 10 bytes | 16,777,226 | 0 | file not paged → buffering inactive |
| cloud optimized | yes | 8,388,608 | Default | 67,108,864 | 67,108,864 | page buffering active |
| cloud optimized | yes | 8,388,608 | 1 page - 10 bytes | 8,388,598 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 1 page + 10 bytes | 8,388,618 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 2 pages - 10 bytes | 16,777,206 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 2 pages + 10 bytes | 16,777,226 | 16,777,216 | page buffering active |
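The rows above are consistent with a simple rule: for a paged file, round the requested size down to a whole number of file pages, but never below one page; for a non-paged file, the buffer is inactive. The sketch below encodes that rule as inferred from these test results only. It is not taken from the HDF5 source or documentation, and `effective_page_buffer_size` is a hypothetical name.

```python
def effective_page_buffer_size(requested, file_page_size, paged):
    """Rule inferred from the ROS3 test table, not from the HDF5 source."""
    if not paged:
        return 0  # file not paged: buffering inactive
    whole_pages = requested // file_page_size  # round down to whole pages
    return max(1, whole_pages) * file_page_size  # keep at least one page


PAGE = 8_388_608
for requested in (67_108_864, PAGE - 10, PAGE + 10, 2 * PAGE - 10, 2 * PAGE + 10, 2):
    actual = effective_page_buffer_size(requested, PAGE, paged=True)
    print(f"requested {requested:>12,} -> actual {actual:>12,}")
```

If the rule holds, page_buf_size=2 (or 1) on this file would yield a one-page buffer of 8,388,608 bytes, which can be confirmed with the get_page_buffer_size() check shown earlier.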