The h5py.File argument page_buf_size is confusing me, and since there’s a lot of enthusiasm for cloud-optimization of HDF5 files, I thought asking for a brief public explainer would be useful.
The docs say the value “must be a power of two value and greater or equal than the file space page size when creating the file”. And yet … if I violate that guidance, I can see that page buffering is enabled.
Here’s the h5stat -S on a test file:
Filename: s3://nasa-cryo-scratch/itcarroll/cloud-PACE/PACE_OCI_L2_AOP/8388608/G4194150056-OB_CLOUD
File space management strategy: H5F_FSPACE_STRATEGY_PAGE
File space page size: 8388608 bytes
Summary of file space information:
File metadata: 320158 bytes
Raw data: 18571705 bytes
Amount/Percent of tracked free space: 14658473 bytes/43.7%
Unaccounted space: 4096 bytes
Total space: 33554432 bytes
Note the page size is 2**23. When I track the fsspec logs, I see this nicely reflected!
xarray.open_dataset(..., engine="h5netcdf", driver_kwds={"page_buf_size": 2}, open_kwargs={"cache_type": "none"})
DEBUG:fsspec:... read: 0 - 8
DEBUG:fsspec:... read: 0 - 8
DEBUG:fsspec:... read: 0 - 48
DEBUG:fsspec:... read: 48 - 560
DEBUG:fsspec:... read: 0 - 8388608
DEBUG:fsspec:... read: 16777216 - 25165824
DEBUG:fsspec:... read: 0 - 8388608
But did you notice that I’m setting page_buf_size=2? How is the page buffer set when you don’t follow the guidance? Since I generally don’t know the page size ahead of time, is it good enough to set the page_buf_size to 1?
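As a sanity check on the log above, here is a minimal sketch that maps each logged byte range onto file space pages. The helper name `page_span` is made up for illustration, and it assumes the end offset in the fsspec debug output is exclusive.

```python
PAGE_SIZE = 8_388_608  # file space page size reported by h5stat (2**23 bytes)

# (start, end) byte ranges copied from the fsspec debug log above.
# Assumption: the end offset is exclusive.
reads = [
    (0, 8), (0, 8), (0, 48), (48, 560),
    (0, 8_388_608), (16_777_216, 25_165_824), (0, 8_388_608),
]


def page_span(start, end, page_size=PAGE_SIZE):
    """Return (first_page, last_page, aligned) for a half-open byte range."""
    first = start // page_size
    last = (end - 1) // page_size
    # A read is page-aligned when both ends fall on page boundaries.
    aligned = start % page_size == 0 and end % page_size == 0
    return first, last, aligned


for start, end in reads:
    first, last, aligned = page_span(start, end)
    kind = "whole page(s)" if aligned else "partial"
    print(f"read {start} - {end}: pages {first}..{last} ({kind})")
```

With the 8 MiB page size, the three large reads come out as whole pages 0, 2, and 0, while the first four are small reads inside page 0.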
Version info:
- hdf5 2.1.0
- h5py 3.16.0
- xarray 2026.4.0
- fsspec 2026.4.0
- s3fs 2026.4.0
I am not sure that page buffering is enabled, because the log output shows the first file page being read twice (read: 0 - 8388608).
You can check what the page buffer size is with something like this:
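import xarray as xr

ds = xr.open_dataset(..., engine="h5netcdf",
                     driver_kwds={"page_buf_size": 2},
                     open_kwargs={"cache_type": "none"})
# Reach through xarray's private attributes to the underlying h5py.File
h5py_file = ds._file_obj.ds._h5file
# The file access property list holds the page buffer settings actually in effect
fapl = h5py_file.id.get_access_plist()
# get_page_buffer_size() returns a tuple; the first element is the size in bytes
page_buf_size = fapl.get_page_buffer_size()[0]
print(f"Current Page Buffer Cache Size: {page_buf_size} bytes")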
The HDF5 library version used here comes with a default page buffer size of 64 MiB, so it would be interesting to verify whether that also applies in the fsspec case. It definitely works for the ros3 driver (driver_kwds = {..., "driver": "ros3"}).
Actually, what may be happening is that the library chose the page buffer size equal to one file page. This would also explain why the first file page was read twice, because in between there was another file page that kicked the first page out of the buffer.
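To illustrate that eviction argument, here is a toy simulation. It is only a sketch: it assumes least-recently-used eviction and the page access order 0, 2, 0 suggested by the log, and `count_fetches` is a made-up helper, not anything from the HDF5 library.

```python
from collections import OrderedDict


def count_fetches(page_accesses, capacity_pages):
    """Count storage fetches for a page access sequence with an LRU buffer."""
    buffer = OrderedDict()  # keys are page indices, ordered oldest to newest
    fetches = 0
    for page in page_accesses:
        if page in buffer:
            buffer.move_to_end(page)  # hit: mark as most recently used
            continue
        fetches += 1  # miss: the page has to be read from storage
        buffer[page] = True
        if len(buffer) > capacity_pages:
            buffer.popitem(last=False)  # evict the least recently used page
    return fetches


accesses = [0, 2, 0]  # assumed page order, taken from the fsspec log
for capacity in (1, 2):
    print(f"{capacity}-page buffer: {count_fetches(accesses, capacity)} fetches")
```

Under these assumptions a one-page buffer fetches page 0 twice (three fetches in total), while a buffer of two or more pages fetches it only once.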
Below are results of my simple tests to explore the library’s page buffer size logic when using the ROS3 driver. Your case is equivalent to the “1 page - 10 bytes” case in the table. The actual page buffer was still one file page although the setting (“Requested”) was less than that.
| File | Paged? | File page size (bytes) | Case | Requested (bytes) | Actual (bytes) | Notes |
|---|---|---|---|---|---|---|
| typical | no | 4,096 | Default | 67,108,864 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 1 page - 10 | 8,388,598 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 1 page + 10 | 8,388,618 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 2 pages - 10 | 16,777,206 | 0 | file not paged → buffering inactive |
| typical | no | 4,096 | 2 pages + 10 | 16,777,226 | 0 | file not paged → buffering inactive |
| cloud optimized | yes | 8,388,608 | Default | 67,108,864 | 67,108,864 | page buffering active |
| cloud optimized | yes | 8,388,608 | 1 page - 10 bytes | 8,388,598 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 1 page + 10 bytes | 8,388,618 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 2 pages - 10 bytes | 16,777,206 | 8,388,608 | page buffering active |
| cloud optimized | yes | 8,388,608 | 2 pages + 10 bytes | 16,777,226 | 16,777,216 | page buffering active |
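The pattern in these results can be summarised in a few lines of Python. This is only a model fitted to the rows above (round the request down to whole pages, but never below one page); it is not taken from the HDF5 source, the function name is hypothetical, and other library versions or drivers may behave differently.

```python
def observed_page_buffer_size(requested, page_size, paged):
    """Model of the 'Actual' values in the test results above (not library code)."""
    if not paged:
        return 0  # files without the PAGE strategy: buffering inactive
    # Round down to a whole number of pages, with a floor of one page.
    return max(requested // page_size, 1) * page_size


PAGE = 8_388_608  # file page size of the cloud optimized test file
requests = [67_108_864, PAGE - 10, PAGE + 10, 2 * PAGE - 10, 2 * PAGE + 10]
for requested in requests:
    actual = observed_page_buffer_size(requested, PAGE, paged=True)
    print(f"requested {requested:>10,} -> actual {actual:>10,}")
```

If the model holds, a request of 2 bytes would also end up as a single 8,388,608-byte page, which matches the reading that page_buf_size=2 behaves like the "1 page - 10 bytes" case.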
Thank you, Aleksandar.
A tip for you and any interested party using a more recent xarray version (since 2026.4.0 at least): with ds an xarray.Dataset, the ds._file_obj attribute referenced above does not exist, but you can still get the _h5file object via ds._close.__self__.ds._root._h5file.
I can confirm that with fsspec and the fileobj driver, the library chooses a page buffer size equal to a whole multiple of the page size. It’s much the same behavior you show using the ROS3 driver. A key difference between the drivers’ approaches to paged files currently seems to be:
- fileobj driver does not read using pages at all, unless a positive page_buf_size is set
- ros3 driver reads using pages, and defaults to a 64 MiB page buffer
In going through this exercise, I’ve realized that the most important thing to highlight might be this: when given a path with the s3:// prefix or a file-like object, the xarray.open_dataset backend will not by default read paged files using pages (much less use a large page buffer). I think this is not a desirable status quo! If I track down an existing issue, or raise a new one (not sure where yet), I will post it here.