Chunked transformation using full dset size in memory?


#1

I’m working on a tool where the core functionality is passing data from one HDF5 file to another (so the layout can be changed) and optionally doing simple transformations on the data (such as scaling). The input file will likely have individual datasets larger than working memory, so it would be ideal if this operation could be done in chunks. The tool may ultimately be in compiled code, but I wanted to check h5py behavior first.

Running this MWE under a memory profiler, I had hoped to see that

  • generation consumed as much memory as the full dataset,
  • transferring from one file to another took “less” than the full dset size, and
  • applying a chunked transformation took “less” than the full dset size.

# The profile decorator requires running under
# https://pypi.org/project/memory-profiler/ via `mprof run mwe.py`

import h5py
import numpy as np

DIM = 1000
NDIM = 3
DSET_NAME = "x"

FNAME_TO = "transfer_to.h5"
FNAME_FROM = "transfer_from.h5"


# @profile
def generate() -> None:
    with h5py.File(FNAME_FROM, "w") as file_from:
        dset = file_from.create_dataset(
            name=DSET_NAME,
            data=np.random.random(size=[DIM for _ in range(NDIM)]),
            compression="gzip",
            chunks=tuple(100 for _ in range(NDIM)),
        )


# @profile
def transfer() -> None:
    with h5py.File(FNAME_TO, "w") as file_to, h5py.File(FNAME_FROM, "r") as file_from:
        file_from.copy(source=file_from[DSET_NAME], dest=file_to)


# @profile
def transform() -> None:
    # can't transform during copy, so do it in-place
    with h5py.File(FNAME_TO, "r+") as file_to:
        dset = file_to[DSET_NAME]
        for s in dset.iter_chunks():
            dset[s] += 2500.0


generate()
transfer()
transform()

According to the profile at https://github.com/berquist/h5py_tranform_mwe/blob/main/mprofile_20220831110323.pdf, the dset generation and transfer behave exactly as hoped, but the transformation doesn’t. Am I doing something wrong?
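For comparison, the same chunked transform can also be written with an explicit read/modify/write per chunk, which makes the per-chunk memory traffic visible. This is only a sketch: `transform_explicit` is a hypothetical helper name, and the `+ 2500.0` mirrors the MWE above.

```python
import h5py


def transform_explicit(fname: str, dset_name: str) -> None:
    """Add 2500.0 to a chunked dataset, one chunk selection at a time."""
    with h5py.File(fname, "r+") as f:
        dset = f[dset_name]
        for s in dset.iter_chunks():
            block = dset[s]      # read only this chunk's selection into memory
            block += 2500.0      # modify the in-memory block
            dset[s] = block      # write the block back to the same selection
```

Semantically this is the same as `dset[s] += 2500.0` (Python expands that to a read, an in-place add, and a write-back), so it should show the same memory profile; it just separates the three steps for profiling.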


#2

You’re not doing anything obviously wrong. I’ve just run your code (the only change was reducing DIM a bit), and I see a graph like the one below, which I imagine is what you’re aiming for:

[memory profile graph]

Check what version of h5py and HDF5 you’re using, like this:

>>> import h5py
>>> print(h5py.version.info)
Summary of the h5py configuration
---------------------------------

h5py    3.7.0
HDF5    1.12.2
Python  3.10.7 (main, Sep  7 2022, 00:00:00) [GCC 12.2.1 20220819 (Red Hat 12.2.1-1)]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.21.4
cython (built with) 0.29.30
numpy (built against) 1.21.6
HDF5 (built against) 1.12.2
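As an aside, the copy and the transform can also be fused into a single chunked pass, so the transformed data is written directly to the destination and the full dataset is never resident. A sketch, assuming the same chunked layout as the MWE: `transfer_transform` is a hypothetical helper name, and `create_dataset_like` copies the source dataset’s shape, dtype, chunking, and compression settings.

```python
import h5py


def transfer_transform(src_name: str, dst_name: str, dset_name: str) -> None:
    """Copy a chunked dataset to a new file, applying a transform per chunk."""
    with h5py.File(src_name, "r") as src, h5py.File(dst_name, "w") as dst:
        d_in = src[dset_name]
        # Destination dataset with the same shape/dtype/chunks/compression.
        d_out = dst.create_dataset_like(dset_name, d_in)
        for s in d_in.iter_chunks():
            # Read one chunk selection, transform it, write it to the copy.
            d_out[s] = d_in[s] + 2500.0
```

This sidesteps the “can’t transform during copy” limitation of `Group.copy` by doing the copy manually, at the cost of not preserving attributes or links automatically.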