I’m working on a tool where the core functionality is passing data from one HDF5 file to another (so the layout can be changed) and optionally doing simple transformations on the data (such as scaling). The input file will likely have individual datasets larger than working memory, so it would be ideal if this operation could be done in chunks. The tool may ultimately be in compiled code, but I wanted to check h5py behavior first.
Running this MWE under a memory profiler, I had hoped to see that
- generation consumed as much memory as the full dataset,
- transferring from one file to another took “less” than the full dataset size,
- applying a chunked transformation took “less” than the full dataset size.
```python
# The @profile decorator requires running under
# https://pypi.org/project/memory-profiler/ via `mprof run mwe.py`
import h5py
import numpy as np

DIM = 1000
NDIM = 3
DSET_NAME = "x"
FNAME_TO = "transfer_to.h5"
FNAME_FROM = "transfer_from.h5"


# @profile
def generate() -> None:
    with h5py.File(FNAME_FROM, "w") as file_from:
        file_from.create_dataset(
            name=DSET_NAME,
            data=np.random.random(size=[DIM for _ in range(NDIM)]),
            compression="gzip",
            chunks=tuple(100 for _ in range(NDIM)),
        )


# @profile
def transfer() -> None:
    with h5py.File(FNAME_TO, "w") as file_to, h5py.File(FNAME_FROM, "r") as file_from:
        file_from.copy(source=file_from[DSET_NAME], dest=file_to)


# @profile
def transform() -> None:
    # can't transform during copy, so do it in-place
    with h5py.File(FNAME_TO, "r+") as file_to:
        dset = file_to[DSET_NAME]
        for s in dset.iter_chunks():
            dset[s] += 2500.0


generate()
transfer()
transform()
```
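For reference, since `Group.copy` is all-or-nothing, the alternative I considered was skipping `copy` entirely and creating the destination dataset myself, transforming each chunk on the way through in a single pass. A sketch of that variant, reusing the names from the MWE (`transfer_and_transform` and its `offset` parameter are hypothetical, not part of the tool):

```python
import h5py
import numpy as np

DSET_NAME = "x"
FNAME_TO = "transfer_to.h5"
FNAME_FROM = "transfer_from.h5"


def transfer_and_transform(offset: float = 2500.0) -> None:
    """Copy the dataset chunk by chunk, applying the transformation in flight."""
    with h5py.File(FNAME_FROM, "r") as file_from, h5py.File(FNAME_TO, "w") as file_to:
        src = file_from[DSET_NAME]
        # Create an empty destination dataset with the same layout as the
        # source, then fill it one chunk at a time.
        dst = file_to.create_dataset(
            name=DSET_NAME,
            shape=src.shape,
            dtype=src.dtype,
            compression=src.compression,
            chunks=src.chunks,
        )
        for s in src.iter_chunks():
            dst[s] = src[s] + offset
```

In principle only one chunk of the source should be resident at a time here, which is the behavior I was hoping `transform()` would show.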
According to the profile at https://github.com/berquist/h5py_tranform_mwe/blob/main/mprofile_20220831110323.pdf, dataset generation and transfer behave exactly as hoped, but the transformation doesn’t: its memory usage grows toward the full dataset size. Am I doing something wrong?