Large memory consumption when loading many data sets

We want to load arrays from multiple (>=10) HDF5 files, where each file contains a large number of data sets (~300).
When loading the data, the amount of allocated memory corresponds to the data set size, as expected.
However, when merging the content of the individual files by concatenating the corresponding arrays and then deleting the original arrays, the consumed memory is much larger than the data sample size.
In contrast, when the same amount of data is loaded but distributed across a smaller number of data sets, the amount of allocated memory after the concatenation is similar to the data sample size, as expected.
This may be a memory leak or a problem with Python garbage collection.
The problem does not appear when the numpy arrays are generated within the code instead of being loaded via the h5py library, which likely rules out a problem with numpy itself.
Would you have any suggestions on how to solve this problem?
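
For reference, the loading and merging pattern is roughly the following (a minimal sketch; the file names, dataset names, and sizes are placeholders, not our actual data):

import numpy as np
import h5py

# Hypothetical layout: ~10 files, each holding ~300 datasets.
file_names = [f"data_{i}.h5" for i in range(10)]

# Load every dataset from every file fully into memory.
per_file = []
for name in file_names:
    with h5py.File(name, "r") as f:
        per_file.append({key: f[key][()] for key in f})

# Merge: concatenate corresponding datasets across the files...
merged = {key: np.concatenate([d[key] for d in per_file]) for key in per_file[0]}

# ...then delete the per-file arrays; ideally memory drops back to ~one copy.
del per_file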

Unfortunately, I cannot upload attachments here, so I am posting the following screenshots instead:


[screenshot 1]

[screenshot 2]

Hi @stefan,

What does python3 -c 'import h5py; print(h5py.version.info)' give for output? I copied your first set of steps into a test program (attached here so you can review and make sure I didn’t miss anything) and got the following outputs:
test.py (1.2 KB)

4.470348358154297 GiB
4.47161865234375 GiB
8.942760467529297 GiB
4.918529510498047 GiB
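
In case the attachment is hard to access: the numbers above are memory readings printed after each step. A minimal sketch of one way to take such readings (assuming psutil for the process RSS; the exact steps and measurement in test.py may differ):

import os
import psutil

GIB = 1024 ** 3
_proc = psutil.Process(os.getpid())

def report_rss():
    # Print the current resident set size of this process in GiB.
    print(_proc.memory_info().rss / GIB, "GiB")

# Called after each step of the load / concatenate / delete-originals
# sequence to produce the GiB lines shown above.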

So, it may be that this is an issue that was fixed, either in HDF5 or h5py. For reference, this is with:

Summary of the h5py configuration
---------------------------------

h5py    3.11.0
HDF5    1.15.0
Python  3.9.17 (main, Jun  8 2023, 14:52:17) 
[GCC 13.1.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.26.3
cython (built with) 3.0.10
numpy (built against) 2.0.0rc1
HDF5 (built against) 1.15.0

Thanks for the quick reply @jhenderson

It gives:

Summary of the h5py configuration
---------------------------------

h5py    3.10.0
HDF5    1.14.2
Python  3.9.18 (main, Jan  4 2024, 00:00:00)
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.24.3
cython (built with) 0.29.36
numpy (built against) 1.19.3
HDF5 (built against) 1.14.2

When I run your test script I get the following output, which seems to be better than the result in the Jupyter notebook, but it still shows the problem.

$ python test.py
4.470348358154297 GiB
4.47119140625 GiB
8.942459106445312 GiB
5.8122406005859375 GiB

Interesting; I wonder if this isn’t just a difference in something within the Jupyter notebook. I rebuilt h5py against HDF5 1.14.3 and 1.14.2 and got the same results, though I’d double-check to make sure I didn’t miss anything when copying your steps over to the test script:

Summary of the h5py configuration
---------------------------------

h5py    3.11.0
HDF5    1.14.3
Python  3.9.17 (main, Jun  8 2023, 14:52:17) 
[GCC 13.1.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.26.3
cython (built with) 3.0.10
numpy (built against) 2.0.0rc1
HDF5 (built against) 1.14.3

bash-5.2$ python3 test.py 
4.470348358154297 GiB
4.471435546875 GiB
8.942684173583984 GiB
4.91845703125 GiB
Summary of the h5py configuration
---------------------------------

h5py    3.11.0
HDF5    1.14.2
Python  3.9.17 (main, Jun  8 2023, 14:52:17) 
[GCC 13.1.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.26.3
cython (built with) 3.0.10
numpy (built against) 2.0.0rc1
HDF5 (built against) 1.14.2

bash-5.2$ python3 test.py 
4.470348358154297 GiB
4.47015380859375 GiB
8.941429138183594 GiB
4.915702819824219 GiB

I’m not sure what memory consumption refers to in the case of a Jupyter notebook. You have a web app (the Jupyter notebook instance) that talks to a separate process running the Python kernel (ipykernel). So it gets complicated…
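
For what it’s worth, a quick way to check which process you are actually measuring from inside a notebook cell (a sketch, assuming psutil is installed):

import os
import psutil

# In a notebook cell this is the ipykernel process, not the Jupyter server
# or the browser, each of which has its own memory footprint.
proc = psutil.Process(os.getpid())
print(proc.name(), proc.pid)
print(proc.memory_info().rss / 1024**3, "GiB resident")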

Jupyter notebooks are really cool, but I would be very careful before reporting a problem straight at the h5py/libhdf5 level.

-Aleksandar

I agree that the situation in the Jupyter notebook may be a little more complicated.
However, the numbers I quoted above in this post (Large memory consumption when loading many data sets - #3 by stefan) were obtained by calling the test.py script from @jhenderson in a shell, i.e. python3 test.py. There I still see some additional memory usage in the last printout, although it is not as large as in the case of the Jupyter notebook.