Memory Leak while Reading Data?


#1

I’ve got about 100,000 HDF5 files I’m trying to read using h5py. One of the datasets is a Compound data type that I’m reading just a few bytes from, converting to a string using codecs.decode, and adding to a dictionary; then I close the file and move on. As I read files, my code gets progressively slower per file, even if I don’t keep the dictionaries and just open, read, and close. This only happens on my Mac (a 2017 MacBook Pro) and not on Windows, regardless of whether I use h5py 2.X or 3.X. Is there something I should be doing differently to prevent this from happening on my Mac?

Code example I’m using to diagnose the problem is (with the ‘/Record/Labels/Values/’ dataset being the one I’m trying to read):

import h5py as h5
import glob
import codecs
import time


def makeLabelDict(filename):
    with h5.File(filename, 'r') as file:
        Labels = file['Record']['Labels']
        AllLabelNames = Labels['Names'][:]
        AllLabelValues = Labels['Values'][:]

    AllNames = [codecs.decode(name[0], 'UTF-8') for name in AllLabelNames]
    label_dict = {}
    for name in AllNames:
        labelValues = AllLabelValues[name]
        label_dict[name] = codecs.decode(labelValues[0][0], 'UTF-8')

    return label_dict


folder = './Records/'

filelist = glob.glob(folder + '*.h5')

allFileInfo = []

start = time.time()
for i, filename in enumerate(filelist[0:]):
    label_dict = makeLabelDict(filename)

    allFileInfo.append(label_dict)
    # print the elapsed time for each batch of 10,000 files
    if i % 10000 == 0:
        print(i)
        print(time.time() - start)
        start = time.time()

and the output (which should kick out the time to read each 10,000 files) is:
0
0.00923013687133789
10000
54.41268610954285
20000
81.57614088058472
30000
109.91995120048523
40000
137.28819823265076
50000
168.89158034324646
60000
197.9121537208557
70000
227.07507610321045
80000
250.14627695083618
90000
275.6847689151764

I’m not sure if this is something I’m doing wrong in my code, something unique to my machine, or an issue with the package, but any suggestions would be welcome!


#2

The markdown/indentation is a little garbled, but I don’t see anything wrong with your code. The with statement should take care of open HDF5 handles. Presumably, there isn’t any kind of lazy evaluation, clinging on to resources, going on here. How fast is the memory footprint growing and is there any way to see what it’s being used for?
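One way to start answering the "what is it being used for" question is the standard-library tracemalloc module. The sketch below is only an illustration: `top_allocations` and `fake_read` are hypothetical names, and `fake_read` is a stand-in that leaks on purpose so the example runs without any HDF5 files. Note that tracemalloc only sees memory allocated through Python's allocators, so anything held inside the HDF5 C library itself would not show up here.

```python
import tracemalloc


def top_allocations(work, repeats=1000, limit=5):
    """Call work() repeatedly and return the source lines whose allocations changed most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(repeats):
        work()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() returns StatisticDiff objects, largest absolute change first
    return after.compare_to(before, "lineno")[:limit]


# Hypothetical stand-in for makeLabelDict(filename): it keeps every buffer alive on purpose.
leaked = []


def fake_read():
    leaked.append(bytearray(1024))


for stat in top_allocations(fake_read):
    print(stat)
```

In the real script, `work` could be something like `lambda: makeLabelDict(filelist[0])`. If nothing on the Python side grows, that would suggest looking at the process's total memory (for example in Activity Monitor) rather than at Python objects.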

G.


#3

@jessop5 Python comes with a gc module that provides an interface to the garbage collector. I’d suggest importing it and calling gc.collect() right after the with statement block. That should help you assess whether you have a memory leak or not. Good luck!
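A minimal sketch of that experiment, under some assumptions: `process_files`, `collect_every`, and `fake_reader` are hypothetical names introduced here, and `fake_reader` stands in for makeLabelDict so the example runs without any HDF5 files. The loop forces a collection after each read and records the time per batch, so runs with and without collection can be compared.

```python
import gc
import time


def process_files(filenames, read_one, batch=10000, collect_every=1):
    """Read each file, optionally force garbage collection, and time each batch."""
    results, timings, unreachable = [], [], 0
    start = time.perf_counter()
    for i, filename in enumerate(filenames, start=1):
        results.append(read_one(filename))
        # collect_every=0 disables the forced collection for comparison runs
        if collect_every and i % collect_every == 0:
            # gc.collect() returns the number of unreachable objects it found
            unreachable += gc.collect()
        if i % batch == 0:
            now = time.perf_counter()
            timings.append(now - start)
            start = now
    return results, timings, unreachable


# Hypothetical stand-in for makeLabelDict(filename)
def fake_reader(filename):
    return {"name": filename}


results, timings, unreachable = process_files(
    [f"file_{n}.h5" for n in range(6)], fake_reader, batch=2
)
print(len(results), timings, unreachable)
```

If the per-batch times stop growing when collection is forced, accumulated garbage is a plausible culprit; if they keep growing, the cause is probably elsewhere. A full collection on every file has its own cost, so a larger `collect_every` may be a fairer comparison.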