Why does increasing rdcc_nbytes and rdcc_nslots decrease indexing performance?

farhoodjunkmail · October 16, 2021, 8:22pm

I am using h5py to create a dataset in Python for an ML task. While researching ways to improve HDF5 performance, I found that many people suggest chunked storage plus caching. But when I increase the rdcc_nbytes and rdcc_nslots values, performance decreases noticeably, to my surprise.

# Good performance, 1e5 slots, 4MB cache
h5_file = h5py.File(dpath, rdcc_nslots=1e5, rdcc_nbytes=4 * (1024**2), rdcc_w0=1)

# Bad performance, 1e7 slots, 4GB cache
h5_file = h5py.File(dpath, rdcc_nslots=1e7, rdcc_nbytes=4000 * (1024**2), rdcc_w0=1)

The chunk size is (100, 18, 1024).

Are there any explanations, or guidance on how to choose these parameters adequately? What should I focus on when setting the cache_mem size? Shouldn’t a larger cache_mem size be better?

Another issue is that the second approach (4GB cache) leads to a memory overflow. I have 12 GB of free memory, so a 4GB cache should not cause my session to restart. I was wondering whether these two issues are linked.
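For context, here is a rough sketch of the arithmetic behind the two configurations above. The element type of the dataset is not stated in this post, so 4-byte float32 elements are an assumption, and the helper names are made up for illustration.

```python
# Assumption: float32 elements (4 bytes each); the dtype is not given in the post.
ITEMSIZE = 4
CHUNK_SHAPE = (100, 18, 1024)


def chunk_nbytes(shape, itemsize):
    """Return the uncompressed size in bytes of one chunk."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n


def chunks_that_fit(cache_nbytes, shape, itemsize):
    """Return how many whole chunks fit in a cache of cache_nbytes bytes."""
    return cache_nbytes // chunk_nbytes(shape, itemsize)


small_cache = 4 * 1024**2      # rdcc_nbytes of the first configuration
large_cache = 4000 * 1024**2   # rdcc_nbytes of the second configuration

print(chunk_nbytes(CHUNK_SHAPE, ITEMSIZE))                  # 7372800
print(chunks_that_fit(small_cache, CHUNK_SHAPE, ITEMSIZE))  # 0
print(chunks_that_fit(large_cache, CHUNK_SHAPE, ITEMSIZE))  # 568
```

Under that assumption a single chunk (about 7 MiB) is larger than the whole 4 MiB cache. The HDF5 documentation on chunking describes chunks that do not fit in the chunk cache as bypassing it, so the two configurations may differ in more than just cache size. With a different dtype the numbers change accordingly.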

gheber · October 18, 2021, 9:44pm

Just a quick pointer for now: There is a neat RFC: Setting Raw Data Chunk Cache Parameters in HDF5 in our growing collection. It’s from 2008 and may not be 100% up-to-date, but probably a good starting point.
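As a small illustration of the kind of rule of thumb involved: the h5py documentation for File suggests that rdcc_nslots should ideally be a prime number, roughly 10 to 100 times the number of chunks that fit in rdcc_nbytes. The sketch below only does that arithmetic; the helper names (is_prime, next_prime, suggest_nslots) are hypothetical and not part of h5py or HDF5.

```python
def is_prime(n):
    """Trial-division primality check, adequate for slot counts of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def suggest_nslots(cache_nbytes, chunk_nbytes, factor=100):
    """Hypothetical helper: a prime near factor * (chunks that fit in the cache)."""
    n_chunks = max(cache_nbytes // chunk_nbytes, 1)
    return next_prime(n_chunks * factor)


# Example: a 64 MiB cache holding 1 MiB chunks.
cache_nbytes = 64 * 1024**2
nslots = suggest_nslots(cache_nbytes, 1024**2)
print(nslots)

# Possible use (not executed here; the path is a placeholder):
# h5py.File("data.h5", "r", rdcc_nbytes=cache_nbytes, rdcc_nslots=nslots)
```

Both values are plain Python integers here, rather than floats such as 1e5. Whether this heuristic is the best choice for a particular access pattern is something the RFC discusses in more depth.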

Best, G.

farhoodjunkmail · October 19, 2021, 6:16am

Hello and thanks!

I’ve also discovered that h5py currently has a memory leak issue, which contributed to worsening my problem. I’ve posted a comment on the related GitHub issue. I’m mentioning it here so that others who have performance issues can take note of it too.

ajelenak · October 20, 2021, 2:14pm

Hi @farhoodjunkmail,

Can you please post which version of h5py and HDF5 library you are using? You can gather this information by executing:

 $ python -c "import h5py; print(h5py.version.info)"

-Aleksandar

farhoodjunkmail · October 21, 2021, 8:13am

@ajelenak
This is the Colab example I’ve created to show what I mean.

The output of h5py.version.info:

Summary of the h5py configuration
---------------------------------

h5py    3.5.0
HDF5    1.12.1
Python  3.7.12 (default, Sep 10 2021, 00:21:48) 
[GCC 7.5.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.19.5
cython (built with) 0.29.24
numpy (built against) 1.14.5
HDF5 (built against) 1.12.1
