I am using h5py to create a dataset in Python for an ML task. After researching ways to improve HDF5 performance, I found many suggestions to use chunked storage plus chunk caching. But to my surprise, when I increase the rdcc_nbytes and rdcc_nslots values, performance decreases noticeably.
# rdcc_nslots expects an integer, so the 1e5 / 1e7 values are written as int literals

# Good performance: 1e5 slots, 4 MB cache
h5_file = h5py.File(dpath, rdcc_nslots=100_000, rdcc_nbytes=4 * (1024**2), rdcc_w0=1)

# Bad performance: 1e7 slots, 4 GB cache
h5_file = h5py.File(dpath, rdcc_nslots=10_000_000, rdcc_nbytes=4000 * (1024**2), rdcc_w0=1)
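For context, the access pattern I am timing looks roughly like this (the dataset name "features" and the batch size are placeholders, not my exact code):

dset = h5_file["features"]            # placeholder dataset name
batch = 100                           # matches the chunk length along axis 0
for start in range(0, dset.shape[0], batch):
    x = dset[start:start + batch]     # each read pulls whole chunks through the chunk cache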
The chunk size is (100, 18, 1024).
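For scale (taking float32 as an example dtype; adjust for your own), a single chunk works out to about 7 MiB, i.e. larger than the 4 MB cache of the fast configuration:

import numpy as np

# One chunk of shape (100, 18, 1024) in float32:
chunk_bytes = 100 * 18 * 1024 * np.dtype("float32").itemsize
print(chunk_bytes)   # 7372800 bytes, about 7.03 MiB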
Any explanations, or guides on how to choose these parameters adequately? What should I focus on when choosing the cache size (rdcc_nbytes)? Shouldn't a larger cache be better?
Another issue is that the second configuration (4 GB cache) leads to a memory overflow. I have 12 GB of free memory, and a 4 GB cache should not cause my session to restart. Could these two issues be linked?
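In case it helps, this is roughly how I watch the process footprint while reading (a minimal sketch using psutil, not my exact monitoring code):

import psutil

proc = psutil.Process()
print(f"available memory: {psutil.virtual_memory().available / 1024**3:.2f} GiB")
print(f"process RSS:      {proc.memory_info().rss / 1024**3:.2f} GiB")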