I am using h5py to create a dataset in Python for an ML task. After researching ways to improve HDF5 performance, I found many suggestions to use chunked storage plus chunk caching. But to my surprise, when I increase the rdcc_nbytes and rdcc_nslots values, performance decreases noticeably.
# rdcc_nslots expects an integer, so the 1e5 / 1e7 values are written as int literals

# Good performance: 1e5 slots, 4 MB cache
h5_file = h5py.File(dpath, rdcc_nslots=100_000, rdcc_nbytes=4 * (1024**2), rdcc_w0=1)

# Bad performance: 1e7 slots, 4 GB cache
h5_file = h5py.File(dpath, rdcc_nslots=10_000_000, rdcc_nbytes=4000 * (1024**2), rdcc_w0=1)
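For context, the access pattern I am timing looks roughly like this (the dataset name "features" and the batch size are placeholders, not my exact code):

dset = h5_file["features"]            # placeholder dataset name
batch = 100                           # matches the chunk length along axis 0
for start in range(0, dset.shape[0], batch):
    x = dset[start:start + batch]     # each read pulls whole chunks through the chunk cache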
The chunk size is (100, 18, 1024).
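For scale (taking float32 as an example dtype; adjust for your own), a single chunk works out to about 7 MiB, i.e. larger than the 4 MB cache of the fast configuration:

import numpy as np

# One chunk of shape (100, 18, 1024) in float32:
chunk_bytes = 100 * 18 * 1024 * np.dtype("float32").itemsize
print(chunk_bytes)   # 7372800 bytes, about 7.03 MiB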
Any explanations, or guides on how to choose these parameters adequately? What should I focus on when choosing the cache size (rdcc_nbytes)? Shouldn't a larger cache be better?
Another issue is that the second configuration (4 GB cache) leads to a memory overflow. I have 12 GB of free memory, and a 4 GB cache should not cause my session to restart. Could these two issues be linked?
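In case it helps, this is roughly how I watch the process footprint while reading (a minimal sketch using psutil, not my exact monitoring code):

import psutil

proc = psutil.Process()
print(f"available memory: {psutil.virtual_memory().available / 1024**3:.2f} GiB")
print(f"process RSS:      {proc.memory_info().rss / 1024**3:.2f} GiB")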