How does caching work and how do I reduce disk I/O usage?

I’m using the h5py library to train a deep learning model in PyTorch. I use 20 workers to load batches of 4096 samples in parallel. I’m running my script on 2 different machines and am seeing 2 different behaviors from h5py that I’m trying to understand.
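For reference, my loading pattern has roughly the shape of the sketch below (the file path and dataset name are placeholders); each worker opens the file lazily so file handles aren’t shared across forked processes:

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5Dataset(Dataset):
    """Reads rows from a single 2-D numeric HDF5 dataset; the file is
    opened lazily in each worker so handles aren't shared across forks."""

    def __init__(self, path, dataset_name):
        self.path = path
        self.dataset_name = dataset_name
        self._file = None
        with h5py.File(path, "r") as f:
            self._len = f[dataset_name].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:          # opened once per worker process
            self._file = h5py.File(self.path, "r")
        row = self._file[self.dataset_name][idx]   # one row -> numpy array
        return torch.from_numpy(row)

# placeholder path/name; 20 workers and 4096-row batches as described above
loader = DataLoader(H5Dataset("data.h5", "features"),
                    batch_size=4096, num_workers=20, shuffle=True)
```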

On machine 1, the disk I/O starts off at 500 MB/s and slowly drops to ~0. In this case, I am able to load 4096 × 20 ≈ 82,000 rows per second using h5py.

On machine 2, the disk I/O stays at 400 MB/s and I’m only able to load 4096 × 5 ≈ 20,000 rows per second.

I suspect caching is going on; that would explain why machine 1’s disk I/O drops to nearly zero over time. I’m using h5py’s default settings.

I read the documentation; “chunk caching” seemed like the most relevant topic, but it wasn’t clear where that cache is stored. I see minimal RAM usage while my script is running, definitely not enough to cache the entire HDF5 file.
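From what I can tell from the docs, the chunk cache can be sized per file via the `rdcc_*` keyword arguments to `h5py.File`. A minimal sketch of what I understand that to look like (the path, dataset name, and sizes are just placeholders, not what I’m actually running):

```python
import h5py

# rdcc_nbytes sets the chunk-cache size used for each dataset in this file
# (the default is about 1 MiB); rdcc_nslots should be a prime comfortably
# larger than the number of chunks that fit in the cache; rdcc_w0 close to 1
# evicts fully-read chunks first.
f = h5py.File(
    "data.h5", "r",
    rdcc_nbytes=1024 ** 3,   # 1 GiB of chunk cache
    rdcc_nslots=1_000_003,   # hash-table slots for cached chunks
    rdcc_w0=1.0,
)
dset = f["features"]
print(dset.chunks)  # chunk shape; reads aligned to chunks avoid re-reading
```

Is this the knob I should be turning, or is something else (e.g. the OS page cache) responsible for the behavior I’m seeing?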

If I wanted to get the same fast read rate (~82,000 rows/s) on machine 2, what do I need to do? Do I need to add more RAM?

Update: I’m now quite sure the difference between machines 1 and 2 is due to RAM. Why is HDF5 silently using RAM (I can’t see the usage in “watch free”)? Is there any way to get faster throughput without requiring enough RAM to hold the entire dataset?

I am curious how many events per second you can do with Python, and how it scales… In any ‘event’, here is a recent slide on C++, which can do sustained throughput in the ballpark of the underlying file system.