Fastest way to read arbitrary rows from big h5 file



I have a dataset with 25 million rows and 1000 columns. I can make it chunked (and compressed if needed) or contiguous, anything to improve performance when reading, say, 5000 arbitrary rows.

So far I have tried making it chunked with chunk shapes [5000, 1000] and [1, 1000], as well as contiguous, and the performance was about 90-100 seconds to read 5000 arbitrary rows.
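For context, here is a minimal sketch of how those layouts could be set up; h5py, the file and dataset names, the float32 dtype, and the use of gzip compression are my assumptions, not details from the post:

```python
import h5py

N_ROWS, N_COLS = 25_000_000, 1000  # dataset shape from the post

# Contiguous layout: simply omit the chunks argument.
with h5py.File("rows_contiguous.h5", "w") as f:
    f.create_dataset("data", shape=(N_ROWS, N_COLS), dtype="float32")

# Chunked layout with one row per chunk (the [1, 1000] case above).
# For the [5000, 1000] case, use chunks=(5000, N_COLS) instead.
with h5py.File("rows_chunked.h5", "w") as f:
    f.create_dataset("data", shape=(N_ROWS, N_COLS), dtype="float32",
                     chunks=(1, N_COLS),
                     compression="gzip")  # optional, as mentioned in the post
```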

Of course, if the rows read are consecutive then performance is much faster: 280 ms with a contiguous layout and about 500 ms with a chunked layout.

Is there any recommendation for my case? Might parallelization help?
I have to use an HDD, but I am free to choose Linux or Windows.
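For what it's worth, this is the kind of timing sketch I have in mind, again assuming h5py and the placeholder file/dataset names from above. h5py fancy indexing requires unique row indices in increasing order, and sorting them also keeps the disk seeks moving in one direction:

```python
import time
import numpy as np
import h5py

N_PICK = 5000

with h5py.File("rows_chunked.h5", "r") as f:
    dset = f["data"]
    rng = np.random.default_rng(0)
    # Unique, sorted row indices: required by h5py point selections,
    # and sorted order avoids jumping the seek head back and forth.
    rows = np.sort(rng.choice(dset.shape[0], size=N_PICK, replace=False))

    t0 = time.perf_counter()
    out = dset[rows, :]  # one selection call instead of 5000 separate reads
    print(f"read {out.shape[0]} rows in {time.perf_counter() - t0:.1f} s")
```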

Upload 55 Gb hdf5 file to Kita Lab server

For anyone interested, this will be one of the topics in today’s Call the Doctor session: Call the Doctor - Weekly HDF Clinic.


I created this test that tracks performance with arbitrary row selections.

Trying it with a 100 GB dataset on both SSD and HDD, the SSD version was 20x faster than the HDD one. So setting up an SSD drive seems like the best approach, granted they are a bit pricey for larger capacities.

I don't think parallelization will help with an HDD, since in the end you are limited by how fast the seek head on the drive can move.