I have a dataset with 25 million rows and 1000 columns. I can make it chunked (and compressed if needed) or contiguous, whatever improves performance when reading, say, 5000 arbitrary rows.
So far I have tried chunk shapes of [5000, 1000] and [1, 1000], as well as a contiguous layout, and in each case it takes about 90-100 seconds to read 5000 arbitrary rows.
Of course, if the rows being read are consecutive, performance is much better: about 280 ms with a contiguous layout and about 500 ms with a chunked one.
Is there any recommendation for my case? Would parallelization help?
I have to use an HDD, but I am free to choose Linux or Windows.
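For reference, a minimal sketch of the access pattern I mean, assuming the data lives in an HDF5 file read with h5py (the file name "data.h5", dataset name "data", and dtype are placeholders):

```python
import numpy as np
import h5py

N_ROWS, N_COLS, N_PICK = 25_000_000, 1000, 5000

# h5py fancy indexing wants sorted, unique indices along an axis.
rows = np.sort(np.random.choice(N_ROWS, size=N_PICK, replace=False))

with h5py.File("data.h5", "r") as f:
    dset = f["data"]        # chunked, e.g. (1, 1000) or (5000, 1000), or contiguous
    subset = dset[rows, :]  # scattered rows -> one seek per touched chunk on HDD

print(subset.shape)         # (5000, 1000)
```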
Trying this with a 100 GB dataset on both an SSD and an HDD, the SSD was about 20x faster than the HDD. So moving to an SSD seems like the best approach, granted they are a bit pricey at larger capacities.
I don't think parallelization will help with an HDD, since in the end you are limited by how fast the drive's read head can seek.