I have a dataset with 25 million rows and 1000 columns. I can make it chunked (and compressed if needed) or contiguous, whatever improves performance when reading, say, 5000 arbitrary rows.
So far I have tried chunk shapes of [5000, 1000] and [1, 1000], as well as a contiguous layout, and in each case it takes about 90-100 seconds to read 5000 arbitrary rows.
Of course, if the rows being read are consecutive, performance is much better: about 280 ms with a contiguous layout and about 500 ms with a chunked one.
Is there any recommendation for my case? Would parallelization help?
I have to use an HDD, but I am free to choose Linux or Windows.
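For reference, a minimal sketch of the access pattern I mean, assuming the data lives in an HDF5 file read with h5py (the file name "data.h5", dataset name "data", and dtype are placeholders):

```python
import numpy as np
import h5py

N_ROWS, N_COLS, N_PICK = 25_000_000, 1000, 5000

# h5py fancy indexing wants sorted, unique indices along an axis.
rows = np.sort(np.random.choice(N_ROWS, size=N_PICK, replace=False))

with h5py.File("data.h5", "r") as f:
    dset = f["data"]        # chunked, e.g. (1, 1000) or (5000, 1000), or contiguous
    subset = dset[rows, :]  # scattered rows -> one seek per touched chunk on HDD

print(subset.shape)         # (5000, 1000)
```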
Trying this with a 100 GB dataset on both an SSD and an HDD, the SSD was about 20x faster than the HDD. So moving to an SSD seems like the best approach, granted they are a bit pricey at larger capacities.
I don't think parallelization will help with an HDD, since in the end you are limited by how fast the drive's read head can seek.