Fastest way to iterate a huge chunked file?

frank0734 · April 24, 2020, 10:24pm

Example:
dset.shape: 5000*10000*10000
dset.chunks: 100*100*100
operation: Iterate over all chunks, and take some random values from each.

Question: What is the fastest way to iterate over all chunks, in terms of loop order, number of chunks read at a time, etc.?

Currently, I am using something like:

# Iteration 1: one block (= one chunk) at a time
for iblock_z in range(50):
  for iblock_y in range(100):
    for iblock_x in range(100):
      data_on_block = dset[iblock_z*100 : (iblock_z+1)*100,
                           iblock_y*100 : (iblock_y+1)*100,
                           iblock_x*100 : (iblock_x+1)*100]
      # continue to other operations

I tried reading several blocks at a time along the last dimension (x) and saw some speedup, but it is inconsistent between files and disks. Is the speedup real or just my impression? Are the chunks stored contiguously, as if laid out in an order like (num_blocks, 100, 100, 100)?
What would be a good strategy if I use MPI parallel reading?
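
For experimenting with how many chunks to read per request, the loop above can be generalized. The following is only a minimal sketch: `chunk_slices` and its `x_run` parameter are hypothetical names, not part of h5py, and whether a larger `x_run` actually helps depends on the file layout and the storage.

```python
from itertools import product

def chunk_slices(shape, chunks, x_run=1):
    """Yield tuples of slices covering `shape` in chunk-aligned blocks.

    `x_run` (hypothetical parameter) is the number of chunks read
    together along the last axis; all other axes advance one chunk
    at a time. The last axis varies fastest, as in the nested loops.
    """
    steps = list(chunks)
    steps[-1] = chunks[-1] * x_run
    ranges = [range(0, n, s) for n, s in zip(shape, steps)]
    for starts in product(*ranges):
        # Clip each block at the dataset boundary
        yield tuple(slice(b, min(b + s, n))
                    for b, s, n in zip(starts, steps, shape))

# Assumed usage with an open h5py dataset `dset`:
# for sel in chunk_slices(dset.shape, dset.chunks, x_run=10):
#     data_on_block = dset[sel]
```

With `x_run=1` this reproduces the one-chunk-at-a-time loop, so the two variants can be timed against each other on the same file.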

gheber · April 27, 2020, 12:34pm

It depends…

A few more questions:

What’s the element type?
Is this a one-off or will you be doing this repeatedly over the same dataset?
Are the chunks compressed (or otherwise filtered)?
What percentage of the values in each chunk do you sample?

If you plan on doing this repeatedly, you could, the first time only, on rank 0, scan the dataset and retrieve the addresses of all chunks with H5Dget_chunk_info[_by_coord] (and keep 'em for later…). You’d divide that list and broadcast the partitions. Depending on the random value ratio, you could read the whole chunk or offset into the chunk. (Yes, they are stored contiguously for fixed-size element types.)

G.
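
The "divide that list" step might look roughly like the sketch below. It rests on several assumptions: `partition_chunks` is a hypothetical helper, each record is a `(chunk_offset, byte_offset, size)` tuple, and sorting by file address before splitting is one possible choice rather than something the reply prescribes. In h5py 3.x built against HDF5 1.10.5 or later, the records could be gathered on rank 0 with `dset.id.get_num_chunks()` and `dset.id.get_chunk_info(i)`.

```python
def partition_chunks(chunk_records, n_ranks):
    """Split chunk records into `n_ranks` lists of near-equal length.

    Each record is assumed to be (chunk_offset, byte_offset, size).
    Records are sorted by byte_offset first (an assumption), so each
    rank gets chunks that are close together in file address order.
    """
    ordered = sorted(chunk_records, key=lambda rec: rec[1])
    base, extra = divmod(len(ordered), n_ranks)
    parts, start = [], 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional record
        stop = start + base + (1 if rank < extra else 0)
        parts.append(ordered[start:stop])
        start = stop
    return parts

# Assumed usage on rank 0 with h5py and mpi4py:
# n = dset.id.get_num_chunks()
# infos = [dset.id.get_chunk_info(i) for i in range(n)]
# records = [(c.chunk_offset, c.byte_offset, c.size) for c in infos]
# parts = partition_chunks(records, comm.Get_size())
# mine = comm.scatter(parts, root=0)  # scatter instead of broadcast
```

The commented usage scatters one partition to each rank instead of broadcasting the whole list; either works, and the scan result can be saved to disk so that later runs skip the scan.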