HDF5 split file with MPI I/O; and help for reading many small datasets

Thank you for clearing that up.

We’d like to store digital reconstructions of neurons, which we’ll call morphologies. Imagine a graph of cylinders. Each morphology consists of two datasets. The first, “points”, has shape (N, 4) and stores the start and end point of each cylinder together with the radius of the “cylinder” at each of those points. You can safely ignore the fact that these are variable-radius “cylinders” (I will from now on). The second, “structure”, is a list of integer offsets of each branch (a branch is a sequence of cylinders in the graph without branching).
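To make the layout concrete, here is a minimal sketch in plain Python of how one branch could be sliced out of the points. It rests on assumptions that are not stated above: that each row of “points” is (x, y, z, radius), that the first column of each “structure” row is the offset of the branch’s first point, and that a branch runs up to the next branch’s offset. The helper name branch_points and all values are made up for illustration.

```python
# Toy stand-in for the two datasets of one morphology (values are invented).
# ASSUMPTION: each row of `points` is (x, y, z, radius).
points = [
    (0.0, 0.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 0.9),
    (2.0, 0.0, 0.0, 0.8),
    (2.0, 1.0, 0.0, 0.5),
    (2.0, 2.0, 0.0, 0.4),
    (3.0, 0.0, 0.0, 0.5),
    (4.0, 0.0, 0.0, 0.3),
]

# One row per branch.
# ASSUMPTION: column 0 is the offset of the branch's first point in `points`.
# The other two columns are not interpreted in this sketch.
structure = [
    (0, 0, 0),
    (3, 0, 0),
    (5, 0, 0),
]


def branch_points(points, structure, i):
    """Return the points of branch i (hypothetical helper)."""
    start = structure[i][0]
    # The last branch extends to the end of `points`.
    stop = structure[i + 1][0] if i + 1 < len(structure) else len(points)
    return points[start:stop]


# Number of points in each branch of the toy morphology.
branch_sizes = [len(branch_points(points, structure, i)) for i in range(len(structure))]
```

The point of the sketch is only that reading one morphology always means reading both datasets of its group together.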

More to the point, we have one group per morphology, and each group looks like this:

$ h5dump -g "000/00295" -H morphologies.h5
      GROUP "000" {
         GROUP "00295" {
            DATASET "points" {
               DATATYPE  H5T_IEEE_F32LE
               DATASPACE  SIMPLE { ( 6734, 4 ) / ( 6734, 4 ) }
            }
            DATASET "structure" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 217, 3 ) / ( 217, 3 ) }
            }
         }
      }

The size of these files depends on the number of morphologies, which depends on the size of the region of interest. We’re testing files with 1k, 10k, 100k, 1M and 10M morphologies, and we’d like to reach (just under) 100M. The 100k file is 19GB, which means about 200kB per group (on average).
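The arithmetic behind these numbers, as a small Python sketch. It takes 1GB to mean 1e9 bytes, counts only the raw payload of the example group shown in the h5dump output (no HDF5 metadata), and the extrapolation to 100M assumes the average size per group stays the same, which is an assumption and not a measurement.

```python
# Figures quoted in the post (taking 1GB = 1e9 bytes).
file_size_bytes = 19e9
n_groups = 100_000

# Average size per group: 190_000 bytes, i.e. about 200kB.
bytes_per_group = file_size_bytes / n_groups

# Raw payload of the example group from the h5dump output.
points_bytes = 6734 * 4 * 4     # (6734, 4) values of 4-byte float32
structure_bytes = 217 * 3 * 4   # (217, 3) values of 4-byte int32
example_group_bytes = points_bytes + structure_bytes

# Naive linear extrapolation to the 100M target: about 19TB.
projected_bytes = bytes_per_group * 100_000_000
```

The example group carries roughly 110kB of raw data, so it is somewhat smaller than the average group.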

File formats we’re investigating:

  • Vanilla H5F_LIBVER_V110,
  • Page-allocated files with a 16kB page size.

Thanks to Elena Pourmal and John Mainzer for suggesting page-allocated files. As a baseline for comparison (and our current solution) we’re also storing each group in its own HDF5 file, with the same folder structure as in the HDF5 files described previously.

As for access patterns, we’re interested in reading either the entire file or a subset of the groups once. There are two patterns:

  • random access of a subset,
  • optimal order.

The precise semantics of “optimal order” are still open. I currently believe it is the order in which the groups were created, since that should maximize metadata reuse and access the file (mostly) in order.
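A minimal sketch of how the two access orders could be generated for a benchmark, in plain Python. The group names are invented to follow the "000/00295" pattern, the assumption of 1000 morphologies per top-level group is mine, and “optimal order” is taken to mean creation order as described above.

```python
import random

# Hypothetical group names in creation order.
# ASSUMPTION: 1000 morphologies per top-level group; the real layout may differ.
creation_order = [f"{i // 1000:03d}/{i:05d}" for i in range(100_000)]

# Position of every group in creation order, used as the sort key below.
rank = {name: i for i, name in enumerate(creation_order)}

# Pattern 1: random access of a subset (fixed seed so runs are repeatable).
rng = random.Random(0)
random_subset = rng.sample(creation_order, 10_000)

# Pattern 2: the same subset, but visited in creation order.
optimal_subset = sorted(random_subset, key=rank.__getitem__)

# Variant of pattern 2: simply the first 10k groups that were created.
first_10k = creation_order[:10_000]
```

Sorting the same random subset into creation order separates the effect of the access order from the effect of which groups are read.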

Since I don’t need to get split files working, I’d like to finish the measurements over a reasonable range of parameters and report back with reliable numbers.

However, what we’re observing is that for a 100k file, randomly accessing 10k morphologies is roughly 2x slower than the baseline of loading the same 10k morphologies when each morphology is stored in its own file. Accessing the first 10k morphologies in optimal order is roughly 10x faster than the baseline. This is from one MPI rank.

For concurrent read performance, the new feature for reading multiple datasets collectively, i.e. H5Dread_multi, appears to us to hold some promise (especially when accessing large fractions of the file). Would you agree it’s worth pursuing that route?