Parallel HDF5 Speed Bottleneck

Hello all,

I am writing parallel modal decomposition software for CFD data in C++, using PHDF5 for file I/O on a Lustre filesystem. The transient CFD data consists of multiple variables anchored to an overset grid topology with 2 or 3 spatial dimensions. The N grids are non-uniform. In the HDF5 file each grid is allocated its own dataset, so the result is a group of N 4- or 5-D datasets [2-3 space, solution, time]. Because our group works on many different CFD projects, I have no basis for choosing chunk sizes, since every grid collection differs from the last. Furthermore, the files I am working with can be 90 GB+ while my memory per process is capped at 400 MB, which leads to irregular hyperslabbing.

I’ve tried to read the file in two ways:

  1. Each rank reads an equal share of every dataset.
  2. Each rank reads an equal share of the total data, but the split is computed as if all datasets were one contiguous array, i.e. not all collective reads are the same size, and some can be zero.

I know that the overhead associated with each read can be a bottleneck, so I am using option 2 because it usually results in fewer reads.
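For concreteness, the split in option 2 can be computed independently of HDF5. Below is a minimal sketch (the function name and signature are my own, not from any library): given the element count of each dataset, it returns the (offset, count) pair each rank should select in each dataset, with a zero count for datasets the rank skips but must still touch with an empty selection in the collective read:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Per-dataset (offset, count) a given rank must read, in elements.
// dataset_sizes: total element count of each dataset, in file order.
// The concatenation of all datasets is split into nranks contiguous,
// near-equal pieces; a rank's piece may cover part of one dataset,
// span several, or miss a dataset entirely (count == 0).
std::vector<std::pair<uint64_t, uint64_t>>
contiguous_split(const std::vector<uint64_t>& dataset_sizes,
                 int nranks, int rank)
{
    uint64_t total = 0;
    for (uint64_t s : dataset_sizes) total += s;

    // Rank r owns global element range [begin, end); the first
    // (total % nranks) ranks get one extra element each.
    uint64_t base = total / nranks, rem = total % nranks;
    uint64_t begin = uint64_t(rank) * base + std::min<uint64_t>(rank, rem);
    uint64_t end   = begin + base + (uint64_t(rank) < rem ? 1 : 0);

    std::vector<std::pair<uint64_t, uint64_t>> out;
    uint64_t ds_start = 0;
    for (uint64_t s : dataset_sizes) {
        uint64_t ds_end = ds_start + s;
        // Intersect the rank's global range with this dataset's range.
        uint64_t lo = std::max(begin, ds_start);
        uint64_t hi = std::min(end, ds_end);
        out.emplace_back(lo < hi ? lo - ds_start : 0,
                         lo < hi ? hi - lo : 0);
        ds_start = ds_end;
    }
    return out;
}
```

Each returned entry then drives one `H5Sselect_hyperslab` (or `H5Sselect_none` when the count is zero) so that every rank participates in every collective call.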

I’ve seen on here that people can read TB-sized files rather swiftly, but even with an adequate stripe count/size it can take me 10-20 minutes to read my 90 GB file with 24 processes. Also, read time does not always decrease as I add processes; at certain process counts it rises drastically. I believe this is due to increased irregularity in the hyperslabs, since I have observed that reading a dataset in full is much quicker than splitting it up.
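One thing worth checking on the MPI-IO side is whether collective buffering is actually engaged: `H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE)` only requests collective transfers, and the underlying MPI-IO layer decides how to aggregate them. With MPICH/ROMIO-derived implementations, this can be forced and tuned through hints, for example via a file pointed to by the `ROMIO_HINTS` environment variable. The values below are placeholders to experiment with, not tuned recommendations:

```text
romio_cb_read enable
cb_nodes 4
cb_buffer_size 16777216
```

The same hints can be passed programmatically through an `MPI_Info` object handed to `H5Pset_fapl_mpio`. Whether collective buffering helps depends on the access pattern; with highly irregular hyperslabs it can sometimes hurt, so measuring both settings is the only reliable guide.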

Does anyone have any insights on how to speed up the process? All help is much appreciated, and I’ll be happy to provide more details if need be. Thanks.

P.S. The reason I have not included any code is that it would take some time to collect all the partitions. I will get started on that soon, but for now I just wanted to get the ball rolling.

Have you looked into using CGNS? Most of our effort for improving parallel CFD I/O has been through improving the CGNS implementation.