I don’t think Darshan is available on Frontera, but I was able to use it on ALCF Theta (it doesn’t profile the HDF5 layer, but it does show MPI-IO – am I missing something important?).
I’m not really sure what to look for in the Darshan output. The only telling/surprising difference I spotted is that when chunking (1 MB chunks), MPIIO_BYTES_WRITTEN was roughly double the file size (whereas without chunking, MPIIO_BYTES_WRITTEN equaled the file size).
In each case, I’m running on 128 nodes with a 2305 x 2305 x 577 grid and 32 MPI ranks/node (4096 ranks total). An output file is one 32-bit float per cell, about 12 GB. Each MPI domain has nearly the same shape. This is Fortran, so x (the first dimension) is the memory-contiguous direction.
I’ve tried 4 different methods, on Lustre with 8 stripes (the stripe size doesn’t matter much, though when chunking I didn’t explore anything other than 1 MB stripes and 1 MB chunks).
- (1) Basic textbook HDF5 writing of a 3D array, no chunking (see the sketch after this list)
- (2) Basic 3D array with 1 MB chunks, as suggested by NERSC (e.g., with alignment padding).
- Slow, and the file size increases from 12 GB to 14 GB
- (3) Writing a 6D array so that each rank outputs a single contiguous section of the file
- (4) Aggregating data to one rank per line of domains in the x direction. I.e., with a domain decomp of 16x16x16, each row of 16 domains in x sends to 1 of those domains, and that domain writes. Thus only 256 ranks participate in HDF5 calls, and each of those ranks writes larger contiguous sections (but unlike (3), the ranks write multiple contiguous sections).
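For reference, (1) is just the standard collective hyperslab write. A minimal sketch (placeholder dimensions, offsets, and file name – not my production code) looks roughly like this; the commented lines indicate where the chunking/alignment settings for (2) would go:

```fortran
program write_field_3d
  ! Minimal sketch of method (1): one 3D dataset, collective hyperslab write.
  ! Dimensions, offsets, and the file name are placeholders, not production code.
  use mpi
  use hdf5
  implicit none

  integer :: ierr, mpierr, rank
  integer(hid_t) :: fapl, dxpl, file_id, dset_id, filespace, memspace
  integer(hsize_t) :: dims_glob(3), dims_loc(3), offset(3)
  real(kind=4), allocatable :: field(:,:,:)

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call h5open_f(ierr)

  dims_glob = (/ 2305_hsize_t, 2305_hsize_t, 577_hsize_t /)  ! global grid
  dims_loc  = (/ 288_hsize_t, 72_hsize_t, 72_hsize_t /)      ! placeholder local block
  offset    = (/ 0_hsize_t, 0_hsize_t, 0_hsize_t /)          ! rank-dependent in reality

  allocate(field(dims_loc(1), dims_loc(2), dims_loc(3)))
  field = real(rank, kind=4)                                 ! dummy data

  ! File access through MPI-IO
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, ierr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
  call h5fcreate_f("field.h5", H5F_ACC_TRUNC_F, file_id, ierr, access_prp=fapl)

  ! Method (2) would add, before h5dcreate_f, something like:
  !   call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
  !   call h5pset_chunk_f(dcpl, 3, chunk_dims, ierr)
  ! plus h5pset_alignment_f on the fapl.

  call h5screate_simple_f(3, dims_glob, filespace, ierr)
  call h5dcreate_f(file_id, "field", H5T_NATIVE_REAL, filespace, dset_id, ierr)

  ! Each rank selects its block of the file and writes collectively
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, dims_loc, ierr)
  call h5screate_simple_f(3, dims_loc, memspace, ierr)
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, ierr)
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, ierr)
  call h5dwrite_f(dset_id, H5T_NATIVE_REAL, field, dims_loc, ierr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl)

  call h5dclose_f(dset_id, ierr)
  call h5sclose_f(memspace, ierr)
  call h5sclose_f(filespace, ierr)
  call h5pclose_f(dxpl, ierr)
  call h5pclose_f(fapl, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
  call MPI_Finalize(mpierr)
end program write_field_3d
```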
Write times (for 12GB) on Frontera, domain decomp 8 x 32 x 8 = 4096 domains.
- (3) and (4) are about the same, about 1.2 s for the main dataset write
- That’s about 1.2 GB/s/OST (12 GB / 1.2 s ≈ 10 GB/s aggregate over 8 OSTs); the rated max is 3.8 GB/s/OST.
- (1) is substantially slower, about 3.8 s.
- However, if there’s only 1 domain along x, then the time drops to 1.2 s.
- (2) is roughly twice as slow as (1).
On ALCF Theta, the difference between (1) and (3)/(4) is smaller.
For a domain decomp of 16 x 16 x 16 = 4096 domains:
- (4) about 5.5 s (0.27 GB/s/OST; the rated max is 11 GB/s/OST ???)
- (3) about 5.8 s
- (1) about 10.1 s
- (2) about the same as (1)
The Darshan output for (1), (3), and (4) is exactly what you’d expect. For (1) and (3), it’s almost exactly the same except for timing – it shows each rank performing 1 aggregated write of about 3 MB. For (4), it shows 256 ranks each performing 1 aggregated write of about 48 MB.
The one surprising thing from Darshan is that (2) [chunked] writes almost double the amount, i.e., MPIIO_BYTES_WRITTEN is almost double. There are twice as many MPIIO_VIEWS, and 8192 writes of about 3 MB each.
This all seems consistent with the conclusion that (for balanced regular 3D grids):
(1) It’s worthwhile (likely a factor of about 2 or more in write speed) to make an effort so that ranks write larger contiguous sections of the file (e.g., by changing the domain decomp, the file layout, or by aggregating).
(2) Aggregating data can be done much faster than HDF5/MPI-IO does it. I’d be curious why this is (or whether there’s some setting that will make it faster), but I’d guess I’m taking advantage of things that MPI-IO can’t or doesn’t. For example, I know that my simulation data greatly exceeds the data for one field, so I don’t have to worry about memory when aggregating one field from 4096 ranks to 256 ranks.
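For what it’s worth, the aggregation in (4) isn’t doing anything clever. A minimal sketch of the gather step (illustrative names like row_index and aggregate_along_x, not my actual routines; it assumes equal-sized blocks along x) is:

```fortran
! Illustrative sketch of the gather step in method (4); names and the
! equal-block-size assumption are mine, not the production routine.
subroutine aggregate_along_x(field_loc, nloc, field_row, row_index, x_index, ierr)
  use mpi
  implicit none
  real(kind=4), intent(in)  :: field_loc(*)   ! this rank's block, flattened
  integer,      intent(in)  :: nloc           ! cells in this rank's block
  real(kind=4), intent(out) :: field_row(*)   ! gathered row (used on aggregator only)
  integer,      intent(in)  :: row_index      ! same for all ranks sharing a (y,z) position
  integer,      intent(in)  :: x_index        ! position of this rank along x
  integer,      intent(out) :: ierr

  integer :: row_comm, row_rank

  ! Ranks with the same row_index form one communicator; keying by x_index
  ! orders the gathered blocks along x.
  call MPI_Comm_split(MPI_COMM_WORLD, row_index, x_index, row_comm, ierr)
  call MPI_Comm_rank(row_comm, row_rank, ierr)

  ! Assumes every block in the row has the same nloc (the domains are nearly
  ! the same shape); otherwise MPI_Gatherv would be needed.
  call MPI_Gather(field_loc, nloc, MPI_REAL4, &
                  field_row, nloc, MPI_REAL4, 0, row_comm, ierr)

  ! Rank 0 of each row_comm (256 aggregators for a 16x16x16 decomp) then does
  ! the HDF5 hyperslab write; the gathered buffer arrives block-ordered, so it
  ! either gets repacked into (x,y,z) order or written with a matching
  ! memory-space selection.
  call MPI_Comm_free(row_comm, ierr)
end subroutine aggregate_along_x
```

The aggregators then write their multiple contiguous runs, which presumably is what MPI-IO collapses into the single ~48 MB write per rank that Darshan shows.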