Optimal writing of a regular field, spatial domain decomposition, Lustre

What is the fastest way to write a regular field from a spatially domain-decomposed simulation? What are the main considerations for fast writing?

E.g., a 3D array of 32-bit floats, representing the electron density in every cell of a regular 3D Cartesian grid, which is domain-decomposed across thousands of MPI ranks such that each rank “owns” an equal-sized (3D) rectangle. Does it matter if the domain at the “top” of the simulation in each direction has an extra row of values? Or if the filesystem is Lustre (writing different sections to different OSTs)?

I would have guessed that data could be communicated much faster through MPI than it could be written to disk, that collective parallel HDF5 would move data around so it ends up placed for the fastest disk writing, and that this communication would take negligible time compared with the disk I/O. In other words, a collective parallel write should get pretty close to the maximum I/O speed because the MPI communication is practically free. Is that naive? If so, is it because MPI communication is more complicated and time-consuming than I think, or because HDF5/MPI-IO doesn’t communicate to optimize the writing, or something else?

Thanks,
Greg.

The correct answer is “It depends.” I think this Introduction to Scientific I/O has some good material.

G.

Thanks! That link does cover the specific case, “Balanced 3D grids” – they suggest (besides using collective parallel writes; see the sketch after the list for (1) and (2)):
(1) chunking the dataset with chunks equal to (or a bit less than) the Lustre stripe size (e.g., presumably >= 1 MB),
(2) aligning objects on the stripe size (i.e., each chunk is padded out to the stripe size), and
(3) setting the B-tree size via H5Pset_istore_k(fcpl, (stripe_size - 4096)/96), where I think stripe_size is in bytes, though the HDF5 manual suggests the maximum possible here corresponds to about stripe_size = 3 MB.
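Just so we’re looking at the same thing, here is roughly what I understand (1) and (2) to mean in the C API; the stripe size and chunk dimensions below are made-up examples (64 x 64 x 64 32-bit floats is exactly 1 MiB), not recommendations:

hsize_t stripe = 1048576;                    /* 1 MiB Lustre stripe (example value) */
hsize_t chunk[3] = {64, 64, 64};             /* 64*64*64 floats = 1 MiB per chunk (example) */

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 3, chunk);                /* (1) chunk the dataset */

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_alignment(fapl, stripe, stripe);      /* (2) align objects >= one stripe on stripe boundaries */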

The example uses H5Pset_istore_k, which manipulates the half-rank of the chunk-index B-tree. A value of (stripe_size - 4096)/96 doesn’t make sense on unit grounds, because the half-rank is not measured in bytes. H5Pset_meta_block_size is a much better candidate.
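For what it’s worth, that one is a file-access property; a minimal sketch, with 1 MiB used only as an illustration of matching the stripe size:

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_meta_block_size(fapl, 1048576);   /* aggregate metadata into 1 MiB blocks (example value) */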

G.

Initial tests (TACC Frontera, 128 nodes, 4096 MPI ranks, 12 GB files, 8 Lustre stripes per file, 1 MB stripes) suggest that chunking in 1 MB chunks increases the H5Dwrite time slightly, and that the time spent in H5Dcreate, instead of being negligible, adds 20%–50% of the (H5D)write time; the file size also increases from 12 GB to 14 GB. Increasing H5Pset_meta_block_size and H5Pset_alignment did not seem to have a significant effect.

I’ve only explored a small parameter space (though it included the NERSC-recommended parameters), so I’m not sure whether I can conclude that this is not a promising avenue. And although I’m doing tests at 128 nodes to reduce costs, I really care about what happens on >= 1024 nodes.

On the other hand, changing the array shape from a regular 3D array indexed by cell (i,j,k) to a 6D array, where the first 3 indices are the domain index and the last 3 are the domain-local cell index (so that each MPI rank writes one contiguous section of the file), sped up writing by a factor of 4 (without chunking).

Does the fact that chunking slowed things down, but reshaping the array gave a dramatic speed-up, tell me anything about what my I/O bottlenecks are/were?

Greg.

Greg, I’d suggest establishing a baseline for your app before getting potentially confused by too many mixed messages. Do you have a Darshan profile for your app? That’ll tell you whether there is any low-hanging fruit and what the long pole in the tent is. I don’t know anything about TACC Frontera and what its nodes and I/O subsystem are capable of, but I’d guess that 32 ranks per node or fewer will saturate the available I/O bandwidth. Again, a baseline of what your system can deliver in reality is a good number to have.

G.

I don’t think Darshan is available on Frontera, but I was able to use it on ALCF Theta (it doesn’t profile HDF5, but does show MPI-IO – am I missing something important?).

I’m not really sure what to look at in the Darshan output. The only telling/surprising difference I spotted is that when chunking (1 MB chunks), MPIIO_BYTES_WRITTEN was roughly double the file size (whereas when not chunking, MPIIO_BYTES_WRITTEN equaled the file size).

In each case, I’m running on 128 nodes, with 2305 x 2305 x 577 grid cells and 32 MPI ranks/node (4096 ranks total). An output file is one 32-bit float per cell, about 12 GB. Each MPI domain has nearly the same shape. This is Fortran, so x (the first dimension) is the memory-contiguous direction.

I’ve tried 4 different methods, on Lustre with 8 stripes (the stripe size doesn’t matter much, but when chunking I didn’t explore anything other than 1 MB stripes and 1 MB chunks).

  • (1) Basic textbook hdf5 writing of a 3D array, no chunking
    • medium speed
  • (2) Basic 3D array with 1MB chunks, as suggested by NERSC (e.g., with alignment padding).
    • slow speed, and file size increases from 12GB to 14GB
  • (3) Writing a 6D array so that each rank outputs a single contiguous section of the file (rough sketch after this list)
    • fast
  • (4) Aggregating data to one rank per line of domains in the x direction. I.e., with a domain decomp of 16x16x16, each row of 16 domains in x sends to 1 of those domains, and that domain writes. Thus only 256 ranks participate in HDF5 calls, and each of those ranks writes larger contiguous sections (but unlike (3), the ranks write multiple contiguous sections).
    • as fast as (3)
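For concreteness, here is roughly what I mean by (3), sketched in the C API with hypothetical names and a hard-coded file name (the real code is Fortran and gets the sizes from the decomposition). The dataset is declared 6D, domain index first and domain-local cell index last, so each rank’s hyperslab is a single contiguous byte range in the file:

#include <hdf5.h>
#include <mpi.h>

/* pz,py,px: domains per direction; nz,ny,nx: cells per domain;
   dz,dy,dx: this rank's domain index; local: this rank's nz*ny*nx cells. */
void write_field_6d(MPI_Comm comm, const float *local,
                    hsize_t pz, hsize_t py, hsize_t px,
                    hsize_t nz, hsize_t ny, hsize_t nx,
                    hsize_t dz, hsize_t dy, hsize_t dx)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("density.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* 6D file dataspace: domain index varies slowest, local cell index fastest. */
    hsize_t fdims[6] = {pz, py, px, nz, ny, nx};
    hid_t fspace = H5Screate_simple(6, fdims, NULL);
    hid_t dset = H5Dcreate(file, "ne", H5T_NATIVE_FLOAT, fspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects exactly its own block, which is contiguous in the file. */
    hsize_t start[6] = {dz, dy, dx, 0, 0, 0};
    hsize_t count[6] = {1, 1, 1, nz, ny, nx};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(6, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
}

Method (1) differs only in that the file dataspace is the 3D global grid and the hyperslab is the rank’s 3D sub-block at its offset, so each rank’s selection is scattered through the file rather than contiguous.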

Write times (for 12GB) on Frontera, domain decomp 8 x 32 x 8 = 4096 domains.

  • (3) and (4) are about the same, about 1.2 s for the main dataset write
    • This is 1.2 GB/s/OST and the rated max is 3.8 GB/s/OST.
  • (1) is substantially slower, about 3.8s.
    • However, if there’s only 1 domain along x, then the time drops to 1.2s.
  • (2) is something like twice as slow as (1).

On ALCF Theta, the difference between (1) and (3)/(4) is less.
For domain decomp 16 x 16 x 16 = 4096 domains

  • (4) about 5.5s (0.27 GB/s/OST; rated max is 11GB/s/OST ???)
  • (3) about 5.8s
  • (1) about 10.1s
  • (2) about the same as (1)

The Darshan output for (1), (3), and (4) is exactly what you’d expect. For (1) and (3), it’s almost exactly the same except for timing – it shows each rank performing 1 aggregated write of about 3 MB. For (4), it shows 256 ranks each performing 1 aggregated write of about 48 MB.

The one surprising thing from Darshan is that (2) [chunked] writes almost double the amount. I.e., MPIIO_BYTES_WRITTEN is almost double, there are twice as many MPIIO_VIEWS, and there are 8192 writes of size about 3 MB.

This all seems consistent with the conclusion that (for balanced regular 3D grids):

(1) It’s worthwhile (likely a factor of about 2 or more in write speed) to make an effort so that ranks write larger contiguous sections of the file (e.g., by changing the domain decomposition, changing the file layout, or aggregating).

(2) Aggregating the data myself can be done much faster than HDF5/MPI-IO does it. I’d be curious why this is (or whether there’s some setting that will make it faster), but I’d guess I’m taking advantage of things that MPI-IO can’t or doesn’t. For example, I know that the simulation’s memory footprint greatly exceeds the data for one field, so I don’t have to worry about running out of memory when aggregating one field from 4096 ranks onto 256 ranks.
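For reference, the aggregation in (4) is nothing fancy; it is roughly the following (a simplified sketch with made-up names, not the actual code – in particular, repacking the gathered blocks into file order is omitted):

#include <mpi.h>
#include <stdlib.h>

/* Gather the local blocks of one x-row of domains onto the row's first rank.
   row_id identifies the row, rank_in_row orders ranks within it, and nlocal
   is the (equal) number of local cells per rank.  Returns the aggregate
   buffer on the writing rank and NULL elsewhere. */
float *gather_row(MPI_Comm world, int row_id, int rank_in_row,
                  const float *local, int nlocal, int *naggr)
{
    MPI_Comm row;                           /* one communicator per x-row of domains */
    MPI_Comm_split(world, row_id, rank_in_row, &row);

    int nranks;
    MPI_Comm_size(row, &nranks);

    float *buf = NULL;
    if (rank_in_row == 0)                   /* only the writer allocates the big buffer */
        buf = malloc((size_t)nranks * nlocal * sizeof(float));

    /* Equal-sized domains, so a plain Gather is enough (Gatherv otherwise). */
    MPI_Gather(local, nlocal, MPI_FLOAT, buf, nlocal, MPI_FLOAT, 0, row);

    MPI_Comm_free(&row);
    *naggr = nranks * nlocal;
    return buf;                             /* the writer then makes the HDF5 calls */
}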

Greg.

You should use a dataset creation property list and set the fill time to “never”:

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
...
H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);   /* don't write fill values when chunks are allocated */
...
hid_t dset = H5Dcreate(..., dcpl, ...);

Otherwise, you’ll write the data twice: fill values when the chunks are allocated, and the “real” values on H5Dwrite. That is consistent with the doubled MPIIO_BYTES_WRITTEN you saw for the chunked case.
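Relatedly, when the file space gets allocated is controlled on the same dcpl; a minimal illustration (the enum values are from the C API; which setting is best will depend on your write pattern):

/* Choose when file space for the dataset is allocated; with the fill time
   set to "never" above, no fill values are written in any case. */
H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);   /* or H5D_ALLOC_TIME_LATE / H5D_ALLOC_TIME_INCR */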

G.

See also the section Allocation of Space in the File in the User Guide. G.

Thanks – that is very helpful to know.

Greg.