Efficient way to write large 4D datasets?

My recent experience (similar problem, except na=1)

suggests that, indeed, HDF5/MPI-IO performs poorly when it has to interleave data from different ranks (because nx/npx is small), and performs better when each rank writes larger contiguous sections of the file.

My experience is limited to a few machines and this specific case, and the experts seem much more cautious about the advantages of writing larger contiguous sections. FWIW, I find that my code’s MPI communication is very fast compared with writing: I can aggregate data in x (i.e., each row of npx ranks sends all its data to one of those ranks) in a time negligible compared with the file write time. The increased contiguity of the aggregated data then makes writing faster.
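To illustrate why aggregating in x helps (a toy sketch, not my actual code): the helper below counts how many contiguous runs each writer’s data forms in the file’s linear order, for a 2D array of shape (ny, nx) stored with x varying fastest, decomposed on a hypothetical npy × npx process grid. Aggregating each row of npx ranks onto one writer collapses each writer’s data to a single run.

```python
import numpy as np

# Toy problem: (ny, nx) array, x fastest in file order, npy*npx ranks.
ny, nx, npy, npx = 8, 32, 2, 4

j, i = np.mgrid[0:ny, 0:nx]
owner = (j // (ny // npy)) * npx + (i // (nx // npx))  # rank owning each element

def runs_per_writer(owner_map):
    """Number of contiguous runs each writer has in flattened file order."""
    flat = owner_map.ravel()                     # file (row-major) order
    starts = np.flatnonzero(np.diff(flat) != 0)  # boundaries between writers
    run_owners = np.concatenate(([flat[0]], flat[starts + 1]))
    return np.bincount(run_owners)

print(runs_per_writer(owner))         # no aggregation: ny/npy = 4 runs per rank
print(runs_per_writer(owner // npx))  # aggregate each row of npx ranks: 1 run each
```

With no aggregation each rank’s slab is chopped into ny/npy pieces; after aggregation each writer’s data is one contiguous section, which is the case the file system handles well.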

Memory is not an issue for my cases; I imagine HDF5/MPI-IO has to be more conservative and can’t blithely aggregate onto one rank 32 times its own data (e.g., if npx=32).

I don’t know what HDF5/MPI-IO/Lustre are doing under the hood in terms of getting data to aggregating nodes, etc. I’d be curious whether there’s an advantage to determining the aggregating nodes that will write directly and communicating to them the data they need to write.

I haven’t been able to get good performance with chunking (but haven’t tried very hard); however, the advice to use H5D_FILL_TIME_NEVER did help significantly.
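In case it’s useful, here’s what that fill-time setting looks like from Python via h5py’s low-level API (a serial sketch; the file name and shapes are made up, and in a parallel run you’d set the same property on the dataset-creation property list of the parallel file):

```python
import os
import tempfile

import h5py
import numpy as np

# Create a dataset whose dataset-creation property list has fill time
# set to "never", so HDF5 does not pre-write fill values to the file.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    space = h5py.h5s.create_simple((64, 64))
    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)  # the key setting
    h5py.h5d.create(f.id, b"data", h5py.h5t.NATIVE_DOUBLE, space, dcpl)
    f["data"][...] = np.arange(64 * 64, dtype="f8").reshape(64, 64)
```

(In C the equivalent is H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER) on the dataset-creation property list.)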

From my limited experience, I would guess your solution with aggregation and writing npz files would be as fast as possible (I’d certainly be curious to know if that turns out to be the case).

There are also some easier one-file alternatives:

Simply transposing the shape to [Fortran order] (na, nx, ny, nz) would make each rank’s data more contiguous by a factor of na (if all na are written at once).

Writing npx different files (but with no aggregation) would result in each rank writing larger contiguous sections by a factor of ny/npy.

Or: if your subdomains are nearly equal in shape, a 7D array is easy to write and ensures that each rank writes one contiguous section: instead of shape (nx, ny, nz, na), use shape (nx/npx, ny/npy, nz/npz, na, npx, npy, npz). I’ve tried this (6D, with na=1) and it works well, though some care is needed if the domains are not exactly the same size; it’s always at least as fast as the aggregating scheme (but often not significantly faster; I use it for checkpointing, but not for data to be analyzed).
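The index bookkeeping for that packed layout can be sketched in numpy (C order here for concreteness; the shapes I quoted above are the Fortran-order mirror of this, and all sizes below are invented). Each rank (px, py, pz) owns one contiguous slab, and a reader can undo the packing for analysis:

```python
import numpy as np

# Toy sizes; the process grid divides each dimension evenly here.
nx, ny, nz, na = 8, 8, 4, 3
npx, npy, npz = 2, 2, 2
lx, ly, lz = nx // npx, ny // npy, nz // npz

global_arr = np.arange(nx * ny * nz * na, dtype="f8").reshape(nx, ny, nz, na)

# What lands in the file: shape (npx, npy, npz, lx, ly, lz, na), so rank
# (px, py, pz) writes the single contiguous slab packed[px, py, pz].
packed = (global_arr
          .reshape(npx, lx, npy, ly, npz, lz, na)
          .transpose(0, 2, 4, 1, 3, 5, 6)
          .copy())                      # make each rank's slab contiguous

# A reader undoes the packing to recover the global array:
restored = packed.transpose(0, 3, 1, 4, 2, 5, 6).reshape(nx, ny, nz, na)
assert np.array_equal(restored, global_arr)

# Rank (1, 0, 1)'s slab really is its subdomain of the global array:
assert np.array_equal(packed[1, 0, 1],
                      global_arr[lx:2*lx, 0:ly, lz:2*lz, :])
```

In the actual parallel write, of course, no rank ever holds global_arr; each rank just writes its local (lx, ly, lz, na) block to the hyperslab packed[px, py, pz].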

However, neither aggregation nor the 6D array has reached the theoretical maximum write speed; I think there’s room for improvement.

Greg.