Efficient way to write large 4D datasets?

Hello,

I am designing a new IO module for an existing CFD & MHD code (the Pencil Code). We are in Fortran, so the first dimension is the fastest-varying (contiguous in memory) in our data arrays.

Our in-memory data representation is a large 4D array (size: nx×ny×nz×na), where the first three dimensions are the spatial extent of each processor’s subdomain (nx×ny×nz) and the fourth dimension (na) is an index over physical quantities, which can be any number from 4 to 20, depending on the physical setup. A setup with npx×npy×npz processors along the three spatial directions then has a global 4D size of ngx×ngy×ngz×na, where ngx = nx×npx (and analogously for ngy and ngz).
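
For reference, the local layout is essentially the following (illustrative names, not the actual identifiers from our code):

  real, dimension(nx, ny, nz, na) :: f   ! local subdomain; x varies fastest in memory
  ! global (file) view: ngx × ngy × ngz × na, with ngx = nx*npx, ngy = ny*npy, ngz = nz*npz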

We already have one parallel HDF5 IO module, which writes large monolithic files in which each physical quantity is stored as a separate 3D HDF5 dataset (an array). Unfortunately, this is slow for large setups (> 1024 processors), presumably because combining the data into the file representation requires copying many individual x-aligned lines out of the in-memory representation…
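
In essence, that module performs one collective hyperslab write per quantity, along the following lines (heavily simplified sketch with placeholder names such as ipx/ipy/ipz for the 0-based processor coordinates, varname, file_id and h5_plist for the collective transfer property list; not our actual routines). Every rank contributes an nx×ny×nz block to a global ngx×ngy×ngz dataset, which is where the copying of x-aligned lines comes from:

  integer(HSIZE_T) :: gdims(3), ldims(3), offset(3)
  integer(HID_T)   :: filespace, memspace, dset_id
  integer          :: h5_err, ia

  gdims  = (/ ngx, ngy, ngz /)             ! global dataset size
  ldims  = (/ nx,  ny,  nz  /)             ! this rank's block
  offset = (/ ipx*nx, ipy*ny, ipz*nz /)    ! block position in the global grid

  do ia = 1, na
    call h5screate_simple_f(3, gdims, filespace, h5_err)
    call h5dcreate_f(file_id, trim(varname(ia)), H5T_NATIVE_REAL, filespace, dset_id, h5_err)
    call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, ldims, h5_err)
    call h5screate_simple_f(3, ldims, memspace, h5_err)
    call h5dwrite_f(dset_id, H5T_NATIVE_REAL, f(:,:,:,ia), ldims, h5_err, &
                    mem_space_id=memspace, file_space_id=filespace, xfer_prp=h5_plist)
    call h5dclose_f(dset_id, h5_err)
    call h5sclose_f(memspace, h5_err)
    call h5sclose_f(filespace, h5_err)
  end do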

I am now thinking about possible new IO strategies to make our code’s write routines faster. In particular, I am considering collecting data along the x and y directions and creating npz HDF5 files, so that output can be faster. Furthermore, I am wondering whether writing each processor’s whole block of data in its memory-aligned order into the HDF5 files would speed up our output. We would then have a different representation in the npz separate files, each containing one 4D array of size ngx×ngy×nz×na.

What are your thoughts on this? Would this approach significantly reduce our write times?
Otherwise, which other IO strategies would you suggest?

Regarding chunking: so far we do not use chunking, because we also need to store “ghost cells”, which means the subdomains of the outer/boundary processors differ in size from those of the inner processors. But I am also considering separating the ghost cells into other datasets, so that we can start using chunking (and later also compression).

Thank you and best greetings,
Philippe.

The data partitioning described here is often referred to as the
checkerboard pattern, which performs well if the MPI collective I/O mode
is enabled. Just being curious: is collective I/O mode enabled in your
case? It can be enabled through the API H5Pset_dxpl_mpio.

In addition, the number of datasets to be written appears to be na.
HDF5 1.14.0 introduces a new feature for writing multiple datasets in
a single call, H5Dwrite_multi, which may also boost your write
performance.

@wkliao
Yes, we have already activated H5Pset_dxpl_mpio:

call h5pset_dxpl_mpio_f (h5_plist, H5FD_MPIO_COLLECTIVE_F, h5_err)

Here you can have a look at our HDF5 IO code:

I am not sure whether we can rely on HDF5 1.14.0+ features yet, because many computing centers still provide something like the 1.13 versions.

H5Dwrite_multi […] may also boost your write performance.

Just being curious: How would this routine solve the problem that our data is memory-aligned on each processor’s subdomain but needs to be rearranged for the monolithic file representation?

Without the multi-dataset APIs, you have to make na separate H5Dwrite calls,
one per dataset (one after another). H5Dwrite_multi can aggregate those write
requests into a single MPI-IO call (or fewer calls) on each process,
which can improve performance.
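
For illustration only, a minimal sketch of how such an aggregated write could look from Fortran, assuming the h5dwrite_multi_f wrapper added in HDF5 1.14 (the exact signature should be checked against the 1.14 reference manual; the helper arrays are hypothetical):

  use iso_c_binding, only: c_ptr, c_loc
  integer(SIZE_T) :: n_dsets
  integer(HID_T)  :: dset_ids(na), mem_types(na), mem_spaces(na), file_spaces(na)
  type(c_ptr)     :: bufs(na)
  integer         :: h5_err, ia

  n_dsets = na
  do ia = 1, na
    bufs(ia)      = c_loc(f(1,1,1,ia))   ! requires f to have the TARGET attribute
    mem_types(ia) = H5T_NATIVE_REAL
    ! dset_ids(ia), mem_spaces(ia), file_spaces(ia) are set up exactly as for
    ! the individual per-dataset writes
  end do
  call h5dwrite_multi_f(n_dsets, dset_ids, mem_types, mem_spaces, file_spaces, &
                        bufs, h5_err, xfer_prp=h5_plist)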

Two of the factors affecting parallel I/O performance are:

  1. Is a parallel file system used? What are the file striping settings?
  2. What are the sizes of ngx, ngy, and ngz?

How slow was the write time? What is the expected write performance
in your system?

My recent experience (similar problem, except na=1) suggests that, indeed, hdf5/mpiio performs poorly when faced with interleaving data from different ranks when/because nx/npx is small. It performs better when each rank writes larger contiguous sections of the file.

My experience is limited to a few machines and this specific case, and experts seem to be much more cautious about the advantages of writing larger contiguous sections. FWIW, I find that my code’s MPI communication is very fast compared with writing, and I can aggregate data in x (i.e., each row of npx ranks sends all its data to one of those ranks) in a time negligible compared with the file write time. The increased contiguity of the aggregated data then makes writing faster.

Memory is not an issue for my cases; I imagine hdf5/mpiio has to be more conservative and can’t blithely aggregate on one rank 32 times its own data (e.g., if npx=32).

I don’t know what hdf5/mpiio/lustre are doing under the hood in terms of getting data to aggregating nodes, etc. I’d be curious if there’s an advantage to determining the aggregating nodes that will write directly and communicating to them the data they need to write.

I haven’t been able to get good performance with chunking (but haven’t tried very hard); however, the advice to use H5D_FILL_TIME_NEVER did help significantly.

From my limited experience, I would guess your solution with aggregation and writing npz files would be as fast as possible (I’d certainly be curious to know if that turns out to be the case).

There are also some easier one-file alternatives:

Simply transposing the shape to [fortran order] (na, nx, ny, nz) would make each rank’s data more contiguous by a factor of na (if all na are written at once).

Writing npx different files (but with no aggregation) would result in each rank writing larger contiguous sections by a factor of ny/npy.

Or: if your domains are nearly equal in shape, a 7D array is easy to write, ensuring that each rank writes one contiguous section: instead of shape (nx, ny, nz, na), use shape (nx/npx, ny/npy, nz/npz, na, npx, npy, npz). I’ve tried this (6D, with na=1) and it works well, though some care is needed if the domains are not exactly the same size; it’s always at least as fast as the aggregating scheme (but often not significantly faster; I use it for checkpointing, not for data to be analyzed).
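
To illustrate the 7D variant, a rough sketch under the assumption of exactly equal subdomain sizes (hypothetical names; in this thread’s notation each rank’s block is nx×ny×nz×na and ipx/ipy/ipz are its 0-based coordinates in the processor grid). Because Fortran is column-major, each rank’s selection is one contiguous section of the file:

  integer(HSIZE_T) :: gdims(7), ldims(7), offset(7)
  integer(HID_T)   :: filespace, memspace, dset_id
  integer          :: h5_err

  gdims  = (/ nx, ny, nz, na, npx, npy, npz /)    ! one global 7D dataset
  ldims  = (/ nx, ny, nz, na, 1,   1,   1   /)    ! each rank owns one cell of the last three dims
  offset = (/ 0,  0,  0,  0,  ipx, ipy, ipz /)

  call h5screate_simple_f(7, gdims, filespace, h5_err)
  call h5dcreate_f(file_id, 'f', H5T_NATIVE_REAL, filespace, dset_id, h5_err)
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, ldims, h5_err)
  call h5screate_simple_f(7, ldims, memspace, h5_err)
  ! f has shape (nx,ny,nz,na); its nx*ny*nz*na elements fill the selected block
  call h5dwrite_f(dset_id, H5T_NATIVE_REAL, f, ldims, h5_err, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=h5_plist)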

However, neither aggregation nor 6D array has reached the theoretical maximum write speed; I think there’s room for improvement.

Greg.

@wkliao
First, our large production runs are almost exclusively on HPC systems and supercomputers, so: 1) Yes, we are running on parallel file systems, such as Lustre, GPFS, or BeeGFS. But as we run at different computing centers, we cannot predict the striping settings, and in some cases we cannot even find them out ourselves.
2) Typical sizes of our global grid are 512^3 at minimum, typically 1024^3, but they could also be 4096^3.

We see that our old IO strategy (distributed raw binary writes from each processor into npx×npy×npz separate files) is about 4 to 10 times faster. I strongly suspect the rearrangement of the data from the in-memory representation to the monolithic file representation as one bottleneck. Can someone confirm that this slows down the IO?

Or does HDF5 internally write each processor’s data to separate regions, so that no rearrangement of the data is done?

I understand that H5Dwrite_multi reduces the MPI overhead by reducing the number of communications, though I doubt it would avoid the rearrangement of the data while writing to monolithic files. Therefore I am wondering whether someone agrees that storing the data in a different format (npz blocks of ngx×ngy×nz×na) would give better IO performance, as this would not require rearranging the data out of the per-processor memory alignment.

MPI-IO uses a strategy called two-phase I/O. For writes, the first phase,
referred to as the communication phase, re-arranges write data from all
processes to a subset of processes, referred to as I/O aggregators.
The 2nd phase is the I/O phase where the aggregators write to the file
system. The communication phase can be expensive.

Your proposed approach of storing npz blocks of ngx×ngy×nz×na will still
incur such a communication cost (though perhaps less), because the ngx and ngy
dimensions are partitioned among processes.

On Lustre, you can increase the file striping count of the output folder
(using the lfs setstripe command). The default setting may be just 1. The command
‘lfs getstripe filename’ displays the current settings.
On GPFS, the striping settings are not user configurable.

~The grid sizes in your case are large enough to get benefits from the
use of H5Dwrite_multi.~
Correction. I just realized that the write amount per process per dataset,
512^3 × 4 bytes / 1024 processes = 0.5 MB for the 1024-process case, is small.
So, H5Dwrite_multi should be able to help.

When I do the “communication” phase explicitly (before h5dwrite) to aggregate contiguous data, it is very fast (e.g., < 0.1 s, while the h5dwrite takes several seconds or more). Why is it slow for hdf5/mpiio? Is it more conservative about memory? Or perhaps it uses a general algorithm that cannot take advantage of the simple aggregation strategies available for a regular (checkerboard) decomposition of a regular array? E.g., does it try to do an MPI all-to-all?

So, summarizing, I see two possible ways:

  1. Storing the data in a 5D in-file representation of nx×ny×nz×na×np, where np is the number of processors = npx×npy×npz. This would allow me to store the data without any rearrangement from the memory layout to the file layout. It has the disadvantage that the resulting file layout depends on the processor subdomain layout! And I cannot change the number of processors, which I can do with my existing monolithic HDF5 format.

  2. Using H5Dwrite_multi, but I still do not understand how I could avoid the interleaving (rearranging of data) while storing the data into the file. Is there a way to store this data in 4D chunks (nx×ny×nz×na), so that, on reading the file, each processor would automatically read its own portion of the data, regardless of whether I change the number of processors?

Thanks for your comments so far - and best greetings!

The communication phase in MPI-IO rearranges the write requests so that they
are aligned with file striping boundaries. This is critical to prevent file
system lock conflicts. If your own rearrangement is not aligned in that way, writing
to a parallel file system can be costly.

And I can not change the number of processors, which I can using my existing monolithic HDF5 format.

Sadly, that is correct.

Using H5Dwrite_multi, but I still do not …

H5Dwrite_multi does not change the data layout in the file. Files created by H5Dwrite_multi or H5Dwrite are exactly the same.

HDF5 chunking can help if the local nx, ny, nz are the same on all
processes (set the chunk size to nx×ny×nz), which could reduce the cost
of the MPI-IO communication phase. Please give it a try.
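
A minimal sketch of that suggestion, combined with the H5D_FILL_TIME_NEVER setting mentioned earlier in this thread (illustrative names only; this assumes all subdomains, including the boundary ones, share the same nx, ny, nz):

  integer(HID_T)   :: dcpl, dset_id   ! filespace, file_id, varname as in the per-dataset writes
  integer(HSIZE_T) :: chunk(3)
  integer          :: h5_err

  chunk = (/ nx, ny, nz /)                                       ! one chunk per process subdomain

  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, h5_err)
  call h5pset_chunk_f(dcpl, 3, chunk, h5_err)
  call h5pset_fill_time_f(dcpl, H5D_FILL_TIME_NEVER_F, h5_err)   ! do not write fill values

  ! pass dcpl as the dataset creation property list when creating each 3D dataset:
  call h5dcreate_f(file_id, trim(varname(ia)), H5T_NATIVE_REAL, filespace, dset_id, h5_err, dcpl_id=dcpl)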

I thought it was easily worth the time to code up the aggregation (in x) via MPI before calling h5dwrite. That way I got the performance increase, but the file layout remained a standard array, independent of the domain decomposition.
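
In case it helps, a rough sketch of that aggregation step, assuming standard MPI, single-precision data, and 0-based processor coordinates ipx/ipy/ipz (all names hypothetical):

  use mpi
  real, allocatable :: fbig(:,:,:,:,:)   ! (nx, ny, nz, na, npx): one slot per rank of the x-row
  integer :: xcomm, xrank, color, mpi_err

  ! one communicator per x-row of the processor grid, ordered by ipx
  color = ipy + npy*ipz
  call MPI_Comm_split(MPI_COMM_WORLD, color, ipx, xcomm, mpi_err)
  call MPI_Comm_rank(xcomm, xrank, mpi_err)

  if (xrank == 0) then
    allocate(fbig(nx, ny, nz, na, npx))
  else
    allocate(fbig(1, 1, 1, 1, 1))        ! receive buffer is ignored on non-root ranks
  end if
  call MPI_Gather(f, nx*ny*nz*na, MPI_REAL, fbig, nx*ny*nz*na, MPI_REAL, 0, xcomm, mpi_err)

  ! xrank 0 now holds the whole x-row. It can reorder fbig into an (ngx, ny, nz, na)
  ! buffer (or describe the layout with a suitable memory dataspace) and perform the
  ! h5dwrite call alone, so each writer owns a contiguous ngx-long extent in x.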

Alternatively, you can switch to using CGNS, which can use multi-datasets, other HDF5 features, and HDF5 optimizations. One such HDF5 feature is subfiling, which has shown exceptional performance with CGNS at extreme scales, especially if node-local storage is available.

I would advise against baking in the number of ranks into the I/O schema.

You could also try/experiment using the log VOL, https://github.com/DataLib-ECP/vol-log-based, which could alleviate the memory layout issues.

Also, HDF5 is straightforward to install; don’t let that stop you from using 1.14.

Is there any consideration of optimizing this within hdf5? It seems like a very common use case.

@brtnfld
Thanks for your suggestions. I am reluctant to introduce yet another abstraction layer into our IO scheme. How would CGNS solve our problem of interleaved data that needs to be rearranged while writing? And if CGNS does solve the problem using the HDF5 library, then I strongly prefer to implement that solution directly in our HDF5-IO module. (The same argument holds for the VOL.)

If you would like to store data in the canonical-order layout, then the
communication phase in MPI-IO is unavoidable. Optimizations
will have to happen inside MPI-IO. One such optimization is TAM,
but its implementation has not been incorporated into ROMIO.

If you were OK with storing data in the log layout, then the Log VOL
mentioned by @brtnfld can be used without changing the application
code. However, directly reading data from a file in the log layout can
be expensive. The Log VOL also provides a file-conversion utility for
converting the log layout to the canonical order.

@wernerg @brtnfld

It seems like a very common use case.

I totally agree. My preferred solution would be an improved “H5Dwrite_multi_interleaved” that stores the data internally in a processor-aligned format (chunks of exactly the processor’s in-memory layout) but presents the data to the outside world simply as one large global data array. On reading, the HDF5 library would then take care of reading the requested portions from the chunks and combining them in the read buffer. This would improve the writing speed and the memory consumption (which is crucial), while reading might take a bit longer but would remain as easy as possible.

PS: Please note that, while the simulation runs, writing is usually done much more often than reading! That is why the writing speed is so important here.

I totally agree. My preferred solution …

Isn’t this the HDF5 chunking feature?

@wkliao

Isn’t this the HDF5 chunking feature?

I think not. With chunking I would not solve the problem of the interleaved data that needs to be rearranged on writing; I would just declare that my data can be divided into evenly-sized blocks.

Yes, we did some prototyping for a new feature to address the issue in the context of multi-dataset and chunked datasets, but it has not been worked on recently.

sent3_hacc_io_patterns.pdf (180.7 KB)