Despite having run many very large PIC simulations with parallel collective HDF5 output, I sometimes see crashes for large simulations on some supercomputers (and on smaller systems).
In general, I'd be glad of advice on (and requirements for) parallel writing of large datasets, where each MPI rank writes a separate part of one file, in a way that scales to hundreds of thousands of MPI ranks. At the moment performance is not an issue; the problem is crashing, e.g., when a simulation is weak-scaled to a larger size.
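For context, the writes use the standard collective MPI-IO pattern; a rough sketch (with placeholder variable names, not the exact code from my simulations) is:

! Open the file with the MPI-IO file driver on MPI_COMM_WORLD,
! then write each rank's hyperslab with a collective transfer.
CALL h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, error)
CALL h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, error)
CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = fapl_id)
! ... create the dataspace/dataset and select each rank's hyperslab ...
CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
CALL h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_COLLECTIVE_F, error)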
In trying to track down and narrow the problem, I've run into failures on much smaller datasets that I'd like to understand, possibly related to 32-bit integers or addresses (e.g., problems occur for datasets around 4 GB).
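To make the suspected 4 GB boundary concrete: with the modified dimensions shown below, the total byte count lands just past the range of a signed 32-bit integer. A quick standalone check (my own illustration, not part of the example program):

PROGRAM check_32bit_overflow
  USE ISO_FORTRAN_ENV, ONLY : INT32, INT64
  IMPLICIT NONE
  ! Same dimensions as in the modified example below, 4 bytes per element.
  INTEGER(INT64), PARAMETER :: M = 7_INT64
  INTEGER(INT64), PARAMETER :: N = 104_INT64 * 960_INT64 * 1536_INT64
  INTEGER(INT64) :: total_bytes
  total_bytes = M * N * 4_INT64
  PRINT *, 'total bytes   =', total_bytes        ! 4293918720 (~4.29 GB)
  PRINT *, 'HUGE(1_INT32) =', HUGE(1_INT32)      ! 2147483647 (~2.15 GB)
  IF (total_bytes > INT(HUGE(1_INT32), INT64)) THEN
     PRINT *, 'a 32-bit byte count would overflow for this dataset'
  END IF
END PROGRAM check_32bit_overflow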
I've been running a test case, a very slightly modified HDF5 example program (described more below) from:
On different machines, it succeeds or fails in different ways. E.g., when running 2 ranks on 2 nodes (1 MPI rank per node):
On ALCF Theta (Intel compilers, cray-mpich 7.7.14, cray-hdf5 1.10.6.1, Lustre), it writes a 100 GB dataset with no problem (even using 1 Lustre stripe for the file). Hooray!
On TACC Frontera (intel 19.1.1, impi 19.0.9, hdf5 1.10.4, Lustre), it writes a 5 GB dataset with no problem; however, when writing an 11 GB dataset it crashes if I use 1 stripe with a 1 MB stripe size for the file, but succeeds with either 2 stripes (1 MB size) or 1 stripe (2 MB size). I have no idea why striping affects whether it crashes (the striping commands are sketched after this list).
On our local small cluster (gfortran/mpich/hdf5 1.10.5), TACC Stampede2 KNL (intel/impi 18.0.2, hdf5 1.10.4, Lustre), and Cyfronet Prometheus (hdf5 1.12.0), the program segfaults when the entire dataset is larger than about 4 GB: it succeeds for 3.998 GB but fails for 4.08 GB. [On TACC Stampede2/KNL I additionally modified the program to use chunked storage with 0.13 GB chunks (sketched below, after the code changes) and successfully wrote a 4.08 GB file, but still failed to write a 7.998 GB file.]
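(For concreteness, the striping in the Frontera tests was set per directory with the standard lfs tool, along the lines below; the directory name is a placeholder.)

lfs setstripe -c 1 -S 1m testdir   # 1 stripe, 1 MB stripe size: 11 GB write crashes
lfs setstripe -c 2 -S 1m testdir   # 2 stripes, 1 MB stripe size: succeeds
lfs setstripe -c 1 -S 2m testdir   # 1 stripe, 2 MB stripe size: succeeds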
For these tests I modified the Fortran 90 HDF5 example program from
(scroll down to find hyperslab_by_col.f90)
changing only the dataset-dimension declarations to test with larger arrays. The original two lines

INTEGER(HSIZE_T), DIMENSION(2) :: dimsf = (/5,8/) ! Dataset dimensions.
INTEGER(HSIZE_T), DIMENSION(2) :: dimsfi = (/5,8/)

became

INTEGER(HSIZE_T), PARAMETER :: M = 7
INTEGER(HSIZE_T), PARAMETER :: N = 104 * INT(960*1536, 8)
INTEGER(HSIZE_T), DIMENSION(2) :: dimsf = (/M,N/) ! Dataset dimensions.
INTEGER(HSIZE_T), DIMENSION(2) :: dimsfi = (/M,N/)
The full array has M x N elements, each a 32-bit integer, so with the values above it holds 7 x 153,354,240 ≈ 1.07 x 10^9 elements, about 4.29 GB in total; half the array is written by each MPI rank. The strange shape has historical roots in the dataset of the original crashing simulation.
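For the Stampede2 chunking test mentioned above, the dataset creation was changed along these lines (a sketch; the split N/32 is a hypothetical choice giving roughly 0.13 GB per chunk):

INTEGER(HID_T) :: dcpl_id
INTEGER(HSIZE_T), DIMENSION(2) :: chunk_dims = (/ M, N/32 /) ! ~0.13 GB per chunk
! Create a dataset-creation property list that requests chunked storage.
CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
CALL h5pset_chunk_f(dcpl_id, 2, chunk_dims, error)
CALL h5dcreate_f(file_id, dsetname, H5T_NATIVE_INTEGER, filespace, &
                 dset_id, error, dcpl_id)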
I compiled with each supercomputer's default compiler/MPI/parallel HDF5, with something like (e.g., on Frontera):
ftn -g -traceback -check bounds -O1 -no-wrap-margin hyperslab_by_col.f90
When I do get a backtrace, it points to the h5dwrite_f call, plus a stack of other functions in libhdf5, libmpi, and libpthread. Sometimes Frontera reports:
Abort(941240083) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=1, req_array=0x7f2000059cc0, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=0)
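For completeness, the failing call with explicit status checking looks roughly like this (variable names follow the example program; turning on the error stack and aborting on failure are my additions):

! Make sure the HDF5 error stack gets printed, then check the write status.
CALL h5eset_auto_f(1, error)
CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, dimsfi, error, &
                file_space_id = filespace, mem_space_id = memspace, &
                xfer_prp = plist_id)
IF (error /= 0) THEN
   PRINT *, 'h5dwrite_f returned a nonzero status'
   CALL MPI_ABORT(MPI_COMM_WORLD, 1, mpierror)
END IF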
Thanks for any help,