Large parallel/collective dataset crashes and Lustre striping

Despite having run many very large PIC simulations with parallel collective HDF5 output, I sometimes see crashes for large simulations on some supercomputers (and on smaller systems).

In general, I’d be glad of advice on requirements for parallel writing of large datasets, where each MPI rank writes a separate part of one file, in a way that scales to hundreds of thousands of MPI ranks. At the moment performance is not the issue; the problem is crashing, e.g., when a simulation is weak-scaled to a larger size.

In trying to track down and narrow the problem, I’ve run into failures on much smaller datasets that I’d like to understand, possibly related to 32-bit integers or addresses (e.g., problems appear for datasets around 4 GB).

I’ve been running a test case: a very slightly modified HDF5 example program (described more below) from:

On different machines it succeeds or fails in different ways. E.g., when running 2 ranks on 2 nodes (1 MPI rank per node):

On ALCF Theta (intel 19.1.0.166, cray-mpich 7.7.14, cray-hdf5 10.6.1, Lustre), it writes a 100 GB dataset with no problem (even using 1 Lustre stripe for the file). Hooray!

On TACC Frontera (intel 19.1.1, impi 19.0.9, hdf5 1.10.4, Lustre), it writes a 5 GB dataset with no problem; however, when writing an 11 GB dataset, it crashes if I use 1 stripe with a 1 MB stripe size for the file, but succeeds with either 2 stripes (1 MB size) or 1 stripe (2 MB size). I have no idea why striping affects whether it crashes.

On our local small cluster (gfortran/mpich/hdf5 1.10.5), TACC Stampede2 KNL (intel/impi 18.0.2, hdf5 1.10.4, Lustre), and Cyfronet Prometheus (hdf5 1.12.0), the program segfaults when the entire dataset is larger than about 4 GB – e.g., it succeeds at 3.998 GB but fails at 4.08 GB. [On TACC Stampede2/KNL I additionally modified the program to use chunks of about 0.13 GB, roughly as sketched below, and successfully wrote a 4.08 GB file, but failed to write a 7.998 GB file.]
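The chunking modification was along these lines (a sketch only, not my exact change; dcpl_id and chunk_dims are names I am adding here, and with N divided by 32 each chunk is roughly 0.13 GB given the dimensions further below):

INTEGER(HID_T) :: dcpl_id                                     ! dataset-creation property list
INTEGER(HSIZE_T), DIMENSION(2) :: chunk_dims = (/ M, N/32 /)  ! ~0.13 GB per chunk

CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
CALL h5pset_chunk_f(dcpl_id, 2, chunk_dims, error)
! pass the property list when creating the dataset:
CALL h5dcreate_f(file_id, dsetname, H5T_NATIVE_INTEGER, filespace, &
                 dset_id, error, dcpl_id)
CALL h5pclose_f(dcpl_id, error)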

For these tests I modified the Fortran 90 HDF5 example program from


(scroll down to find hyperslab_by_col.f90)

changing (only) the following 2 lines to test with larger arrays:
from
INTEGER(HSIZE_T), DIMENSION(2) :: dimsf = (/5,8/) ! Dataset dimensions.
INTEGER(HSIZE_T), DIMENSION(2) :: dimsfi = (/5,8/)
to
INTEGER(HSIZE_T), PARAMETER :: M = 7
INTEGER(HSIZE_T), PARAMETER :: N = 104 * INT(960*1536, 8)
INTEGER(HSIZE_T), DIMENSION(2) :: dimsf = (/M,N/) ! Dataset dimensions.
INTEGER(HSIZE_T), DIMENSION(2) :: dimsfi = (/M,N/)

The array being written has M × N elements in total, each a 32-bit integer (with the values above, about 4.3 GB); half the array is written by each MPI rank, as sketched below. The odd choice of N has historical roots in the original dataset of a crashing simulation.
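For reference, the per-rank selection in the example is a column block, roughly like this (a sketch reconstructed from hyperslab_by_col.f90 using its variable names; the ordering in the original may differ slightly):

count(1)  = dimsf(1)
count(2)  = dimsf(2) / mpi_size     ! each rank writes a contiguous block of columns
offset(1) = 0
offset(2) = mpi_rank * count(2)

CALL h5screate_simple_f(2, count, memspace, error)
CALL h5dget_space_f(dset_id, filespace, error)
CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, error)

CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
CALL h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_COLLECTIVE_F, error)
CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, dimsfi, error, &
                file_space_id=filespace, mem_space_id=memspace, xfer_prp=plist_id)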

I compiled with the supercomputer’s default compiler/mpi/parallel hdf5, with something like (e.g., on Frontera)
ftn -g -traceback -check bounds -O1 -no-wrap-margin hyperslab_by_col.f90

When I do get a backtrace, it points to the h5dwrite_f call, plus a list of other functions in libhdf5, libmpi, and libpthread. Sometimes Frontera reports:
Abort(941240083) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=1, req_array=0x7f2000059cc0, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=0)

Thanks for any help,
Greg.

Would you mind trying Hyperslab_by_row.c to see if it has the same issue?
Could you post the complete backtrace?
Does it always work for independent writes?
I don’t think it will matter, but you could try passing a pointer to H5Dwrite instead.
Could you also try with the develop branch of HDF5?

(1) Hyperslab_by_row.c behaves the same as hyperslab_by_col.f90 (some details below).

(2) I’ve posted the error messages and backtrace below.

(3) I’ll show results of the independent writes in another reply. I’m not sure I’m doing it right, but in short independent writing appears to be, if anything, worse, and it produces different error messages from HDF5.

(4) Can you elaborate on passing a pointer to H5Dwrite? I.e., for which argument, and do you mean in Fortran or in C?

(5) Trying a development branch of HDF5 would be difficult, so I’d like to leave that for later.

Details on (1):

Hyperslab_by_row.c (i.e., the C analogue of hyperslab_by_col.f90) appears to show the same behavior, at least on TACC Frontera (where the crash depends on striping) and Stampede2/KNL (where the crash occurs for a 4.0 GB file but not for a 3.999 GB file).

For Hyperslab_by_row.c, I made just the following changes:
#define NX     268 * 960ul * 1536     /* dataset dimensions */
#define NY     7

(2) Errors/backtraces:

From Frontera, I just get the error (when the striping is -c 1 -S 1m):
Abort(874131219) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=1, req_array=0x7f20001ebc80, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=0)

  • I tried running valgrind, but I get the same valgrind messages for failed and successful (e.g., striping -c 1 -S 2m) runs.

From Stampede2/KNL (Fortran, hyperslab_by_col.f90), I get a backtrace:

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
hyperslab_by_col 000000000040ACCE Unknown Unknown Unknown
libpthread-2.17.s 00002AF9E525D630 Unknown Unknown Unknown
hyperslab_by_col 000000000040A720 Unknown Unknown Unknown
libpthread-2.17.s 00002AF9E525D630 Unknown Unknown Unknown
libpthread-2.17.s 00002AF9E525A59B pthread_spin_tryl Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
hyperslab_by_col 000000000040ACCE Unknown Unknown Unknown
libpthread-2.17.s 00002B13EF12A630 Unknown Unknown Unknown
libmpi.so.12 00002B13EE1BFFE2 PMPIDI_CH3I_Progr Unknown Unknown
libmpi.so.12 00002B13EE569E3F PMPI_Waitall Unknown Unknown
libmpi_lustre.so. 00002B13F8597FD1 Unknown Unknown Unknown
libmpi_lustre.so. 00002B13F8596BF7 ADIOI_LUSTRE_Writ Unknown Unknown
libmpi.so.12.0 00002B13EE57AF4C Unknown Unknown Unknown
libmpi.so.12 00002B13EE57BFC5 PMPI_File_write_a Unknown Unknown
libhdf5.so.103.0. 00002B13ED7849F8 Unknown Unknown Unknown
libhdf5.so.103.0. 00002B13ED53F1B7 H5FD_write Unknown Unknown
libhdf5.so.103.0. 00002B13ED518B4F H5F__accum_write Unknown Unknown
libhdf5.so.103.0. 00002B13ED64F421 H5PB_write Unknown Unknown
libhdf5.so.103.0. 00002B13ED525BE9 H5F_block_write Unknown Unknown
libhdf5.so.103.0. 00002B13ED774F22 H5D__mpio_select_ Unknown Unknown
libhdf5.so.103.0. 00002B13ED7755F4 H5D__contig_colle Unknown Unknown
libhdf5.so.103.0. 00002B13ED4E2829 H5D__write Unknown Unknown
libhdf5.so.103.0. 00002B13ED4E1EBF H5Dwrite Unknown Unknown
libhdf5_fortran.s 00002B13ED1DF985 h5dwrite_f_c Unknown Unknown
libhdf5_fortran.s 00002B13ED1DB47C h5_gen_mp_h5dwrit Unknown Unknown
hyperslab_by_col 0000000000409D26 MAIN__ 105 hyperslab_by_col.f90
hyperslab_by_col 000000000040993E Unknown Unknown Unknown
libc-2.17.so 00002B13EF65B555 __libc_start_main Unknown Unknown
hyperslab_by_col 0000000000409829 Unknown Unknown Unknown
[mpiexec@c455-033.stampede2.tacc.utexas.edu] control_cb (…/…/pm/pmiserv/pmiserv_cb.c:864): connection to proxy 1 at host c455-034 failed

where line 105 is the h5dwrite_f line.

Thanks,
Greg.

Regarding collective vs. independent: if my independent-writing version of the code is correct, then independent behaves much like collective, but is if anything worse.

I changed hyperslab_by_row.f90; in addition to changing the array size as described before, I simply changed H5FD_MPIO_COLLECTIVE_F to H5FD_MPIO_INDEPENDENT_F (the one-line change is sketched below), with no other changes to the source or the build/run configuration. (I’ve never considered independent writing and am not sure this is valid, but it does work for small enough files.)
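For reference, the change is just the data-transfer property (plist_id is the transfer property list already used in the example):

! original (collective):
CALL h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_COLLECTIVE_F, error)
! what I tested instead (independent):
CALL h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_INDEPENDENT_F, error)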

Again, on Stampede2/KNL (and my local cluster), it worked for file size 3.999 GB, but failed for 4.08 GB (the same as with collective writing). However, instead of segfaulting, it saved a file with the dataset all zeros, and issued hdf5 error messages (below).

On Frontera, for an 11 GB file, it failed regardless of striping (whereas with collective it failed only if the stripes and stripe size were too few and too small). Again, instead of segfaulting, it created a dataset with all zeros and issued hdf5 errors (below).

Here are the HDF5 errors, which were essentially identical in all failing cases.

HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 1:
#000: H5Dio.c line 336 in H5Dwrite(): can’t write data
major: Dataset
minor: Write failed
#001: H5Dio.c line 828 in H5D__write(): can’t write data
major: Dataset
minor: Write failed
#002: H5Dcontig.c line 633 in H5D__contig_write(): contiguous write failed
major: Dataset
minor: Write failed
#003: H5Dselect.c line 314 in H5D__select_write(): write error
major: Dataspace
minor: Write failed
#004: H5Dselect.c line 225 in H5D__select_io(): write error
major: Dataspace
minor: Write failed
#005: H5Dcontig.c line 1269 in H5D__contig_writevv(): can’t perform vectorized read
major: Dataset
minor: Can’t operate on object
#006: H5VM.c line 1500 in H5VM_opvv(): can’t perform operation
major: Internal error (too specific to document in detail)
minor: Can’t operate on object
#007: H5Dcontig.c line 1198 in H5D__contig_writevv_cb(): block write failed
major: Dataset
minor: Write failed
#008: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#009: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#010: H5Faccum.c line 826 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#011: H5FDint.c line 258 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#012: H5FDmpio.c line 1744 in H5FD_mpio_write(): can’t convert from size to size_i
major: Internal error (too specific to document in detail)
minor: Out of range
HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 0:
#000: H5Dio.c line 336 in H5Dwrite(): can’t write data
major: Dataset
minor: Write failed
#001: H5Dio.c line 828 in H5D__write(): can’t write data
major: Dataset
minor: Write failed
#002: H5Dcontig.c line 633 in H5D__contig_write(): contiguous write failed
major: Dataset
minor: Write failed
#003: H5Dselect.c line 314 in H5D__select_write(): write error
major: Dataspace
minor: Write failed
#004: H5Dselect.c line 225 in H5D__select_io(): write error
major: Dataspace
minor: Write failed
#005: H5Dcontig.c line 1269 in H5D__contig_writevv(): can’t perform vectorized read
major: Dataset
minor: Can’t operate on object
#006: H5VM.c line 1500 in H5VM_opvv(): can’t perform operation
major: Internal error (too specific to document in detail)
minor: Can’t operate on object
#007: H5Dcontig.c line 1198 in H5D__contig_writevv_cb(): block write failed
major: Dataset
minor: Write failed
#008: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#009: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#010: H5Faccum.c line 826 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#011: H5FDint.c line 258 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#012: H5FDmpio.c line 1744 in H5FD_mpio_write(): can’t convert from size to size_i
major: Internal error (too specific to document in detail)
minor: Out of range
TACC: Shutdown complete. Exiting.
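If I’m reading the last item right, the failure in H5FDmpio.c (“can’t convert from size to size_i”) looks like a write size that no longer fits in a 32-bit int on its way to MPI-IO, which would be consistent with the ~4 GB / 2-rank (~2 GB per rank) threshold I’m seeing. A back-of-the-envelope check along those lines (a sketch only; dimsf and mpi_size are from the example, and C_INT is the kind of a C int):

USE ISO_C_BINDING, ONLY : C_INT

INTEGER(HSIZE_T) :: bytes_per_rank
! each rank writes dimsf(1) x dimsf(2)/mpi_size elements of 4 bytes each
bytes_per_rank = dimsf(1) * (dimsf(2) / mpi_size) * INT(4, HSIZE_T)
IF (bytes_per_rank > INT(HUGE(1_C_INT), HSIZE_T)) THEN
   PRINT *, 'per-rank write of', bytes_per_rank, 'bytes will not fit in a 32-bit C int'
END IF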

Thanks,
Greg.

We think we have all the fixes for this problem in the hdf5_1_12 branch; that work was never backported to 1.10. So, most likely, you will need to build HDF5 yourself for it to work. I don’t think you need to cross-compile HDF5, so it should be a straightforward install (you don’t need root access). HDF5 1.14 should be released shortly, or you can use the hdf5_1_14 branch (1.14 is develop at this point). You could also ask TACC to install HDF5 1.14.0 when it is released.

FYI, Fortran 2003 API, passing pointer:

subroutine h5dwrite_f(dset_id, mem_type_id, buf, hdferr, &
                      mem_space_id, file_space_id, xfer_prp)
  integer(hid_t), intent(in)           :: dset_id
  integer(hid_t), intent(in)           :: mem_type_id
  type(c_ptr),    intent(in)           :: buf
  integer,        intent(out)          :: hdferr
  integer(hid_t), intent(in), optional :: mem_space_id
  integer(hid_t), intent(in), optional :: file_space_id
  integer(hid_t), intent(in), optional :: xfer_prp
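A minimal usage sketch against the example (assuming the buffer data carries the TARGET attribute so C_LOC can be applied; the other names follow hyperslab_by_col.f90):

USE ISO_C_BINDING                 ! for TYPE(C_PTR) and C_LOC

INTEGER, DIMENSION(:,:), ALLOCATABLE, TARGET :: data
TYPE(C_PTR) :: f_ptr

f_ptr = C_LOC(data(1,1))
CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, f_ptr, error, &
                mem_space_id=memspace, file_space_id=filespace, xfer_prp=plist_id)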

The same failure occurs with hdf5 1.12.0 on Cyfronet Prometheus (as on Stampede2, working for a 3.99 GB file, crashing for a 4.08 GB file) – backtrace is below.

Should the issue you refer to be fixed in 1.12.0? (Or does it depend on more detailed version info?)

Also, do you think that Frontera (which has no problem writing a 5 GB file with hyperslab_by_col.f90, but can’t write an 11 GB file with 1 stripe) is experiencing the same problem as Stampede2, which can’t write a 4 GB file?

On Prometheus, 4.08 GB file:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source

hyperslab_by_col 0000000000E5179A Unknown Unknown Unknown
libpthread-2.17.s 00002B8F6776C630 Unknown Unknown Unknown
libmpifort.so.12. 00002B8F665F72B0 I_MPI___intel_a Unknown Unknown
libmpi.so.12.0 00002B8F66C7DF45 Unknown Unknown Unknown
libmpi.so.12.0 00002B8F66C78A2F Unknown Unknown Unknown
libmpi.so.12.0 00002B8F66C70A66 Unknown Unknown Unknown
libmpi.so.12.0 00002B8F66C3AD30 Unknown Unknown Unknown
libmpi.so.12 00002B8F669FEB98 PMPIDI_CH3I_Progr Unknown Unknown
libmpi.so.12 00002B8F66DAD2BF PMPI_Waitall Unknown Unknown
libmpi_lustre.so. 00002B8F6A8EDFD1 Unknown Unknown Unknown
libmpi_lustre.so. 00002B8F6A8ECBF7 ADIOI_LUSTRE_Writ Unknown Unknown
libmpi.so.12.0 00002B8F66DBE3CC Unknown Unknown Unknown
libmpi.so.12 00002B8F66DBF445 PMPI_File_write_a Unknown Unknown
hyperslab_by_col 0000000000D4F038 Unknown Unknown Unknown
hyperslab_by_col 00000000006F91C2 Unknown Unknown Unknown
hyperslab_by_col 0000000000DECB7A Unknown Unknown Unknown
hyperslab_by_col 00000000009C2689 Unknown Unknown Unknown
hyperslab_by_col 00000000006A59F7 Unknown Unknown Unknown
hyperslab_by_col 0000000000D2BF3B Unknown Unknown Unknown
hyperslab_by_col 0000000000D376B6 Unknown Unknown Unknown
hyperslab_by_col 0000000000D36E97 Unknown Unknown Unknown
hyperslab_by_col 0000000000D2DFA0 Unknown Unknown Unknown
hyperslab_by_col 00000000005F7C80 Unknown Unknown Unknown
hyperslab_by_col 0000000000CB3436 Unknown Unknown Unknown
hyperslab_by_col 0000000000C68357 Unknown Unknown Unknown
hyperslab_by_col 0000000000C687B0 Unknown Unknown Unknown
hyperslab_by_col 00000000005F443D Unknown Unknown Unknown
hyperslab_by_col 000000000048E7EA Unknown Unknown Unknown
hyperslab_by_col 000000000047D962 Unknown Unknown Unknown
hyperslab_by_col 00000000004062F1 MAIN__ 105 hyperslab_by_col.f90
hyperslab_by_col 0000000000405F22 Unknown Unknown Unknown
libc-2.17.so 00002B8F6799B555 __libc_start_main Unknown Unknown
hyperslab_by_col 0000000000405E29 Unknown Unknown Unknown

Greg.

I also get the same error on Frontera when using phdf5/1.12.2 (as when using 1.10).

But on ALCF Theta with hdf5 10.6.1, there’s no problem.

Greg.

Hello Greg,

I do not know if this helps, but I noticed that things start to fail when you are using Intel MPI.

In the past, I also had trouble with various versions of Intel MPI: either my program hung or I got an MPI error. My tests were done on a local HDD, and even a couple of the parallel HDF5 tests failed. I reported these issues to Intel and they confirmed them. My observation at the time was that weird things start to happen once a rank writes more than 2 GB. The Intel MPI version that works for me is 2021.2.0. Having said all this, I am not sure you are facing the same problems I did, but if possible I would suggest using Intel MPI 2021.2.0.

Best regards,
Jan-Willem

Jan-Willem,

Thanks; that’s helpful to know. I do see one of these issues with gcc/mpich (on our old local cluster), but there may be multiple problems that I’m running into.

E.g., on the local cluster with gcc/mpich, I sometimes get a completely different error suggesting that we haven’t built MPICH or parallel HDF5 quite right, so that issue may be irrelevant to the problems I’m having on Frontera/Stampede2 with intel/impi.

Greg.

@brtnfld

Do you mean the fixes are in the hdf5_1_14 branch? Or hdf5_1_12 branch?
Thanks,
Greg.

Confirming @jan-willem.blokland’s suspicions: on TACC Frontera I have verified (with help from TACC) that the problems described above with the hyperslab_by_col.f90 example and striping were not fixed by HDF5 1.14, but were fixed by using intel/impi 2021 (with parallel HDF5 1.10). This also fixed a problem in my actual simulation code that caused it to stall once a file reached 20 TB.

I have not yet been able to test the other failure mode I observed (the ~4 GB failure on machines like Stampede2 and Prometheus; Frontera failed in a different, striping-dependent way) against a new HDF5 version or a new intel/impi.

@wernerg, thanks for letting us know that the intel/impi 2021 version solves your problems on TACC Frontera. Yet another data point for me to keep in mind.

Maybe also useful to know: recently I started using parallel HDF5 1.14.0 in combination with intel/impi 2021, and for a small test I did on NFS there was a huge performance improvement compared to parallel HDF5 1.12.1 with intel/impi 2021. The test used to take about 50 seconds and now takes about a second. Of course, NFS is not ideal for parallel I/O.

And, thanks to TACC, I can now confirm that intel/ifort/impi 2021.7.0 (with parallel HDF5 1.10.4) fixes the other bug on Stampede2/KNL (which might have been related to the Frontera bug but manifested differently, and which looked the same as the bug on Prometheus).

@jan-willem.blokland Again thanks to TACC, I also tried HDF5 1.14 on Frontera (with intel/impi 2021) and it was actually slower than HDF5 1.10 – however, the test was a toy problem and 1.14 was a quickly built temporary install, so I take this only as motivation for more careful speed testing.