My questions upfront: Does this error signature and stack trace correspond to a known bug? Is there any workaround? Does this look like an MPICH issue or an HDF5 issue?
Hi, I've recently begun experimenting with chunking my large rank-7 and rank-8 datasets so that I can apply a compression filter. I had good initial results with "COMPRESSION DEFLATE { LEVEL 6 }", which reduced a nominal 27 GB dataset (and the resulting file) to about 10 GB and cut the overall write time in half. (The use case is mostly simulation checkpoints written periodically from a parallel job with tens to hundreds of MPI ranks, all with the same local problem size.)
The problem is that I now repeatably hit a segfault inside H5Dwrite after increasing the overall dataset size to ~36 GB while reducing the number of MPI ranks from 64 to 32 and shrinking the chunk size somewhat.
The chunk size that triggers the segfault is ( 1, 1, 16, 16, 16, 243, 2 ) and the dataset size is ( 48, 48, 16, 16, 16, 243, 2 ). (The 32 ranks evenly divide the leading 48x48 dimensions.)
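In sketch form, the per-rank selection and the write that segfaults look like this (simplified; the variable names, the 8x4 rank grid, and the transfer property setup are illustrative rather than copied verbatim from the code):

/* Illustrative sketch of the per-rank selection and the H5Dwrite call that
 * segfaults. mpi_rank, dset_id, and local_data come from elsewhere in the
 * code; the 8x4 rank grid is just one way 32 ranks can evenly split the
 * leading 48x48 dims. */
hsize_t dset_dims[7]  = {48, 48, 16, 16, 16, 243, 2};
hsize_t local_dims[7] = { 6, 12, 16, 16, 16, 243, 2};   /* per-rank block        */
hsize_t start[7]      = {(mpi_rank / 4) * 6,            /* 8 rows of 6 elements  */
                         (mpi_rank % 4) * 12,           /* 4 cols of 12 elements */
                         0, 0, 0, 0, 0};

hid_t filespace = H5Screate_simple(7, dset_dims, NULL);
hid_t memspace  = H5Screate_simple(7, local_dims, NULL);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, local_dims, NULL);

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* collective I/O, required for parallel filtered writes */

H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local_data);   /* <-- segfaults here */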
If I comment out the ~four lines that introduce chunking on the dataset, the file is written correctly with no segfault.
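Those ~four lines are essentially the following (again a sketch, with simplified variable names):

/* Sketch of the chunking + compression setup that triggers the problem;
 * file_id and filespace are created earlier. Removing these lines and
 * passing H5P_DEFAULT as the dcpl gives a contiguous dataset that writes fine. */
hsize_t chunk_dims[7] = {1, 1, 16, 16, 16, 243, 2};
hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 7, chunk_dims);
H5Pset_deflate(dcpl, 6);                        /* COMPRESSION DEFLATE { LEVEL 6 } */

hid_t dset_id = H5Dcreate(file_id, "/variables/pdf_n/data", H5T_IEEE_F64LE,
                          filespace, H5P_DEFAULT, dcpl, H5P_DEFAULT);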
This is on an HPE Cray Slingshot-11 system, built and run with these pertinent module versions:
cray-hdf5-parallel/1.12.2.3
cray-mpich/8.1.26
For reference, here is the h5dump of the dataset as written by the working run with chunking disabled:

$ h5dump -pH -d /variables/pdf_n/data simulation_checkpoint_state.h5
HDF5 "simulation_checkpoint_state.h5" {
DATASET "/variables/pdf_n/data" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 48, 48, 16, 16, 16, 243, 2 ) / ( 48, 48, 16, 16, 16, 243, 2 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 36691771392
OFFSET 3450104
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_NEVER
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "startIndices" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 7 ) / ( 7 ) }
}
}
}