SEGFAULT in H5Dwrite for large parallel collective dataset - only when chunked

My question up front: does this error signature and stack trace correspond to a known bug? Is there a workaround? Does this look like an MPICH issue or an HDF5 issue?

Hi, I’ve recently begun experimenting with chunking my large rank-7 and rank-8 datasets so that I can apply a compression filter. I had good initial results with “COMPRESSION DEFLATE { LEVEL 6 }”, which reduced a nominal 27 GB dataset (and file) to 10 GB and cut the overall write time in half. (The workload is mostly simulation checkpoints written periodically by a parallel job of tens to hundreds of ranks, all with the same local problem size.)

The problem is that I now repeatably hit a segfault inside H5Dwrite after increasing the overall dataset size (to ~36 GB), reducing the number of MPI ranks from 64 to 32, and shrinking the chunk size somewhat.

The chunk size in the segfaulting case is (1, 1, 16, 16, 16, 243, 2) and the dataset size is (48, 48, 16, 16, 16, 243, 2). (The 32 ranks evenly distribute the leading 48x48 dims.)

If I comment out the roughly four lines that introduce chunking on the dataset (approximately the calls sketched below), the file is written correctly with no segfault.
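For reference, the chunking/compression setup amounts to roughly the following; the variable names here are illustrative, not my actual code:

```c
/* Rough sketch of the ~4 lines that enable chunking + DEFLATE on the
 * dataset creation property list (variable names are illustrative). */
hsize_t chunk_dims[7] = {1, 1, 16, 16, 16, 243, 2};
hid_t   dcpl_id       = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl_id, 7, chunk_dims);
H5Pset_deflate(dcpl_id, 6);            /* COMPRESSION DEFLATE { LEVEL 6 } */
/* dcpl_id is then passed to H5Dcreate2(..., dcpl_id, ...) */
```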

This is on an HPE Cray Slingshot-11 system, built and run with these pertinent module versions:
cray-hdf5-parallel/1.12.2.3
cray-mpich/8.1.26

$ h5dump -pH -d /variables/pdf_n/data simulation_checkpoint_state.h5 
HDF5 "simulation_checkpoint_state.h5" {
DATASET "/variables/pdf_n/data" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 48, 48, 16, 16, 16, 243, 2 ) / ( 48, 48, 16, 16, 16, 243, 2 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      SIZE 36691771392
      OFFSET 3450104
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_NEVER
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_EARLY
   }
   ATTRIBUTE "startIndices" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 7 ) / ( 7 ) }
   }
}
}

Hi @noah,

This doesn’t appear to correspond to any currently known issue with HDF5, but I can’t yet tell whether it looks like an MPICH or an HDF5 issue. A couple of things first:

  • It looks like the system has HDF5 1.12.2 installed. Are there any newer versions (1.14.0 or newer) available on the system that you could try?
  • Do you happen to have collective metadata writes and/or reads enabled by having called H5Pset_coll_metadata_write and/or H5Pset_all_coll_metadata_ops? If so, can you try disabling those calls as well, to see whether the issue is related to that? The performance probably won’t be great, but it may at least give a direction to look in for the issue.
  • Do you happen to have an example program that can simulate this crash?

There have been some issues with certain newer MPICH versions in the 4.0.x range, but the Cray MPICH versioning can make it difficult to tell whether those same issues are present. We’ve tested with Cray MPICH 8.1.26 previously, but I believe we need to do more regular testing with it.

Hi,

Thank you for these considerations.

  • cray-hdf5-parallel/1.12.2.3 is the newest version available on the two Dept. of Energy systems I am working with. I’m going to presume the latest Cray PE is not much further ahead than this. I don’t have any experience building or running parallel HDF5 releases that are not built by HPE as part of the Cray PE.
  • I do not use H5Pset_coll_metadata_write or H5Pset_all_coll_metadata_ops.
  • I do not have an example program, and of course the real program is big and complex. That said, the set of HDF5 and MPICH API calls leading up to the crash is fairly small, perhaps 20 lines of code; a rough sketch of what a reproducer might look like is below. I’ll work on a real one as I have time.
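Roughly, the sequence of calls is something like the sketch below. To be clear, this is not the production code: the file/dataset names, the simple 1-D split of the leading dimension, and the absence of error checking are placeholders, but the property-list settings follow what I described above.

```c
/* Sketch of a possible reproducer: one chunked, DEFLATE-compressed 7-D
 * dataset written collectively, each rank selecting a slab of dim 0.
 * The 1-D split assumes mpi_size evenly divides dims[0]; the real code
 * decomposes both leading dims. Error checking omitted for brevity. */
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

#define NDIMS 7

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int mpi_rank, mpi_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    hsize_t dims[NDIMS]  = {48, 48, 16, 16, 16, 243, 2};
    hsize_t chunk[NDIMS] = { 1,  1, 16, 16, 16, 243, 2};

    /* Parallel file access via the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked + DEFLATE layout, early allocation, never fill (as in the dump). */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, NDIMS, chunk);
    H5Pset_deflate(dcpl, 6);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);
    H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);

    hid_t fspace = H5Screate_simple(NDIMS, dims, NULL);
    hid_t dset   = H5Dcreate2(file, "data", H5T_IEEE_F64LE, fspace,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank selects its slab of dim 0 (illustrative decomposition). */
    hsize_t start[NDIMS] = {0}, count[NDIMS];
    for (int i = 0; i < NDIMS; i++) count[i] = dims[i];
    count[0] = dims[0] / (hsize_t)mpi_size;
    start[0] = (hsize_t)mpi_rank * count[0];
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hsize_t nelems = 1;
    for (int i = 0; i < NDIMS; i++) nelems *= count[i];
    double *buf = malloc(nelems * sizeof(double)); /* large per-rank buffer */
    for (hsize_t i = 0; i < nelems; i++) buf[i] = (double)mpi_rank;

    hid_t mspace = H5Screate_simple(NDIMS, count, NULL);

    /* Collective transfer, as required for parallel filtered writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf); /* crash site */

    free(buf);
    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Sclose(fspace);
    H5Pclose(dcpl); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```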

I wish that h5repack worked in parallel with MPI. With some luck, that might itself serve as a reproducer. It would also be a valuable tool for trying out, say, new compression filters or chunking schemes on my production datasets; something like the serial invocation below is what I would want to run at scale.
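For example (the layout/filter arguments and output file name here are only illustrative of what I’d want to test):

```
$ h5repack -l /variables/pdf_n/data:CHUNK=1x1x16x16x16x243x2 \
           -f /variables/pdf_n/data:GZIP=6 \
           simulation_checkpoint_state.h5 repacked_checkpoint.h5
```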

Cheers,

Noah

Hi @noah,

cray-hdf5-parallel/1.12.2.3 is the newest version available on the two Dept. of Energy systems I am working with

Ah, OK. I was hoping there might be a newer version available, since there are several fixes, among other things, in the 1.14 releases. That said, if you wanted to try building a newer version of HDF5 just to see whether it works, you could start with Building HDF5 with CMake if CMake is available on the system, or with https://github.com/HDFGroup/hdf5/blob/develop/release_docs/INSTALL for Autotools. I’d be happy to help if you run into problems while trying to build HDF5; a rough idea of what a parallel CMake build could look like is sketched below.
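As a very rough sketch, assuming CMake and the Cray compiler wrappers are available (the source directory, version, and install prefix below are placeholders), a parallel build might look something like:

```
module load cray-mpich/8.1.26
CC=cc cmake -S hdf5-1.14.x -B build \
    -DHDF5_ENABLE_PARALLEL=ON \
    -DCMAKE_INSTALL_PREFIX=$HOME/opt/hdf5-1.14
cmake --build build --parallel
cmake --install build
```

You would then point your application’s build at that install prefix instead of the cray-hdf5-parallel module.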

I do not use H5Pset_coll_metadata_write or H5Pset_all_coll_metadata_ops

Based on the trace you provided, it looks like collective metadata operations were enabled somewhere, so I’m guessing it may be happening above the level of your code. However, if your code opens files directly by calling, for example, H5Fopen, then you should be able to create a File Access Property List with fapl_id = H5Pcreate(H5P_FILE_ACCESS) and then call H5Pset_coll_metadata_write(fapl_id, 0) and H5Pset_all_coll_metadata_ops(fapl_id, 0) to disable those features, roughly as in the sketch below.
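In code, that would look roughly like the following (the file name is a placeholder and error checking is omitted):

```c
/* Sketch: open the file for parallel access with collective metadata
 * writes/reads explicitly disabled (placeholder file name, no error checks). */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
H5Pset_coll_metadata_write(fapl_id, 0);   /* disable collective metadata writes */
H5Pset_all_coll_metadata_ops(fapl_id, 0); /* disable collective metadata reads  */
hid_t file_id = H5Fopen("simulation_checkpoint_state.h5", H5F_ACC_RDWR, fapl_id);
/* ... proceed with dataset I/O ... */
H5Pclose(fapl_id);
```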

I do not have an example program and of course the real program is big and complex

Understandable. I hope you can reproduce the problem with a small example, though, since issues like this can be difficult to debug otherwise. The problem could be in HDF5, but sometimes it comes down to a small miscalculation in the user’s program, and other times it turns out to be an issue in the MPI implementation. Does the system have tools available for debugging your program, such as valgrind or valgrind4hpc?

I wish that h5repack worked in parallel with MPI

There has been some talk about implementing a parallel version of h5repack, but so far it hasn’t really gone anywhere. I agree it would be handy to have, though!
