Crash when writing parallel compressed chunks

Also, I clearly forgot to build in debug mode. I’ll rebuild now and reattach the stack traces if they have anything more useful in them.

From those traces, it looks like rank 0 and 1 allocated chunks out of address order and got stuck in a sort loop due to a naive mistake I just spotted in the patch I uploaded. Ranks 2 and 3, however, seem to have allocated chunks in the correct address order and went on to try doing the collective chunk write. Here’s a new patch that hopefully fixes that issue.

H5Dchunk_assertion.patch (2.6 KB)

Meanwhile, I’m going to see if I can find a system we can test this on where I can get the same results. It can be a bit non-deterministic as it relies on the chunks being allocated out of order, and I’m not yet certain of a way to force this condition. Thankfully, the issue seems reproducible on a wide variety of systems, so this shouldn’t be too difficult.

Bingo. That patch works! Thanks @jhenderson

@jhenderson, as it sounded like a known bug, is there a JIRA issue for this one? I just want to keep track of it so I know when it has been merged. We’re currently using a patched version and I want to know when we can move back to a released version. Thanks again!

Hi @jrichardshaw,

The JIRA issue we have for this is: HDFFV-10792.

I have a slightly revised set of changes currently sitting in a PR against the HDF5 development branch, where we’re discussing coming up with a test for the issue, possibly by adapting the example given here. However, I think it’s safe to assume that the fix for this issue will make it into the upcoming HDF5 release. Once that happens, we will probably need to assess how much the newly-introduced sort call affects parallel performance, and whether we should tackle the issue at different layers within the library so that we never reach the point where the sort is needed.
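To give a rough idea of what that sort amounts to, here is a hypothetical sketch (not the actual patch; the struct and field names are made up): the per-chunk entries each rank contributes to the collective write get ordered by their allocated file address, so the offsets handed to MPI-IO increase monotonically even when allocation happened out of order.

#include <stdlib.h>
#include <stdint.h>

/* Illustrative only: one entry per chunk this rank will write. */
typedef struct chunk_entry_t {
    uint64_t file_addr; /* allocated offset of the chunk in the file */
    size_t   nbytes;    /* size of the (possibly compressed) chunk   */
    void    *buf;       /* chunk data to be written                  */
} chunk_entry_t;

/* qsort comparator: ascending file address */
static int cmp_chunk_addr(const void *a, const void *b)
{
    const chunk_entry_t *ca = (const chunk_entry_t *)a;
    const chunk_entry_t *cb = (const chunk_entry_t *)b;

    if (ca->file_addr < cb->file_addr) return -1;
    if (ca->file_addr > cb->file_addr) return 1;
    return 0;
}

/* Sort the chunk list by address before building the MPI file type
 * for the collective write. */
static void sort_chunks_by_addr(chunk_entry_t *chunks, size_t nchunks)
{
    qsort(chunks, nchunks, sizeof(chunk_entry_t), cmp_chunk_addr);
}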

@jhenderson I’ve just revisited this one as we’ve been trying to get our code running with your patch, and I’ve found that with slight changes to the chunking parameters it still crashes. I’ve updated the test code I used above very slightly to change the chunk parameters (https://gist.github.com/jrs65/97e36592785d3db2729a8ed20521eaa6).

The various sets of parameters used and their behaviour in three versions of HDF5 (1.10.5, 1.10.5 with your final patch above, and 1.10.6) are documented in the comments in the gist. The salient points are that the original test case still fails on 1.10.6 (hanging rather than crashing), and that with a slight change to the chunking parameters (a smaller chunk size but more chunks, which increases the total axis size) all versions, including your patched one, crash.

I get slightly different messages depending on whether I enable compression or not, but it’s pretty much the same across all versions. The debug error messages I get without compression are:

HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5D.c line 151 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 337 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1592 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1833 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 582 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1126 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007: H5Gobj.c line 327 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #008: H5Omessage.c line 883 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #009: H5Oint.c line 1066 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #010: H5AC.c line 1352 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #011: H5C.c line 2298 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #012: H5C.c line 2298 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
[MPI-process 3 reports an identical error stack]
rank=2 writing dataset2
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5Dio.c line 314 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5D.c line 337 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5Dio.c line 314 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5D.c line 337 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything

Any ideas what’s going wrong here?

Thanks!

Hi @jrichardshaw,

I can’t say that I quite know what’s going on from an initial glance, but since I can see that MPI_ERR_TRUNCATE is being returned by MPI, it’s possible that the MPI ranks are getting out of sync with each other. I don’t have high hopes that this will fix the problem, but could you try making a call to MPI_Barrier(comm) between the writing of your first dataset and the creation of the next? I get the feeling that some of the ranks may have managed to rush ahead and are trying to create a dataset while the other ranks are still writing to the first one. If the ranks that rushed ahead manage to match up an MPI_Bcast() call with the other ranks, the call will be made with mismatched arguments, which is a likely cause of the MPI message truncation error.
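To make the suggestion concrete, here is a rough sketch of the pattern in question with the barrier in place (this is not the gist’s actual code; the dataset names, sizes, chunking and datatype are all illustrative, and compression is omitted):

#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

#define NROWS_PER_RANK 4
#define NCOLS          1024

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open the file with the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("barrier_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked 2-D dataset: one row-block (= one chunk) per rank */
    hsize_t dims[2]  = {(hsize_t)size * NROWS_PER_RANK, NCOLS};
    hsize_t chunk[2] = {NROWS_PER_RANK, NCOLS};
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);

    /* Collective data transfers */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* Each rank selects its own block of rows */
    hsize_t start[2] = {(hsize_t)rank * NROWS_PER_RANK, 0};
    hsize_t count[2] = {NROWS_PER_RANK, NCOLS};
    hid_t mspace = H5Screate_simple(2, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    double *buf = calloc(NROWS_PER_RANK * NCOLS, sizeof(double));

    hid_t dset1 = H5Dcreate2(file, "dataset1", H5T_NATIVE_DOUBLE, fspace,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset1, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    /* The suggested barrier: stop any rank from running ahead into the
     * (collective) creation of dataset2 while others are still writing. */
    MPI_Barrier(MPI_COMM_WORLD);

    hid_t dset2 = H5Dcreate2(file, "dataset2", H5T_NATIVE_DOUBLE, fspace,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset2, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    free(buf);
    H5Dclose(dset1); H5Dclose(dset2);
    H5Sclose(mspace); H5Sclose(fspace);
    H5Pclose(dxpl); H5Pclose(dcpl); H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}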

In related news, the patch for the first issue in this thread hasn’t been merged into the development branch yet, but should be soon for the upcoming release. It should also be easy to bring that fix back to the 1.10.x series. So, I would expect that your test case will still fail in version 1.10.6, although it is surprising to me that it changes to a hang rather than a crash. Perhaps there might be something about where the ranks are getting caught in this new hang that may point us in the right direction, but for now I think that a patched 1.10.5 will probably work better for you since 1.10.6 doesn’t yet have the fix.

In any case, I’ll try to look at this in the next few days as I get some time, but anything else you might discover about the problem will definitely be useful.

Hi @jhenderson. Thanks for the reply. I’ve just quickly tried inserting an MPI_Barrier call into that test case between the two dataset writes and it didn’t seem to help, I’m afraid. If I get a spare moment, I’ll try playing around with things a little more to see if I can get any more information. Thanks!

Just an observation from a run compiled with 1.10.6 + the patch provided by @jhenderson earlier in this discussion chain.

When the 2nd H5Dcreate call is moved to before the 1st H5Dwrite in @jrichardshaw’s test program, the error goes away. It looks like the problem occurs when calls to H5Dcreate and H5Dwrite are interleaved.

After reading through H5C.c and adding a few printf statements, it appears that the values of entry_ptr->coll_access checked at line 2271 are not consistent among the 4 running processes, which causes only 2 of the 4 processes to call MPI_Bcast at line 2297, and hence the error.
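As a standalone illustration of that failure mode (this is not HDF5 code, and whether a given MPI implementation reports MPI_ERR_TRUNCATE for mismatched collectives is implementation-dependent): when ranks match up MPI_Bcast calls of different sizes, the receivers see a message larger than their buffer, which commonly surfaces as the MPI_ERR_TRUNCATE reported by H5C_protect() above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, err, len;
    char buf[1024] = {0};
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors instead of aborting so we can print them */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Root broadcasts 1024 bytes, the other ranks only expect 16:
     * the mismatch typically shows up as MPI_ERR_TRUNCATE on receivers. */
    int count = (rank == 0) ? 1024 : 16;
    err = MPI_Bcast(buf, count, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        printf("rank %d: MPI_Bcast failed: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}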

Just to follow up. This recent set of parameters in the Gist also fails on HDF5 1.12.0 (a separate discussion with some HDF5 staff suggested it might be fixed and got my hopes up). Crash output (although I think it’s largely the same):

MPI rank [0/4]
rank=0 creating file
MPI rank [1/4]
rank=1 creating file
MPI rank [2/4]
rank=2 creating file
MPI rank [3/4]
rank=3 creating file
rank=0 creating selection [0:4, 0:4194304]
rank=0 creating dataset1
rank=1 creating selection [4:8, 0:4194304]
rank=1 creating dataset1
rank=2 creating selection [8:12, 0:4194304]
rank=2 creating dataset1
rank=3 creating selection [12:16, 0:4194304]
rank=3 creating dataset1
rank=1 writing dataset1
rank=2 writing dataset1
rank=0 writing dataset1
rank=3 writing dataset1
rank=3 finished writing dataset1
rank=3 waiting at barrier
rank=0 finished writing dataset1
rank=0 waiting at barrier
rank=1 finished writing dataset1
rank=1 waiting at barrier
rank=0 creating dataset2
rank=2 finished writing dataset1
rank=2 waiting at barrier
rank=2 creating dataset2
rank=3 creating dataset2
rank=1 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 2:
  #000: H5D.c line 151 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5VLcallback.c line 1869 in H5VL_dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #002: H5VLcallback.c line 1835 in H5VL__dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLnative_dataset.c line 75 in H5VL__native_dataset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #004: H5Dint.c line 411 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #005: H5L.c line 1804 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #006: H5L.c line 2045 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #007: H5Gtraverse.c line 855 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #008: H5Gtraverse.c line 585 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #009: H5Gobj.c line 1125 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #010: H5Gobj.c line 326 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #011: H5Omessage.c line 883 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #012: H5Oint.c line 1082 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #013: H5AC.c line 1312 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #014: H5C.c line 2299 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #015: H5C.c line 2299 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
[MPI-process 3 reports an identical error stack]
rank=2 writing dataset2
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 2:
  #000: H5Dio.c line 300 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 2:
  #000: H5D.c line 332 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 3:
  #000: H5Dio.c line 300 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 3:
  #000: H5D.c line 332 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything

Hi @jrichardshaw,

unfortunately there hasn’t been much time to look at this. However, we do know of some other folks that are looking for a fix to this issue as well. Based on @wkliao’s observation, I’m fairly certain that it’s just a problem of needing to insert barriers in the appropriate place in the library’s code. I remember having an issue reproducing this using your example, so I wasn’t quite able to determine if this really was the source of the issue, but I’m thinking that running several rounds of

H5Dcreate(...);
H5Dwrite(...);

should eventually produce the issue for me. In any case, I believe there should be more info on this issue relatively soon.
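For concreteness, a rough sketch of what I mean by several rounds (the handles passed in are assumed to be set up as in the test program from this thread; the function and dataset names are just illustrative):

#include <stdio.h>
#include <hdf5.h>

/* Repeatedly create a chunked dataset and immediately do a collective
 * write to it, interleaving H5Dcreate2 and H5Dwrite on every round. */
static void create_write_rounds(hid_t file, hid_t filespace, hid_t memspace,
                                hid_t dcpl, hid_t dxpl, const double *buf,
                                int nrounds)
{
    char name[32];

    for (int i = 0; i < nrounds; i++) {
        snprintf(name, sizeof(name), "dataset%d", i);

        hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
        H5Dclose(dset);
    }
}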

Hi again @jrichardshaw, @wkliao and others in this thread. I’ve narrowed down the cause of this issue and will have a small patch to post after I’ve discussed the fix with other developers. Provided that that patch works here and doesn’t cause further issues, we should be able to get the fix in quickly afterwards.

Wonderful. Thanks @jhenderson! I’ll be happy to test the patch whenever you post it.

Hi @jrichardshaw and @wkliao,

attached is a small patch against the 1.12 branch that temporarily disables the collective metadata reads feature in HDF5, which should make the issue disappear for now. However, this is only a temporary fix and may affect performance. The issue stems from an oversight in the design of the collective metadata reads feature that has effectively been masked until recently, and it will need to be fixed properly. While this feature wasn’t explicitly enabled in your test program, there are some cases in the library where we implicitly turn it on because metadata modifications need to be collective, such as for H5Dcreate. That behavior, combined with your chosen chunk size and number of chunks, was right on the line needed to cause the issue to appear. The timeline for fixing this correctly isn’t clear yet, but we hope to have it fixed in time for the next release of HDF5.
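For reference, the application-level switches for this feature live on the file access property list, as sketched below (purely for orientation; as noted above, the library can still enable collective metadata reads implicitly around calls like H5Dcreate, so these calls on their own are not a reliable workaround):

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list using the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Application-level knobs for collective metadata operations:
     * 0 (false) leaves the feature off for reads and writes. */
    H5Pset_all_coll_metadata_ops(fapl, 0);
    H5Pset_coll_metadata_write(fapl, 0);

    hid_t file = H5Fcreate("coll_md_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... dataset creation and collective writes as in the test program ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}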

disable_coll_md_reads.patch (480 Bytes)


Thanks for the patch @jhenderson. We’ve been testing it but we’re still having failures. One of my colleagues has posted a fuller description (the post is awaiting approval), but what we’re finding is that it works with the nominal test case above, but if we go back to the first set of parameters (CHUNK1=32768; NCHUNK1=32), it hangs. This seems more similar to the first issue found in this thread.

Anyway, I think my colleague’s pending post has more details (including stack traces), so I won’t try to repeat them here.


Thanks for the latest patch @jhenderson
I applied it to both the HEAD of the hdf5_1_12 branch as well as the tag hdf5-1_12_0.
Unfortunately, the minimal test supplied by @jrichardshaw still hangs when built against either of these if I uncomment

// Equivalent to original gist
// Works on 1.10.5 with patch, crashes on 1.10.5 vanilla and hangs on 1.10.6
#define CHUNK1 32768
#define NCHUNK1 32

These are the stack traces I got using tmpi 4 gdb ./testh5 (one gdb session per rank, shown side by side):

#0  0x00002aaaab49e9a7 in PMPI_Type_size_x ()                                                                                               │#0  0x00002aaaab49e994 in PMPI_Type_size_x ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#1  0x00002aaaab52d0f3 in ADIOI_GEN_WriteContig ()                                                                                          │#1  0x00002aaaab52d0f3 in ADIOI_GEN_WriteContig ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#2  0x00002aaaab531323 in ADIOI_GEN_WriteStrided ()                                                                                         │#2  0x00002aaaab531323 in ADIOI_GEN_WriteStrided ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#3  0x00002aaaab52faab in ADIOI_GEN_WriteStridedColl ()                                                                                     │#3  0x00002aaaab52faab in ADIOI_GEN_WriteStridedColl ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#4  0x00002aaaab544fac in MPIOI_File_write_all ()                                                                                           │#4  0x00002aaaab544fac in MPIOI_File_write_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#5  0x00002aaaab545531 in mca_io_romio_dist_MPI_File_write_at_all ()                                                                        │#5  0x00002aaaab545531 in mca_io_romio_dist_MPI_File_write_at_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#6  0x00002aaaab514922 in mca_io_romio321_file_write_at_all ()                                                                              │#6  0x00002aaaab514922 in mca_io_romio321_file_write_at_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#7  0x00002aaaab4848a8 in PMPI_File_write_at_all ()                                                                                         │#7  0x00002aaaab4848a8 in PMPI_File_write_at_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#8  0x000000000073d5a5 in H5FD__mpio_write (_file=0xceec90, type=H5FD_MEM_DRAW, dxpl_id=<optimized out>, addr=3688, size=<optimized out>,   │#8  0x000000000073d5a5 in H5FD__mpio_write (_file=0xceec30, type=H5FD_MEM_DRAW, dxpl_id=<optimized out>, addr=16780904, 
    buf=0x2aaaba5fb010) at H5FDmpio.c:1466                                                                                                  │    size=<optimized out>, buf=0x2aaaba5fb010) at H5FDmpio.c:1466
#9  0x00000000004f2413 in H5FD_write (file=file@entry=0xceec90, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=3688, size=size@entry=1,     │#9  0x00000000004f2413 in H5FD_write (file=file@entry=0xceec30, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=16780904, size=size@entry=1, 
    buf=buf@entry=0x2aaaba5fb010) at H5FDint.c:248                                                                                          │    buf=buf@entry=0x2aaaba5fb010) at H5FDint.c:248
#10 0x000000000077eea5 in H5F__accum_write (f_sh=f_sh@entry=0xcf02d0, map_type=map_type@entry=H5FD_MEM_DRAW, addr=addr@entry=3688,          │#10 0x000000000077eea5 in H5F__accum_write (f_sh=f_sh@entry=0xcf02d0, map_type=map_type@entry=H5FD_MEM_DRAW, addr=addr@entry=16780904, 
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5Faccum.c:826                                                                      │    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5Faccum.c:826
#11 0x00000000005ef5b7 in H5PB_write (f_sh=f_sh@entry=0xcf02d0, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=3688, size=size@entry=1,     │#11 0x00000000005ef5b7 in H5PB_write (f_sh=f_sh@entry=0xcf02d0, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=16780904, size=size@entry=1, 
    buf=buf@entry=0x2aaaba5fb010) at H5PB.c:1031                                                                                            │    buf=buf@entry=0x2aaaba5fb010) at H5PB.c:1031
#12 0x00000000004d9079 in H5F_shared_block_write (f_sh=0xcf02d0, type=type@entry=H5FD_MEM_DRAW, addr=3688, size=size@entry=1,               │#12 0x00000000004d9079 in H5F_shared_block_write (f_sh=0xcf02d0, type=type@entry=H5FD_MEM_DRAW, addr=16780904, size=size@entry=1, 
    buf=0x2aaaba5fb010) at H5Fio.c:205                                                                                                      │    buf=0x2aaaba5fb010) at H5Fio.c:205
#13 0x000000000073a113 in H5D__mpio_select_write (io_info=0x7fffffff82e0, type_info=<optimized out>, mpi_buf_count=1,                       │#13 0x000000000073a113 in H5D__mpio_select_write (io_info=0x7fffffff82e0, type_info=<optimized out>, mpi_buf_count=1, 
    file_space=<optimized out>, mem_space=<optimized out>) at H5Dmpio.c:490                                                                 │    file_space=<optimized out>, mem_space=<optimized out>) at H5Dmpio.c:490
#14 0x0000000000730e2b in H5D__final_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,         │#14 0x0000000000730e2b in H5D__final_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260, 
    mpi_buf_count=mpi_buf_count@entry=1, mpi_file_type=0xd70760, mpi_buf_type=0xd717a0) at H5Dmpio.c:2124                                   │    mpi_buf_count=mpi_buf_count@entry=1, mpi_file_type=0xd6f8d0, mpi_buf_type=0xd70910) at H5Dmpio.c:2124
#15 0x0000000000736129 in H5D__link_chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,    │#15 0x0000000000736129 in H5D__link_chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260, 
    fm=fm@entry=0xd110c0, sum_chunk=<optimized out>) at H5Dmpio.c:1234                                                                      │    fm=fm@entry=0xd10800, sum_chunk=<optimized out>) at H5Dmpio.c:1234
#16 0x0000000000739b11 in H5D__chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,         │#16 0x0000000000739b11 in H5D__chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260, 
    fm=fm@entry=0xd110c0) at H5Dmpio.c:883                                                                                                  │    fm=fm@entry=0xd10800) at H5Dmpio.c:883
#17 0x000000000073a519 in H5D__chunk_collective_write (io_info=0x7fffffff82e0, type_info=0x7fffffff8260, nelmts=<optimized out>,            │#17 0x000000000073a519 in H5D__chunk_collective_write (io_info=0x7fffffff82e0, type_info=0x7fffffff8260, nelmts=<optimized out>, 
    file_space=<optimized out>, mem_space=<optimized out>, fm=0xd110c0) at H5Dmpio.c:960                                                    │    file_space=<optimized out>, mem_space=<optimized out>, fm=0xd10800) at H5Dmpio.c:960
#18 0x00000000004955ac in H5D__write (dataset=dataset@entry=0xcf4db0, mem_type_id=mem_type_id@entry=216172782113783850,                     │#18 0x00000000004955ac in H5D__write (dataset=dataset@entry=0xcf46e0, mem_type_id=mem_type_id@entry=216172782113783850, mem_space=0xce4fd0, 
    mem_space=0xce5050, file_space=0xce2f40, buf=<optimized out>, buf@entry=0x2aaaba5fb010) at H5Dio.c:780                                  │    file_space=0xce2ec0, buf=<optimized out>, buf@entry=0x2aaaba5fb010) at H5Dio.c:780
#19 0x00000000007038d8 in H5VL__native_dataset_write (obj=0xcf4db0, mem_type_id=216172782113783850, mem_space_id=288230376151711748,        │#19 0x00000000007038d8 in H5VL__native_dataset_write (obj=0xcf46e0, mem_type_id=216172782113783850, mem_space_id=288230376151711748, 
    file_space_id=288230376151711747, dxpl_id=<optimized out>, buf=0x2aaaba5fb010, req=0x0) at H5VLnative_dataset.c:206                     │    file_space_id=288230376151711747, dxpl_id=<optimized out>, buf=0x2aaaba5fb010, req=0x0) at H5VLnative_dataset.c:206
#20 0x00000000006e36e2 in H5VL__dataset_write (obj=0xcf4db0, cls=0xac3520, mem_type_id=mem_type_id@entry=216172782113783850,                │#20 0x00000000006e36e2 in H5VL__dataset_write (obj=0xcf46e0, cls=0xac3520, mem_type_id=mem_type_id@entry=216172782113783850, 
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,                               │    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747, 
    dxpl_id=dxpl_id@entry=792633534417207318, buf=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2151                                           │    dxpl_id=dxpl_id@entry=792633534417207318, buf=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2151
#21 0x00000000006ecaa5 in H5VL_dataset_write (vol_obj=vol_obj@entry=0xcf4c50, mem_type_id=mem_type_id@entry=216172782113783850,             │#21 0x00000000006ecaa5 in H5VL_dataset_write (vol_obj=vol_obj@entry=0xcf4580, mem_type_id=mem_type_id@entry=216172782113783850, 
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,                               │    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747, 
    dxpl_id=dxpl_id@entry=792633534417207318, buf=buf@entry=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2185                                 │    dxpl_id=dxpl_id@entry=792633534417207318, buf=buf@entry=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2185
#22 0x0000000000493d8f in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113783850, mem_space_id=288230376151711748,               │#22 0x0000000000493d8f in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113783850, mem_space_id=288230376151711748, 
    file_space_id=288230376151711747, dxpl_id=792633534417207318, buf=0x2aaaba5fb010) at H5Dio.c:313                                        │    file_space_id=288230376151711747, dxpl_id=792633534417207318, buf=0x2aaaba5fb010) at H5Dio.c:313
#23 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98                                                               │#23 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#0  0x00002aaab8220b46 in psm2_mq_ipeek2 () from /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/libpsm2.so.2                   │#0  0x00002aaaabf92de8 in opal_progress ()
#1  0x00002aaab8002409 in ompi_mtl_psm2_progress ()                                                                                         │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libopen-pal.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/openmpi/mca_mtl_psm2.so                   │#1  0x00002aaaab45f435 in ompi_request_default_wait ()
#2  0x00002aaaabf92e0b in opal_progress ()                                                                                                  │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libopen-pal.so.40                         │#2  0x00002aaaab4bf303 in ompi_coll_base_sendrecv_actual ()
#3  0x00002aaaab45f435 in ompi_request_default_wait ()                                                                                      │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#3  0x00002aaaab4bf739 in ompi_coll_base_allreduce_intra_recursivedoubling ()
#4  0x00002aaaab4bf303 in ompi_coll_base_sendrecv_actual ()                                                                                 │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#4  0x00002aaaab4735b8 in PMPI_Allreduce ()
#5  0x00002aaaab4bf739 in ompi_coll_base_allreduce_intra_recursivedoubling ()                                                               │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#5  0x00002aaaab543afc in mca_io_romio_dist_MPI_File_set_view ()
#6  0x00002aaaab4735b8 in PMPI_Allreduce ()                                                                                                 │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#6  0x00002aaaab5139ab in mca_io_romio321_file_set_view ()
#7  0x00002aaaab543afc in mca_io_romio_dist_MPI_File_set_view ()                                                                            │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#7  0x00002aaaab483d68 in PMPI_File_set_view ()
#8  0x00002aaaab5139ab in mca_io_romio321_file_set_view ()                                                                                  │   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#8  0x000000000073d5cd in H5FD__mpio_write (_file=0xceeb70, type=H5FD_MEM_DRAW, dxpl_id=<optimized out>, addr=50340568, 
#9  0x00002aaaab483d68 in PMPI_File_set_view ()                                                                                             │    size=<optimized out>, buf=0x2aaaba5fb010) at H5FDmpio.c:1481
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40                              │#9  0x00000000004f2413 in H5FD_write (file=file@entry=0xceeb70, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=50340568, size=size@entry=1, 
#10 0x000000000073d5cd in H5FD__mpio_write (_file=0xceeb90, type=H5FD_MEM_DRAW, dxpl_id=<optimized out>, addr=33558120,                     │    buf=buf@entry=0x2aaaba5fb010) at H5FDint.c:248
    size=<optimized out>, buf=0x2aaaba5fb010) at H5FDmpio.c:1481                                                                            │#10 0x000000000077eea5 in H5F__accum_write (f_sh=f_sh@entry=0xcf0260, map_type=map_type@entry=H5FD_MEM_DRAW, addr=addr@entry=50340568, 
#11 0x00000000004f2413 in H5FD_write (file=file@entry=0xceeb90, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=33558120,                    │    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5Faccum.c:826
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5FDint.c:248                                                                       │#11 0x00000000005ef5b7 in H5PB_write (f_sh=f_sh@entry=0xcf0260, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=50340568, size=size@entry=1, 
#12 0x000000000077eea5 in H5F__accum_write (f_sh=f_sh@entry=0xcf0270, map_type=map_type@entry=H5FD_MEM_DRAW, addr=addr@entry=33558120,      │    buf=buf@entry=0x2aaaba5fb010) at H5PB.c:1031
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5Faccum.c:826                                                                      │#12 0x00000000004d9079 in H5F_shared_block_write (f_sh=0xcf0260, type=type@entry=H5FD_MEM_DRAW, addr=50340568, size=size@entry=1, 
#13 0x00000000005ef5b7 in H5PB_write (f_sh=f_sh@entry=0xcf0270, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=33558120,                    │    buf=0x2aaaba5fb010) at H5Fio.c:205
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5PB.c:1031                                                                         │#13 0x000000000073a113 in H5D__mpio_select_write (io_info=0x7fffffff82e0, type_info=<optimized out>, mpi_buf_count=1, 
#14 0x00000000004d9079 in H5F_shared_block_write (f_sh=0xcf0270, type=type@entry=H5FD_MEM_DRAW, addr=33558120, size=size@entry=1,           │    file_space=<optimized out>, mem_space=<optimized out>) at H5Dmpio.c:490
    buf=0x2aaaba5fb010) at H5Fio.c:205                                                                                                      │#14 0x0000000000730e2b in H5D__final_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260, 
#15 0x000000000073a113 in H5D__mpio_select_write (io_info=0x7fffffff82e0, type_info=<optimized out>, mpi_buf_count=1,                       │    mpi_buf_count=mpi_buf_count@entry=1, mpi_file_type=0xd6f7f0, mpi_buf_type=0xd70830) at H5Dmpio.c:2124
    file_space=<optimized out>, mem_space=<optimized out>) at H5Dmpio.c:490                                                                 │#15 0x0000000000736129 in H5D__link_chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260, 
#16 0x0000000000730e2b in H5D__final_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,         │    fm=fm@entry=0xd10780, sum_chunk=<optimized out>) at H5Dmpio.c:1234
    mpi_buf_count=mpi_buf_count@entry=1, mpi_file_type=0xd6f800, mpi_buf_type=0xd70840) at H5Dmpio.c:2124                                   │#16 0x0000000000739b11 in H5D__chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260, 
#17 0x0000000000736129 in H5D__link_chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,    │    fm=fm@entry=0xd10780) at H5Dmpio.c:883
    fm=fm@entry=0xd10790, sum_chunk=<optimized out>) at H5Dmpio.c:1234                                                                      │#17 0x000000000073a519 in H5D__chunk_collective_write (io_info=0x7fffffff82e0, type_info=0x7fffffff8260, nelmts=<optimized out>, 
#18 0x0000000000739b11 in H5D__chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    fm=fm@entry=0xd10790) at H5Dmpio.c:883
#19 0x000000000073a519 in H5D__chunk_collective_write (io_info=0x7fffffff82e0, type_info=0x7fffffff8260, nelmts=<optimized out>,           
    file_space=<optimized out>, mem_space=<optimized out>, fm=0xd10790) at H5Dmpio.c:960                                                    │    file_space=<optimized out>, mem_space=<optimized out>, fm=0xd10780) at H5Dmpio.c:960
#20 0x00000000004955ac in H5D__write (dataset=dataset@entry=0xcf4630, mem_type_id=mem_type_id@entry=216172782113783850,                     │#18 0x00000000004955ac in H5D__write (dataset=dataset@entry=0xcf4620, mem_type_id=mem_type_id@entry=216172782113783850, mem_space=0xce4ff0, 
    mem_space=0xce4ff0, file_space=0xce2ee0, buf=<optimized out>, buf@entry=0x2aaaba5fb010) at H5Dio.c:780                                  │    file_space=0xce2ee0, buf=<optimized out>, buf@entry=0x2aaaba5fb010) at H5Dio.c:780
#21 0x00000000007038d8 in H5VL__native_dataset_write (obj=0xcf4630, mem_type_id=216172782113783850, mem_space_id=288230376151711748,        │#19 0x00000000007038d8 in H5VL__native_dataset_write (obj=0xcf4620, mem_type_id=216172782113783850, mem_space_id=288230376151711748, 
    file_space_id=288230376151711747, dxpl_id=<optimized out>, buf=0x2aaaba5fb010, req=0x0) at H5VLnative_dataset.c:206                     │    file_space_id=288230376151711747, dxpl_id=<optimized out>, buf=0x2aaaba5fb010, req=0x0) at H5VLnative_dataset.c:206
#22 0x00000000006e36e2 in H5VL__dataset_write (obj=0xcf4630, cls=0xac3520, mem_type_id=mem_type_id@entry=216172782113783850,                │#20 0x00000000006e36e2 in H5VL__dataset_write (obj=0xcf4620, cls=0xac3520, mem_type_id=mem_type_id@entry=216172782113783850, 
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,                               │    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747, 
    dxpl_id=dxpl_id@entry=792633534417207318, buf=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2151                                           │    dxpl_id=dxpl_id@entry=792633534417207318, buf=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2151
#23 0x00000000006ecaa5 in H5VL_dataset_write (vol_obj=vol_obj@entry=0xcf44d0, mem_type_id=mem_type_id@entry=216172782113783850,             │#21 0x00000000006ecaa5 in H5VL_dataset_write (vol_obj=vol_obj@entry=0xcf44c0, mem_type_id=mem_type_id@entry=216172782113783850, 
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,                               │    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747, 
    dxpl_id=dxpl_id@entry=792633534417207318, buf=buf@entry=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2185                                 │    dxpl_id=dxpl_id@entry=792633534417207318, buf=buf@entry=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2185
#24 0x0000000000493d8f in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113783850, mem_space_id=288230376151711748,               │#22 0x0000000000493d8f in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113783850, mem_space_id=288230376151711748, 
    file_space_id=288230376151711747, dxpl_id=792633534417207318, buf=0x2aaaba5fb010) at H5Dio.c:313                                        │    file_space_id=288230376151711747, dxpl_id=792633534417207318, buf=0x2aaaba5fb010) at H5Dio.c:313
#25 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98                                                               │#23 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98

After we found out that we can work around the crash mentioned above by setting the OpenMPI I/O backend to ompio, and that we have to use a release build of HDF5 to get around the crash described in Crash when freeing user-provided buffer on filter callback, we found that the minimal test also crashes if we set

#define _COMPRESS
#define CHUNK1 256
#define NCHUNK1 8192

This is the 1.12 branch with the patch from @jhenderson, built both as a debug and as a release build:

rank=1 writing dataset2
rank=3 writing dataset2
rank=2 writing dataset2
rank=0 writing dataset2
[cdr1042:149555:0:149555] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1211e95c)
==== backtrace ====
 0 0x0000000000033280 killpg()  ???:0
 1 0x0000000000145c24 __memcpy_avx512_unaligned_erms()  ???:0
 2 0x000000000006ac5c opal_generic_simple_pack()  ???:0
 3 0x00000000000040cf ompi_mtl_psm2_isend()  ???:0
 4 0x00000000001c772b mca_pml_cm_isend()  pml_cm.c:0
 5 0x000000000011a1ef shuffle_init.isra.1()  fcoll_dynamic_gen2_file_write_all.c:0
 6 0x000000000011c11b mca_fcoll_dynamic_gen2_file_write_all()  ???:0
 7 0x00000000000bcd7e mca_common_ompio_file_write_at_all()  ???:0
 8 0x0000000000159b96 mca_io_ompio_file_write_at_all()  ???:0
 9 0x00000000000958a8 PMPI_File_write_at_all()  ???:0
10 0x000000000073d5b6 H5FD__mpio_write()  /scratch/rickn/hdf5/src/H5FDmpio.c:1466
11 0x00000000004f2424 H5FD_write()  /scratch/rickn/hdf5/src/H5FDint.c:248
12 0x000000000077eeb6 H5F__accum_write()  /scratch/rickn/hdf5/src/H5Faccum.c:826
13 0x00000000005ef5c8 H5PB_write()  /scratch/rickn/hdf5/src/H5PB.c:1031
14 0x00000000004d92d0 H5F_block_write()  /scratch/rickn/hdf5/src/H5Fio.c:251
15 0x000000000044d0ea H5C__flush_single_entry()  /scratch/rickn/hdf5/src/H5C.c:6109
16 0x000000000072d01b H5C__flush_candidates_in_ring()  /scratch/rickn/hdf5/src/H5Cmpio.c:1372
17 0x000000000072d989 H5C__flush_candidate_entries()  /scratch/rickn/hdf5/src/H5Cmpio.c:1193
18 0x000000000072f603 H5C_apply_candidate_list()  /scratch/rickn/hdf5/src/H5Cmpio.c:386
19 0x000000000072ace3 H5AC__propagate_and_apply_candidate_list()  /scratch/rickn/hdf5/src/H5ACmpio.c:1276
20 0x000000000072af40 H5AC__rsp__dist_md_write__flush_to_min_clean()  /scratch/rickn/hdf5/src/H5ACmpio.c:1835
21 0x000000000072cc0c H5AC__run_sync_point()  /scratch/rickn/hdf5/src/H5ACmpio.c:2157
22 0x0000000000422a89 H5AC_unprotect()  /scratch/rickn/hdf5/src/H5AC.c:1568
23 0x000000000075006b H5B__insert_helper()  /scratch/rickn/hdf5/src/H5B.c:1101
24 0x00000000007507fc H5B__insert_helper()  /scratch/rickn/hdf5/src/H5B.c:998
25 0x0000000000750e1f H5B_insert()  /scratch/rickn/hdf5/src/H5B.c:596
26 0x0000000000753dde H5D__btree_idx_insert()  /scratch/rickn/hdf5/src/H5Dbtree.c:1009
27 0x0000000000735772 H5D__link_chunk_filtered_collective_io()  /scratch/rickn/hdf5/src/H5Dmpio.c:1462
28 0x0000000000739abe H5D__chunk_collective_io()  /scratch/rickn/hdf5/src/H5Dmpio.c:878
29 0x000000000073a52a H5D__chunk_collective_write()  /scratch/rickn/hdf5/src/H5Dmpio.c:960
30 0x00000000004955bd H5D__write()  /scratch/rickn/hdf5/src/H5Dio.c:780
31 0x00000000007038e9 H5VL__native_dataset_write()  /scratch/rickn/hdf5/src/H5VLnative_dataset.c:206
32 0x00000000006e36f3 H5VL__dataset_write()  /scratch/rickn/hdf5/src/H5VLcallback.c:2151
33 0x00000000006ecab6 H5VL_dataset_write()  /scratch/rickn/hdf5/src/H5VLcallback.c:2185
34 0x0000000000493da0 H5Dwrite()  /scratch/rickn/hdf5/src/H5Dio.c:313
35 0x0000000000404183 main()  /scratch/rickn/test_hdf5/test_orig.c:111
36 0x00000000000202e0 __libc_start_main()  ???:0
37 0x0000000000403c5a _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================

With the release of HDF5 1.10.7, I wanted to run the tests again to see if anything changed. I found that the minimal test from the top of this thread fails both with ompio and romio321:

$mpirun -np 4 --mca io ompio ./testh5
MPI rank [0/4]
rank=0 creating file
MPI rank [1/4]
rank=1 creating file
MPI rank [2/4]
rank=2 creating file
MPI rank [3/4]
rank=3 creating file
rank=0 creating selection [0:4, 0:4194304]
rank=1 creating selection [4:8, 0:4194304]
rank=2 creating selection [8:12, 0:4194304]
rank=3 creating selection [12:16, 0:4194304]
rank=2 creating dataset1
rank=0 creating dataset1
rank=1 creating dataset1
rank=3 creating dataset1
rank=0 writing dataset1
rank=2 writing dataset1
rank=3 writing dataset1
rank=1 writing dataset1
rank=2 finished writing dataset1
rank=2 creating dataset2
rank=0 finished writing dataset1
rank=0 creating dataset2
rank=3 finished writing dataset1
rank=3 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
  #000: H5D.c line 152 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 338 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1605 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1846 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 579 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1118 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007: H5Gobj.c line 324 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #008: H5Omessage.c line 873 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #009: H5Oint.c line 1056 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #010: H5AC.c line 1517 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #011: H5C.c line 2454 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #012: H5C.c line 2454 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=2 writing dataset2
rank=1 finished writing dataset1
rank=1 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
  #000: H5D.c line 152 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 338 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1605 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1846 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 579 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1118 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007: H5Gobj.c line 324 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #008: H5Omessage.c line 873 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #009: H5Oint.c line 1056 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #010: H5AC.c line 1517 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #011: H5C.c line 2454 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #012: H5C.c line 2454 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
  #000: H5Dio.c line 313 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
  #000: H5D.c line 334 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
  #000: H5Dio.c line 313 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
  #000: H5D.c line 334 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything

Hi again @jrichardshaw, @rick and @wkliao,

after an unfortunately long time (2 years!), I’ve finally been able to get back to looking at this part of HDF5 and was able to come up with a fix specifically for the cases above where an MPI_ERR_TRUNCATE error/crash was happening. With this fix in place, I am able to run the small example in this thread under HDF5 1.13, 1.12 and 1.10 with all chunking parameters available in the example program and with compression both enabled and disabled. This fix is being merged to the development branches for HDF5 1.13, 1.12 and 1.10.

While this is hopefully good news, there are two concerns I have in regards to this fix and the overall parallel compression feature that I hope others may be able to comment on or help with:

  1. I never encountered the MPI_ERR_TRUNCATE issue with compression enabled, only with it disabled. With compression enabled, the example in this thread always seemed to run fine for me. However, one of @jrichardshaw’s earlier posts seemed to hint that an error was still encountered with certain chunking parameters and compression enabled. If this is still the case after this fix, I’d definitely be interested in knowing about it, since I’m actively working on the parallel compression feature.

  2. A few other kinds of crashes reported in this thread appear unrelated to the MPI_ERR_TRUNCATE issue. While at least one seems to have been related to the I/O backend that MPI was using, it is unclear whether all of the other issues have been resolved. So again, if these issues are still encountered once my fix is in place, please let me know so we can try to figure out what’s going on.


Hi @jhenderson !

I have been trying to write a large 3D array of shape (3, Superlarge, 4000) in parallel, where I split Superlarge into digestible chunks. Superlarge could be >10,000,000. For smaller arrays I do not get this “address not mapped to object at address” error, only for large arrays. Are there any limitations on compressed, chunked, parallel writing that I’m not aware of?

Thanks!

Cheers,

Lucas