Crash when writing parallel compressed chunks


#21

Just an observation from a run compiled with 1.10.6 plus the patch @jhenderson provided earlier in this thread.

When the 2nd H5Dcreate call is moved to before the 1st H5Dwrite in @jrichardshaw’s test program, the error goes away. It looks like the problem occurs when calls to H5Dcreate and H5Dwrite are interleaved.
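For reference, the two orderings look roughly like this (a sketch only; the identifiers file, space, memspace, filespace, dcpl, dxpl and buf are placeholders, not the exact code from the gist):

```c
/* Variant A: interleaved ordering (fails). The 2nd H5Dcreate comes
 * after a collective H5Dwrite. */
hid_t d1 = H5Dcreate2(file, "dataset1", H5T_NATIVE_DOUBLE, space,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);
H5Dwrite(d1, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
hid_t d2 = H5Dcreate2(file, "dataset2", H5T_NATIVE_DOUBLE, space,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);  /* error appears here */
H5Dwrite(d2, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
```

```c
/* Variant B: reordered (works). Both creates happen before any write. */
hid_t d1 = H5Dcreate2(file, "dataset1", H5T_NATIVE_DOUBLE, space,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);
hid_t d2 = H5Dcreate2(file, "dataset2", H5T_NATIVE_DOUBLE, space,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);
H5Dwrite(d1, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
H5Dwrite(d2, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
```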

After reading through H5C.c and adding a few printf statements, it appears that the value of entry_ptr->coll_access checked at line 2271 is not consistent across the 4 running processes, so only 2 of the 4 processes call MPI_Bcast at line 2297, which causes the error.
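This kind of inconsistency produces mismatched collectives. As a contrived plain-MPI illustration (not the actual HDF5 code path): if only some ranks enter a Bcast, a later Bcast on the same communicator can match the wrong message, and depending on the MPI implementation and message sizes this shows up as a hang or as MPI_ERR_TRUNCATE, as seen here.

```c
/* Contrived sketch of mismatched collective participation.
 * Build with mpicc, run with e.g. mpirun -np 4. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double big[1024] = {0};
    int small = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Suppose a cached flag (like entry_ptr->coll_access) disagrees
     * across ranks, so only even ranks think this Bcast is needed. */
    if (rank % 2 == 0)
        MPI_Bcast(big, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* On the ranks that skipped the call above, this small Bcast can
     * now match the large pending message: typically MPI_ERR_TRUNCATE
     * or a hang, depending on the implementation. */
    MPI_Bcast(&small, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```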


#22

Just to follow up. The recent set of parameters in the Gist also fails on HDF5 1.12.0 (a separate discussion with some HDF5 staff had suggested it might be fixed, which got my hopes up). Crash output below (although I think it’s largely the same as before):

MPI rank [0/4]
rank=0 creating file
MPI rank [1/4]
rank=1 creating file
MPI rank [2/4]
rank=2 creating file
MPI rank [3/4]
rank=3 creating file
rank=0 creating selection [0:4, 0:4194304]
rank=0 creating dataset1
rank=1 creating selection [4:8, 0:4194304]
rank=1 creating dataset1
rank=2 creating selection [8:12, 0:4194304]
rank=2 creating dataset1
rank=3 creating selection [12:16, 0:4194304]
rank=3 creating dataset1
rank=1 writing dataset1
rank=2 writing dataset1
rank=0 writing dataset1
rank=3 writing dataset1
rank=3 finished writing dataset1
rank=3 waiting at barrier
rank=0 finished writing dataset1
rank=0 waiting at barrier
rank=1 finished writing dataset1
rank=1 waiting at barrier
rank=0 creating dataset2
rank=2 finished writing dataset1
rank=2 waiting at barrier
rank=2 creating dataset2
rank=3 creating dataset2
rank=1 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 3:
  #000: H5D.c line 151 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5VLcallback.c line 1869 in H5VL_dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #002: H5VLcallback.c line 1835 in H5VL__dataset_create(): dataset create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLnative_dataset.c line 75 in H5VL__native_dataset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #004: H5Dint.c line 411 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #005: H5L.c line 1804 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #006: H5L.c line 2045 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #007: H5Gtraverse.c line 855 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #008: H5Gtraverse.c line 585 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #009: H5Gobj.c line 1125 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #010: H5Gobj.c line 326 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #011: H5Omessage.c line 883 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #012: H5Oint.c line 1082 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #013: H5AC.c line 1312 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #014: H5C.c line 2299 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #015: H5C.c line 2299 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
[MPI-process 2 emits the identical trace; the two were interleaved in the raw output]
rank=2 writing dataset2
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 2:
  #000: H5Dio.c line 300 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 2:
  #000: H5D.c line 332 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 3:
  #000: H5Dio.c line 300 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 3:
  #000: H5D.c line 332 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything

#23

Hi @jrichardshaw,

unfortunately there hasn’t been much time to look at this. However, we do know of some other folks who are looking for a fix to this issue as well. Based on @wkliao’s observation, I’m fairly certain it’s just a matter of inserting barriers in the appropriate places in the library’s code. I remember having trouble reproducing this with your example, so I wasn’t quite able to confirm that this really was the source of the issue, but I’m thinking that running several rounds of

H5Dcreate(...);
H5Dwrite(...);

should eventually produce the issue for me. In any case, I believe there should be more info on this issue relatively soon.
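Sketched out, those rounds might look like this (the identifiers NROUNDS, file, filespace, memspace, dcpl, dxpl and buf are placeholders, not the actual test program):

```c
/* Sketch: each round interleaves a collective H5Dcreate with a
 * collective H5Dwrite, the pattern suspected of triggering the bug. */
char name[32];
for (int i = 0; i < NROUNDS; i++) {
    snprintf(name, sizeof(name), "dataset%d", i);
    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
    H5Dclose(dset);
}
```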


#24

Hi again @jrichardshaw, @wkliao and others in this thread. I’ve narrowed down the cause of this issue and will have a small patch to post after I’ve discussed the fix with other developers. Provided that that patch works here and doesn’t cause further issues, we should be able to get the fix in quickly afterwards.


#25

Wonderful. Thanks @jhenderson! I’ll be happy to test the patch whenever you post it.


#26

Hi @jrichardshaw and @wkliao,

attached is a small patch against the 1.12 branch that temporarily disables the collective metadata reads feature in HDF5, which should make the issue disappear for now. However, this is only a temporary fix and may affect performance. The issue stems from an oversight in the design of the collective metadata reads feature that has effectively been masked until recently, and it will need to be fixed properly. While this feature wasn’t explicitly enabled in your test program, there are some cases where the library implicitly turns it on because metadata modifications need to be collective, such as during H5Dcreate. That behavior, combined with your chosen chunk size and number of chunks, was right on the line needed to make the issue appear. The timeline for fixing this correctly isn’t clear yet, but we hope to be able to fix it in time for the next release of HDF5.
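For readers who want to rule this feature in or out in their own programs: the collective metadata behavior the patch disables inside the library is the same one applications control through the file-access property list. A sketch using the public API (this is not the contents of the patch):

```c
/* Sketch: explicitly controlling collective metadata operations on a
 * parallel HDF5 file. Requires an MPI-enabled HDF5 build (1.10+). */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

/* Request (or avoid) collective metadata *reads* for all operations
 * on the file. The bug discussed in this thread is in this feature,
 * which the library can also turn on implicitly, e.g. around
 * H5Dcreate. */
H5Pset_all_coll_metadata_ops(fapl, 0 /* false: independent reads */);

/* Collective metadata *writes* are controlled separately. */
H5Pset_coll_metadata_write(fapl, 1 /* true */);

hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
```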

disable_coll_md_reads.patch (480 Bytes)


#27

Thanks for the patch @jhenderson. We’ve been testing it but we’re still seeing failures. One of my colleagues has posted a fuller description (the post is awaiting approval), but what we’re finding is that the patch works with the nominal test case above, but if we go back to the first set of parameters (CHUNK1=32768; NCHUNK1=32), it hangs. This seems more similar to the first issue found in this thread.

Anyway, I think my colleague’s pending post has more details (including stack traces), so I won’t try to repeat them here.


#28

Thanks for the latest patch @jhenderson.
I applied it to both the HEAD of the hdf5_1_12 branch and the tag hdf5-1_12_0.
Unfortunately, the minimal test supplied by @jrichardshaw still hangs against both builds if I uncomment

// Equivalent to original gist
// Works on 1.10.5 with patch, crashes on 1.10.5 vanilla and hangs on 1.10.6
#define CHUNK1 32768
#define NCHUNK1 32

These are the stack traces I got using tmpi 4 gdb ./testh5:

Two of the four ranks are blocked inside the collective write itself (the paired tmpi panes show essentially the same trace, differing only in addresses and file offsets, so one trace per pair is shown):

#0  0x00002aaaab49e9a7 in PMPI_Type_size_x ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#1  0x00002aaaab52d0f3 in ADIOI_GEN_WriteContig ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#2  0x00002aaaab531323 in ADIOI_GEN_WriteStrided ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#3  0x00002aaaab52faab in ADIOI_GEN_WriteStridedColl ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#4  0x00002aaaab544fac in MPIOI_File_write_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#5  0x00002aaaab545531 in mca_io_romio_dist_MPI_File_write_at_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#6  0x00002aaaab514922 in mca_io_romio321_file_write_at_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#7  0x00002aaaab4848a8 in PMPI_File_write_at_all ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#8  0x000000000073d5a5 in H5FD__mpio_write (_file=0xceec90, type=H5FD_MEM_DRAW, dxpl_id=<optimized out>, addr=3688, size=<optimized out>,
    buf=0x2aaaba5fb010) at H5FDmpio.c:1466
#9  0x00000000004f2413 in H5FD_write (file=file@entry=0xceec90, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=3688, size=size@entry=1,
    buf=buf@entry=0x2aaaba5fb010) at H5FDint.c:248
#10 0x000000000077eea5 in H5F__accum_write (f_sh=f_sh@entry=0xcf02d0, map_type=map_type@entry=H5FD_MEM_DRAW, addr=addr@entry=3688,
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5Faccum.c:826
#11 0x00000000005ef5b7 in H5PB_write (f_sh=f_sh@entry=0xcf02d0, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=3688, size=size@entry=1,
    buf=buf@entry=0x2aaaba5fb010) at H5PB.c:1031
#12 0x00000000004d9079 in H5F_shared_block_write (f_sh=0xcf02d0, type=type@entry=H5FD_MEM_DRAW, addr=3688, size=size@entry=1,
    buf=0x2aaaba5fb010) at H5Fio.c:205
#13 0x000000000073a113 in H5D__mpio_select_write (io_info=0x7fffffff82e0, type_info=<optimized out>, mpi_buf_count=1,
    file_space=<optimized out>, mem_space=<optimized out>) at H5Dmpio.c:490
#14 0x0000000000730e2b in H5D__final_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    mpi_buf_count=mpi_buf_count@entry=1, mpi_file_type=0xd70760, mpi_buf_type=0xd717a0) at H5Dmpio.c:2124
#15 0x0000000000736129 in H5D__link_chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    fm=fm@entry=0xd110c0, sum_chunk=<optimized out>) at H5Dmpio.c:1234
#16 0x0000000000739b11 in H5D__chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    fm=fm@entry=0xd110c0) at H5Dmpio.c:883
#17 0x000000000073a519 in H5D__chunk_collective_write (io_info=0x7fffffff82e0, type_info=0x7fffffff8260, nelmts=<optimized out>,
    file_space=<optimized out>, mem_space=<optimized out>, fm=0xd110c0) at H5Dmpio.c:960
#18 0x00000000004955ac in H5D__write (dataset=dataset@entry=0xcf4db0, mem_type_id=mem_type_id@entry=216172782113783850,
    mem_space=0xce5050, file_space=0xce2f40, buf=<optimized out>, buf@entry=0x2aaaba5fb010) at H5Dio.c:780
#19 0x00000000007038d8 in H5VL__native_dataset_write (obj=0xcf4db0, mem_type_id=216172782113783850, mem_space_id=288230376151711748,
    file_space_id=288230376151711747, dxpl_id=<optimized out>, buf=0x2aaaba5fb010, req=0x0) at H5VLnative_dataset.c:206
#20 0x00000000006e36e2 in H5VL__dataset_write (obj=0xcf4db0, cls=0xac3520, mem_type_id=mem_type_id@entry=216172782113783850,
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,
    dxpl_id=dxpl_id@entry=792633534417207318, buf=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2151
#21 0x00000000006ecaa5 in H5VL_dataset_write (vol_obj=vol_obj@entry=0xcf4c50, mem_type_id=mem_type_id@entry=216172782113783850,
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,
    dxpl_id=dxpl_id@entry=792633534417207318, buf=buf@entry=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2185
#22 0x0000000000493d8f in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113783850, mem_space_id=288230376151711748,
    file_space_id=288230376151711747, dxpl_id=792633534417207318, buf=0x2aaaba5fb010) at H5Dio.c:313
#23 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98

The other two ranks are blocked in MPI_File_set_view, inside the same H5FD__mpio_write call:

#0  0x00002aaab8220b46 in psm2_mq_ipeek2 () from /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/libpsm2.so.2
#1  0x00002aaab8002409 in ompi_mtl_psm2_progress ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/openmpi/mca_mtl_psm2.so
#2  0x00002aaaabf92e0b in opal_progress ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libopen-pal.so.40
#3  0x00002aaaab45f435 in ompi_request_default_wait ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#4  0x00002aaaab4bf303 in ompi_coll_base_sendrecv_actual ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#5  0x00002aaaab4bf739 in ompi_coll_base_allreduce_intra_recursivedoubling ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#6  0x00002aaaab4735b8 in PMPI_Allreduce ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#7  0x00002aaaab543afc in mca_io_romio_dist_MPI_File_set_view ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#8  0x00002aaaab5139ab in mca_io_romio321_file_set_view ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#9  0x00002aaaab483d68 in PMPI_File_set_view ()
   from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#10 0x000000000073d5cd in H5FD__mpio_write (_file=0xceeb90, type=H5FD_MEM_DRAW, dxpl_id=<optimized out>, addr=33558120,
    size=<optimized out>, buf=0x2aaaba5fb010) at H5FDmpio.c:1481
#11 0x00000000004f2413 in H5FD_write (file=file@entry=0xceeb90, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=33558120,
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5FDint.c:248
#12 0x000000000077eea5 in H5F__accum_write (f_sh=f_sh@entry=0xcf0270, map_type=map_type@entry=H5FD_MEM_DRAW, addr=addr@entry=33558120,
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5Faccum.c:826
#13 0x00000000005ef5b7 in H5PB_write (f_sh=f_sh@entry=0xcf0270, type=type@entry=H5FD_MEM_DRAW, addr=addr@entry=33558120,
    size=size@entry=1, buf=buf@entry=0x2aaaba5fb010) at H5PB.c:1031
#14 0x00000000004d9079 in H5F_shared_block_write (f_sh=0xcf0270, type=type@entry=H5FD_MEM_DRAW, addr=33558120, size=size@entry=1,
    buf=0x2aaaba5fb010) at H5Fio.c:205
#15 0x000000000073a113 in H5D__mpio_select_write (io_info=0x7fffffff82e0, type_info=<optimized out>, mpi_buf_count=1,
    file_space=<optimized out>, mem_space=<optimized out>) at H5Dmpio.c:490
#16 0x0000000000730e2b in H5D__final_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    mpi_buf_count=mpi_buf_count@entry=1, mpi_file_type=0xd6f800, mpi_buf_type=0xd70840) at H5Dmpio.c:2124
#17 0x0000000000736129 in H5D__link_chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    fm=fm@entry=0xd10790, sum_chunk=<optimized out>) at H5Dmpio.c:1234
#18 0x0000000000739b11 in H5D__chunk_collective_io (io_info=io_info@entry=0x7fffffff82e0, type_info=type_info@entry=0x7fffffff8260,
    fm=fm@entry=0xd10790) at H5Dmpio.c:883
#19 0x000000000073a519 in H5D__chunk_collective_write (io_info=0x7fffffff82e0, type_info=0x7fffffff8260, nelmts=<optimized out>,
    file_space=<optimized out>, mem_space=<optimized out>, fm=0xd10790) at H5Dmpio.c:960
#20 0x00000000004955ac in H5D__write (dataset=dataset@entry=0xcf4630, mem_type_id=mem_type_id@entry=216172782113783850,
    mem_space=0xce4ff0, file_space=0xce2ee0, buf=<optimized out>, buf@entry=0x2aaaba5fb010) at H5Dio.c:780
#21 0x00000000007038d8 in H5VL__native_dataset_write (obj=0xcf4630, mem_type_id=216172782113783850, mem_space_id=288230376151711748,
    file_space_id=288230376151711747, dxpl_id=<optimized out>, buf=0x2aaaba5fb010, req=0x0) at H5VLnative_dataset.c:206
#22 0x00000000006e36e2 in H5VL__dataset_write (obj=0xcf4630, cls=0xac3520, mem_type_id=mem_type_id@entry=216172782113783850,
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,
    dxpl_id=dxpl_id@entry=792633534417207318, buf=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2151
#23 0x00000000006ecaa5 in H5VL_dataset_write (vol_obj=vol_obj@entry=0xcf44d0, mem_type_id=mem_type_id@entry=216172782113783850,
    mem_space_id=mem_space_id@entry=288230376151711748, file_space_id=file_space_id@entry=288230376151711747,
    dxpl_id=dxpl_id@entry=792633534417207318, buf=buf@entry=0x2aaaba5fb010, req=0x0) at H5VLcallback.c:2185
#24 0x0000000000493d8f in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113783850, mem_space_id=288230376151711748,
    file_space_id=288230376151711747, dxpl_id=792633534417207318, buf=0x2aaaba5fb010) at H5Dio.c:313
#25 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98                                                               │#23 0x0000000000404096 in main (argc=1, argv=0x7fffffff8728) at test_ph5.c:98

#29

After finding that we could work around the crash mentioned above by setting the OpenMPI I/O backend to ompio, and that we had to use a release build of HDF5 to avoid the crash described in Crash when freeing user-provided buffer on filter callback, we discovered that the minimal test also crashes if we set

#define _COMPRESS
#define CHUNK1 256
#define NCHUNK1 8192
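For readers without the Gist at hand, the pattern the minimal test exercises is roughly the following. This is a sketch, not the exact test program: the dataset names, extents, chunk shape, and deflate level are illustrative assumptions. Build with h5pcc against a parallel HDF5 and run under mpirun with at least 2 ranks; the point is the create → collective write → create interleaving on compressed chunked datasets.

```c
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open the file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked + deflate-compressed dataset creation property list. */
    hsize_t dims[2]  = {4 * (hsize_t)size, 4194304};
    hsize_t chunk[2] = {4, 262144};
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 4);

    /* Collective transfer plist -- required for parallel filtered writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* Each rank selects its own 4-row slab of the file space. */
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hsize_t start[2] = {4 * (hsize_t)rank, 0};
    hsize_t count[2] = {4, dims[1]};
    hid_t mspace = H5Screate_simple(2, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    double *buf = malloc(count[0] * count[1] * sizeof(double));
    for (hsize_t i = 0; i < count[0] * count[1]; i++)
        buf[i] = (double)rank;

    /* create -> write -> barrier -> create: the interleaving this thread
     * is about.  The second H5Dcreate2 is where the errors appear. */
    hid_t d1 = H5Dcreate2(file, "dataset1", H5T_NATIVE_DOUBLE, fspace,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(d1, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);
    MPI_Barrier(MPI_COMM_WORLD);
    hid_t d2 = H5Dcreate2(file, "dataset2", H5T_NATIVE_DOUBLE, fspace,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(d2, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    H5Dclose(d1); H5Dclose(d2);
    H5Sclose(mspace); H5Sclose(fspace);
    H5Pclose(dcpl); H5Pclose(dxpl); H5Pclose(fapl);
    H5Fclose(file);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Moving the second H5Dcreate2 before the first H5Dwrite makes the error go away, per the earlier observation in this thread.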

This is the 1.12 branch with the patch from @jhenderson, both as a debug and as a release build:

rank=1 writing dataset2
rank=3 writing dataset2
rank=2 writing dataset2
rank=0 writing dataset2
[cdr1042:149555:0:149555] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1211e95c)
==== backtrace ====
 0 0x0000000000033280 killpg()  ???:0
 1 0x0000000000145c24 __memcpy_avx512_unaligned_erms()  ???:0
 2 0x000000000006ac5c opal_generic_simple_pack()  ???:0
 3 0x00000000000040cf ompi_mtl_psm2_isend()  ???:0
 4 0x00000000001c772b mca_pml_cm_isend()  pml_cm.c:0
 5 0x000000000011a1ef shuffle_init.isra.1()  fcoll_dynamic_gen2_file_write_all.c:0
 6 0x000000000011c11b mca_fcoll_dynamic_gen2_file_write_all()  ???:0
 7 0x00000000000bcd7e mca_common_ompio_file_write_at_all()  ???:0
 8 0x0000000000159b96 mca_io_ompio_file_write_at_all()  ???:0
 9 0x00000000000958a8 PMPI_File_write_at_all()  ???:0
10 0x000000000073d5b6 H5FD__mpio_write()  /scratch/rickn/hdf5/src/H5FDmpio.c:1466
11 0x00000000004f2424 H5FD_write()  /scratch/rickn/hdf5/src/H5FDint.c:248
12 0x000000000077eeb6 H5F__accum_write()  /scratch/rickn/hdf5/src/H5Faccum.c:826
13 0x00000000005ef5c8 H5PB_write()  /scratch/rickn/hdf5/src/H5PB.c:1031
14 0x00000000004d92d0 H5F_block_write()  /scratch/rickn/hdf5/src/H5Fio.c:251
15 0x000000000044d0ea H5C__flush_single_entry()  /scratch/rickn/hdf5/src/H5C.c:6109
16 0x000000000072d01b H5C__flush_candidates_in_ring()  /scratch/rickn/hdf5/src/H5Cmpio.c:1372
17 0x000000000072d989 H5C__flush_candidate_entries()  /scratch/rickn/hdf5/src/H5Cmpio.c:1193
18 0x000000000072f603 H5C_apply_candidate_list()  /scratch/rickn/hdf5/src/H5Cmpio.c:386
19 0x000000000072ace3 H5AC__propagate_and_apply_candidate_list()  /scratch/rickn/hdf5/src/H5ACmpio.c:1276
20 0x000000000072af40 H5AC__rsp__dist_md_write__flush_to_min_clean()  /scratch/rickn/hdf5/src/H5ACmpio.c:1835
21 0x000000000072cc0c H5AC__run_sync_point()  /scratch/rickn/hdf5/src/H5ACmpio.c:2157
22 0x0000000000422a89 H5AC_unprotect()  /scratch/rickn/hdf5/src/H5AC.c:1568
23 0x000000000075006b H5B__insert_helper()  /scratch/rickn/hdf5/src/H5B.c:1101
24 0x00000000007507fc H5B__insert_helper()  /scratch/rickn/hdf5/src/H5B.c:998
25 0x0000000000750e1f H5B_insert()  /scratch/rickn/hdf5/src/H5B.c:596
26 0x0000000000753dde H5D__btree_idx_insert()  /scratch/rickn/hdf5/src/H5Dbtree.c:1009
27 0x0000000000735772 H5D__link_chunk_filtered_collective_io()  /scratch/rickn/hdf5/src/H5Dmpio.c:1462
28 0x0000000000739abe H5D__chunk_collective_io()  /scratch/rickn/hdf5/src/H5Dmpio.c:878
29 0x000000000073a52a H5D__chunk_collective_write()  /scratch/rickn/hdf5/src/H5Dmpio.c:960
30 0x00000000004955bd H5D__write()  /scratch/rickn/hdf5/src/H5Dio.c:780
31 0x00000000007038e9 H5VL__native_dataset_write()  /scratch/rickn/hdf5/src/H5VLnative_dataset.c:206
32 0x00000000006e36f3 H5VL__dataset_write()  /scratch/rickn/hdf5/src/H5VLcallback.c:2151
33 0x00000000006ecab6 H5VL_dataset_write()  /scratch/rickn/hdf5/src/H5VLcallback.c:2185
34 0x0000000000493da0 H5Dwrite()  /scratch/rickn/hdf5/src/H5Dio.c:313
35 0x0000000000404183 main()  /scratch/rickn/test_hdf5/test_orig.c:111
36 0x00000000000202e0 __libc_start_main()  ???:0
37 0x0000000000403c5a _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================

#30

With the release of HDF5 1.10.7, I wanted to run the tests again to see if anything had changed. I found that the minimal test from the top of this thread fails with both ompio and romio321:

$ mpirun -np 4 --mca io ompio ./testh5
MPI rank [0/4]
rank=0 creating file
MPI rank [1/4]
rank=1 creating file
MPI rank [2/4]
rank=2 creating file
MPI rank [3/4]
rank=3 creating file
rank=0 creating selection [0:4, 0:4194304]
rank=1 creating selection [4:8, 0:4194304]
rank=2 creating selection [8:12, 0:4194304]
rank=3 creating selection [12:16, 0:4194304]
rank=2 creating dataset1
rank=0 creating dataset1
rank=1 creating dataset1
rank=3 creating dataset1
rank=0 writing dataset1
rank=2 writing dataset1
rank=3 writing dataset1
rank=1 writing dataset1
rank=2 finished writing dataset1
rank=2 creating dataset2
rank=0 finished writing dataset1
rank=0 creating dataset2
rank=3 finished writing dataset1
rank=3 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
  #000: H5D.c line 152 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 338 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1605 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1846 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 579 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1118 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007: H5Gobj.c line 324 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #008: H5Omessage.c line 873 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #009: H5Oint.c line 1056 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #010: H5AC.c line 1517 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #011: H5C.c line 2454 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #012: H5C.c line 2454 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=2 writing dataset2
rank=1 finished writing dataset1
rank=1 creating dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
  #000: H5D.c line 152 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 338 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1605 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1846 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 579 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1118 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007: H5Gobj.c line 324 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #008: H5Omessage.c line 873 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #009: H5Oint.c line 1056 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #010: H5AC.c line 1517 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #011: H5C.c line 2454 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #012: H5C.c line 2454 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
  #000: H5Dio.c line 313 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
  #000: H5D.c line 334 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
  #000: H5Dio.c line 313 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
  #000: H5D.c line 334 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything
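The MPI_ERR_TRUNCATE at the bottom of each HDF5-DIAG stack fits the earlier observation that entry_ptr->coll_access differs across ranks, so only some processes enter the MPI_Bcast in H5C_protect: the broadcasts pair up wrongly, and a rank ends up receiving a message larger than the buffer it posted. A standalone illustration of that symptom (not HDF5 code; this deliberately mismatches the counts, and with Open MPI the receiver typically reports MPI_ERR_TRUNCATE; needs mpirun -np 2 to run):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[4] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors instead of aborting, so we can print them. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Root broadcasts 4 ints, but rank 1 only posts room for 2 --
     * analogous to ranks disagreeing about which cache entry is
     * being broadcast. */
    int count = (rank == 0) ? 4 : 2;
    int err = MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: %s\n", rank, msg);
    }
    MPI_Finalize();
    return 0;
}
```

Once one rank's H5Dcreate2 fails this way, the later "dset_id is not a dataset ID" errors from H5Dwrite and H5Dclose are just fallout from the invalid handle.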