Crash when writing parallel compressed chunks

jrichardshaw · September 24, 2019, 3:02pm

I’m finding crashes when I try to write compressed datasets in parallel with the MPIO driver. I have produced a (fairly simple) test case to reproduce the issues I’m having. As I can’t upload attachments (new user) I’ve made a gist here: https://gist.github.com/jrs65/97e36592785d3db2729a8ed20521eaa6

The test does the following things:

1. Creates a file with the MPIO driver
2. Creates a 2D chunked dataset with gzip compression (chunks evenly split the dataset into 4x32 chunks)
3. Collectively writes random data into it (distributed across the first axis). This means each rank will write across one chunk in the first dimension and 32 whole chunks in the second.
4. Creates a second dataset (identical) except for the name
5. Writes more random data into the new dataset
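To make the write pattern concrete, here is a minimal sketch of the decomposition arithmetic only (no HDF5 or MPI calls). The struct and function names (row_slab, rank_rows, chunks_touched) are hypothetical and not taken from the gist, and it assumes the first axis divides evenly among the ranks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the rows one rank writes along the first axis. */
typedef struct {
    size_t start; /* first row owned by this rank */
    size_t count; /* number of rows owned by this rank */
} row_slab;

/* Split nrows evenly across nranks (assumes nrows is a multiple of nranks). */
row_slab rank_rows(size_t nrows, int nranks, int rank)
{
    row_slab s;

    assert(nranks > 0 && rank >= 0 && rank < nranks);
    assert(nrows % (size_t)nranks == 0);

    s.count = nrows / (size_t)nranks;
    s.start = s.count * (size_t)rank;
    return s;
}

/* Count the chunks touched by a slab of full-width rows. */
size_t chunks_touched(row_slab s, size_t ncols, size_t chunk_rows, size_t chunk_cols)
{
    size_t first, last, col_chunks;

    assert(chunk_rows > 0 && chunk_cols > 0 && s.count > 0);

    first = s.start / chunk_rows;                      /* first chunk row hit */
    last = (s.start + s.count - 1) / chunk_rows;       /* last chunk row hit */
    col_chunks = (ncols + chunk_cols - 1) / chunk_cols; /* chunks across the long axis */
    return (last - first + 1) * col_chunks;
}
```

With illustrative sizes of 64 rows split over 4 ranks, rank_rows(64, 4, 2) gives rows 32 to 47. If the chunks are 16 rows tall and the long axis holds 32 chunks, chunks_touched reports 32 chunks for that rank, matching the pattern described in step 3.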

When running on 4 MPI processes the first two ranks exit attempting step 4 with the traceback (I’ve trimmed to just the end):

  #015: H5Dchunk.c line 4727 in H5D__chunk_collective_fill(): unable to write raw data to file
    major: Low-level I/O
    minor: Write failed
  #016: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #017: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #018: H5Faccum.c line 826 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #019: H5FDint.c line 258 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #020: H5FDmpio.c line 1807 in H5FD_mpio_write(): MPI_File_set_view failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #021: H5FDmpio.c line 1807 in H5FD_mpio_write(): MPI_ERR_TYPE: invalid datatype
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
    minor: Write failed
  #019: H5FDint.c line 258 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #020: H5FDmpio.c line 1807 in H5FD_mpio_write(): MPI_File_set_view failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #021: H5FDmpio.c line 1807 in H5FD_mpio_write(): MPI_ERR_TYPE: invalid datatype
    major: Internal error (too specific to document in detail)
    minor: MPI Error String

The crashes only occur under certain conditions, which are set in the test case.

There must be two datasets created, one is not enough.
The second dataset crashes only if an actual write is performed to the first.
The write must be large enough. By that I mean two things I’ve tested:

The dataset must be larger than some size that I haven’t pinned down.
The actual data written must be complex, i.e. writing zeros (which compress too well) does not cause a crash.

The chunk size must be smaller than the write selection size by some undetermined amount. Reducing the chunk size by powers of two only causes a crash once the chunk size is 32x smaller than the total write selection size if the dataset size along the long dimension is 2^20, but 16x smaller if it is 64.

I’ve tested this in HDF5 1.10.5 using OpenMPI 4.0.1, it crashes using either ompio or romio as the MPIO backends. I have reproduced this issue on both macOS and Linux.

The test case I’ve attached is in C. I originally found the issue when using h5py, and I found exactly the same issue using both the h5py high and low level interfaces. If it’s any use, I could supply those test cases too, though I think I’ve faithfully translated it into the C API.

I think this is a bug, although it’s possible I’m attempting an IO pattern which is not supported and is not caught cleanly. In that case I’d very much appreciate any guidance about what is possible, as the documentation for writing filtered parallel datasets is not very clear.

Thanks!

jhenderson · September 24, 2019, 6:06pm

Hi @jrichardshaw,

Is it possible that you have two different versions of OpenMPI in your system path, so that the one HDF5 was built with conflicts with the one that gets loaded when running the application? I quickly tried this under Linux using combinations of HDF5 1.10.5 and the development branch, along with OpenMPI 3.0.0 and 4.0.1, and all combinations seemed to work just fine.

jrichardshaw · September 24, 2019, 6:33pm

Hi @jhenderson, thanks for your help.

I’m fairly sure my environment is clean. I’m running both on my local macOS laptop and on a Linux cluster reproducing the same issue, and on the latter I’ve been careful with my module environment.

With all that said, I couldn’t reproduce the issue on the Linux cluster with the exact file I sent you. That one failed reliably on macOS, but on Linux I needed to increase the size to be much larger (apologies). If you can, try changing the definition of LONGSIZE to (1 << 20); I’ve also updated the gist accordingly. This is reliably reproducing the issue on the cluster. On my local machine I could get away with it being only 128 and the failure still occurred, so there must be some platform/config difference that’s causing issues.

I just built and ran with a very naive:

$ h5pcc -g -o testh5 test_ph5.c
$ mpirun -np 4 ./testh5

I’ll try and check a few more environments later this afternoon to double check how widely I can reproduce it.

jrichardshaw · September 24, 2019, 6:41pm

As a quick check in the meantime, I’ve just verified that my HDF5 build and my environment agree on which OpenMPI installation they are using (using ldd for the former, and checking the mpirun path for the latter), and they do. So I don’t think a mismatch is causing any problems. Thanks!

jhenderson · September 24, 2019, 6:54pm

Hmm, entertainingly enough even the latest gist appears to pass with each of my separate builds. Perhaps I need to keep increasing the size until I see a problem. One key thing that I’ve noticed with the feature is that users have tended to expose the bugs with it by repeatedly performing a dataset write over and over, sometimes nearly a hundred times before an issue occurs. For example, a bug was exposed recently with an MPI communicator handle not being freed, but it only occurred once all available handles were taken up after several writes.

jrichardshaw · September 24, 2019, 10:08pm

I’ve just checked with my colleague Tristan, and got him to reproduce this issue on his own machine with the test code above. He’s running Arch, and has OpenMPI 4.0.1 and HDF5 1.10.5 installed as well.

Do you have a standard environment you use for testing parallel HDF5? I can attempt to find a way of reproducing it in that environment if you do.

Thanks!

wkliao · September 25, 2019, 12:43am

Hi, @jrichardshaw
I tested your program using MPICH 3.3 and HDF5 1.10.5. I got a similar error message below which gives more detailed information.
" #021: …/…/hdf5-1.10.5/src/H5FDmpio.c line 1807 in H5FD_mpio_write(): Other I/O error , error stack:
ADIO_Set_view(48): **iobadoverlap displacements of filetype must be in a monotonically nondecreasing order"

This error was reported by the MPI library. It means that the MPI datatype created internally by HDF5 and used in MPI_File_set_view violates the MPI standard requirement that the flattened file offsets must be monotonically nondecreasing. I believe OpenMPI does the same checking and returns an error.
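As an illustration of the ordering rule described above, the following sketch scans an array of byte displacements and reports the first entry that breaks the nondecreasing order. It is not code from HDF5 or from any MPI library: the function name first_disp_violation is hypothetical, and int64_t stands in for MPI_Aint so the example compiles without MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first displacement that is smaller than its
 * predecessor, or n if the whole array is monotonically nondecreasing.
 * int64_t is an assumed stand-in for MPI_Aint.
 */
size_t first_disp_violation(const int64_t *disp, size_t n)
{
    size_t i;

    assert(disp != NULL || n == 0);

    for (i = 1; i < n; i++) {
        if (disp[i] < disp[i - 1])
            return i; /* offsets go backwards here */
    }
    return n; /* ordering requirement satisfied */
}
```

A check like this could be run on the displacement array just before it is handed to the call that builds the file datatype, to confirm whether out-of-order offsets are what the MPI library is rejecting.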

jhenderson · September 25, 2019, 1:16am

@wkliao This information actually seems to correspond with a known issue that has cropped up in the past but was notoriously difficult to reproduce. @jrichardshaw As far as a ‘standard environment’ for testing, in general I just use a fairly normal CentOS 6/7 system for testing parallel HDF5. Could you or @wkliao try building a debug version of HDF5 with asserts enabled? If you do this, then you should see an assertion failure like the following:

H5Dchunk.c:4876: H5D__chunk_collective_fill: Assertion `chunk_disp_array[i] > chunk_disp_array[i - 1]' failed.

If this is the case, then we are definitely aware of the issue but have not yet made efforts towards fixing it. It is, however, affecting another feature and this could help make the issue a higher priority.

wkliao · September 25, 2019, 1:48am

@jhenderson
I tested again with the debug version of HDF5 and got the exact error message you mentioned. My test machine is a Redhat Enterprise 6.10.

jhenderson · September 25, 2019, 2:52am

@wkliao and @jrichardshaw, could you try the attached patch I generated against the 1.10.5 release branch? Note that it’s a quick hack, so I wouldn’t consider it production quality or anything, but I’m interested to see if this approach fixes the issue that you’re encountering.

H5Dchunk_assertion.patch (1.6 KB)

steven · September 25, 2019, 12:30pm

I was able to reproduce this issue on an AWS-based HPC cluster (3 nodes x m5d.2xlarge):

Open MPI: 4.1.0a1
HDF5 Version: 1.11.6 Default API mapping: v112
slurm 19.05.1-2
pmix: 3.1.3rc4
3 nodes x m5d.2xlarge AWS instance

Error message from srun -n 4 ./testh5:

[master:06075] Read -1, expected 11776, errno = 38
[master:06076] Read -1, expected 11776, errno = 38
[master:06075] Read -1, expected 11776, errno = 38
[node01:06103] Read -1, expected 11776, errno = 38
testh5: H5Dchunk.c:4876: H5D__chunk_collective_fill: Assertion `chunk_disp_array[i] > chunk_disp_array[i - 1]' failed.
testh5: H5Dchunk.c:4876: H5D__chunk_collective_fill: Assertion `chunk_disp_array[i] > chunk_disp_array[i - 1]' failed.
[master:06076] *** Process received signal ***
[master:06075] *** Process received signal ***
[master:06075] Signal: Aborted (6)
[master:06075] Signal code:  (-6)
[master:06076] Signal: Aborted (6)
[master:06076] Signal code:  (-6)
[master:06076] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f70719d7890]
[master:06075] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f3b22f97890]
[master:06075] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3b22bd2e97]
[master:06076] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f7071612e97]
[master:06075] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3b22bd4801]
[master:06076] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f7071614801]
[master:06075] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x3039a)[0x7f3b22bc439a]
[master:06076] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x3039a)[0x7f707160439a]
[master:06075] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30412)[0x7f3b22bc4412]
[master:06076] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30412)[0x7f7071604412]
[master:06076] [ 5] /home/steven/./testh5(+0x38a643)[0x5601833dd643]
[master:06075] [ 5] /home/steven/./testh5(+0x38a643)[0x55ae770de643]
[master:06076] [ 6] /home/steven/./testh5(+0x398646)[0x5601833eb646]
[master:06076] [ 7] /home/steven/./testh5(+0x9b08a)[0x5601830ee08a]
[master:06076] [ 8] /home/steven/./testh5(+0xa04ea)[0x5601830f34ea]
[master:06076] [ 9] /home/steven/./testh5(+0xa922f)[0x5601830fc22f]
[master:06076] [10] /home/steven/./testh5(+0x99e16)[0x5601830ece16]
[master:06075] [ 6] /home/steven/./testh5(+0x398646)[0x55ae770ec646]
[master:06075] [ 7] /home/steven/./testh5(+0x9b08a)[0x55ae76def08a]
[master:06075] [ 8] /home/steven/./testh5(+0xa04ea)[0x55ae76df44ea]
[master:06075] [ 9] /home/steven/./testh5(+0xa922f)[0x55ae76dfd22f]
[master:06075] [10] /home/steven/./testh5(+0x99e16)[0x55ae76dede16]
[master:06076] [11] /home/steven/./testh5(+0x9c4e5)[0x5601830ef4e5]
[master:06076] [12] /home/steven/./testh5(+0x3ad615)[0x560183400615]
[master:06076] [13] /home/steven/./testh5(+0x1b86a1)[0x56018320b6a1]
[master:06076] [14] /home/steven/./testh5(+0x17dbaa)[0x5601831d0baa]
[master:06076] [15] /home/steven/./testh5(+0x13940c)[0x56018318c40c]
[master:06076] [16] /home/steven/./testh5(+0x13a1a4)[0x56018318d1a4]
[master:06076] [17] /home/steven/./testh5(+0x176300)[0x5601831c9300]
[master:06076] [18] /home/steven/./testh5(+0x17fbaa)[0x5601831d2baa]
[master:06076] [19] /home/steven/./testh5(+0x9b70f)[0x5601830ee70f]
[master:06075] [11] /home/steven/./testh5(+0x9c4e5)[0x55ae76df04e5]
[master:06075] [12] /home/steven/./testh5(+0x3ad615)[0x55ae77101615]
[master:06075] [13] /home/steven/./testh5(+0x1b86a1)[0x55ae76f0c6a1]
[master:06075] [14] /home/steven/./testh5(+0x17dbaa)[0x55ae76ed1baa]
[master:06076] [20] /home/steven/./testh5(+0x32f193)[0x560183382193]
[master:06076] [21] /home/steven/./testh5(+0x3123d2)[0x5601833653d2]
[master:06076] [22] /home/steven/./testh5(+0x31ac7f)[0x56018336dc7f]
[master:06076] [23] /home/steven/./testh5(+0x8f358)[0x5601830e2358]
[master:06076] [24] /home/steven/./testh5(+0xd12a)[0x56018306012a]
[master:06075] [15] /home/steven/./testh5(+0x13940c)[0x55ae76e8d40c]
[master:06075] [16] /home/steven/./testh5(+0x13a1a4)[0x55ae76e8e1a4]
[master:06075] [17] /home/steven/./testh5(+0x176300)[0x55ae76eca300]
[master:06075] [18] /home/steven/./testh5(+0x17fbaa)[0x55ae76ed3baa]
[master:06075] [19] /home/steven/./testh5(+0x9b70f)[0x55ae76def70f]
[master:06076] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f70715f5b97]
[master:06076] [26] /home/steven/./testh5(+0xcbea)[0x56018305fbea]
[master:06076] *** End of error message ***
[master:06075] [20] /home/steven/./testh5(+0x32f193)[0x55ae77083193]
[master:06075] [21] /home/steven/./testh5(+0x3123d2)[0x55ae770663d2]
[master:06075] [22] /home/steven/./testh5(+0x31ac7f)[0x55ae7706ec7f]
[master:06075] [23] /home/steven/./testh5(+0x8f358)[0x55ae76de3358]
[master:06075] [24] /home/steven/./testh5(+0xd12a)[0x55ae76d6112a]
[master:06075] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f3b22bb5b97]
[master:06075] [26] /home/steven/./testh5(+0xcbea)[0x55ae76d60bea]
[master:06075] *** End of error message ***
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node01
  Local PID:  6103
  Peer host:  master
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node01
  Local PID:  6104
  Peer host:  master
--------------------------------------------------------------------------
srun: error: master: tasks 0-1: Aborted
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: node01: tasks 2-3: Killed

jrichardshaw · September 25, 2019, 4:54pm

@jhenderson I’ve applied and tested your patch against HDF5 1.10.5, and while it no longer crashes, it now hangs at the same stage. What seems to be happening is that the processes have gotten out of sync: some hit an MPI section and others didn’t. I’ve taken stack traces of each process and put them below. Apologies for the verbose dump!

Updated with debugging output

Rank 0 (rank 1 is very similar, with just a deeper set of msort calls, probably dependent on when I sample):

#0  0x00007f42cd583b60 in memcpy@GLIBC_2.2.5 () from /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/lib/libc.so.6
#1  0x00007f42cd534ce2 in msort_with_tmp.part.0 () from /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/lib/libc.so.6
#2  0x00007f42cd534cf8 in msort_with_tmp.part.0 () from /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/lib/libc.so.6
#3  0x00007f42cd534ce2 in msort_with_tmp.part.0 () from /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/lib/libc.so.6
#4  0x00007f42cd53524f in qsort_r () from /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/lib/libc.so.6
#5  0x00000000004df4f0 in H5D__chunk_collective_fill (dset=0x1d66dc0, chunk_info=0x7ffe63a73530, chunk_size=531, fill_buf=0x1eb8ac0) at H5Dchunk.c:4683
#6  0x00000000004de2f6 in H5D__chunk_allocate (io_info=0x7ffe63a73b80, full_overwrite=false, old_dim=0x7ffe63a739d0) at H5Dchunk.c:4402
#7  0x000000000050fdaf in H5D__init_storage (io_info=0x7ffe63a73b80, full_overwrite=false, old_dim=0x7ffe63a739d0) at H5Dint.c:2421
#8  0x000000000050f7ea in H5D__alloc_storage (io_info=0x7ffe63a73b80, time_alloc=H5D_ALLOC_CREATE, full_overwrite=false, old_dim=0x0) at H5Dint.c:2334
#9  0x000000000051cb6b in H5D__layout_oh_create (file=0x1d534b0, oh=0x1db43f0, dset=0x1d66dc0, dapl_id=720575940379279367) at H5Dlayout.c:507
#10 0x000000000050938b in H5D__update_oh_info (file=0x1d534b0, dset=0x1d66dc0, dapl_id=720575940379279367) at H5Dint.c:976
#11 0x000000000050aa0d in H5D__create (file=0x1d534b0, type_id=216172782113783850, space=0x1d57bf0, dcpl_id=720575940379279377, dapl_id=720575940379279367) at H5Dint.c:1277
#12 0x000000000093059d in H5O__dset_create (f=0x1d534b0, _crt_info=0x7ffe63a748d0, obj_loc=0x7ffe63a73f80) at H5Doh.c:299
#13 0x00000000006b6519 in H5O_obj_create (f=0x1d534b0, obj_type=H5O_TYPE_DATASET, crt_info=0x7ffe63a748d0, obj_loc=0x7ffe63a73f80) at H5Oint.c:2452
#14 0x000000000065e722 in H5L__link_cb (grp_loc=0x7ffe63a74640, name=0x7ffe63a741a0 "dset2", lnk=0x0, obj_loc=0x0, _udata=0x7ffe63a747c0, own_loc=0x7ffe63a745ac) at H5L.c:1603
#15 0x00000000005f6a2e in H5G__traverse_real (_loc=0x7ffe63a74980, name=0x97c9ee "dset2", target=0, op=0x65e406 <H5L__link_cb>, op_data=0x7ffe63a747c0) at H5Gtraverse.c:626
#16 0x00000000005f78a1 in H5G_traverse (loc=0x7ffe63a74980, name=0x97c9ee "dset2", target=0, op=0x65e406 <H5L__link_cb>, op_data=0x7ffe63a747c0) at H5Gtraverse.c:850
#17 0x000000000065f30d in H5L__create_real (link_loc=0x7ffe63a74980, link_name=0x97c9ee "dset2", obj_path=0x0, obj_file=0x0, lnk=0x7ffe63a74850, ocrt_info=0x7ffe63a748f0, lcpl_id=720575940379279374) at H5L.c:1797
#18 0x000000000065e3a8 in H5L_link_object (new_loc=0x7ffe63a74980, new_name=0x97c9ee "dset2", ocrt_info=0x7ffe63a748f0, lcpl_id=720575940379279374) at H5L.c:1556
#19 0x0000000000505f0d in H5D__create_named (loc=0x7ffe63a74980, name=0x97c9ee "dset2", type_id=216172782113783850, space=0x1d57bf0, lcpl_id=720575940379279374, dcpl_id=720575940379279377, dapl_id=720575940379279367) at H5Dint.c:328
#20 0x00000000004bf947 in H5Dcreate2 (loc_id=72057594037927936, name=0x97c9ee "dset2", type_id=216172782113783850, space_id=288230376151711746, lcpl_id=720575940379279374, dcpl_id=720575940379279377, dapl_id=720575940379279367) at H5D.c:144
#21 0x0000000000404003 in main (argc=1, argv=0x7ffe63a74b98) at test_ph5.c:81

Rank 2 (again rank 3 similar):

#0  0x00007f4db850bb46 in psm2_mq_ipeek2 () from /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/libpsm2.so.2
#1  0x00007f4db8757409 in ompi_mtl_psm2_progress () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/openmpi/mca_mtl_psm2.so
#2  0x00007f4dbd0a4e0b in opal_progress () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libopen-pal.so.40
#3  0x00007f4dbdc46435 in ompi_request_default_wait () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#4  0x00007f4dbdca6303 in ompi_coll_base_sendrecv_actual () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#5  0x00007f4dbdca6739 in ompi_coll_base_allreduce_intra_recursivedoubling () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#6  0x00007f4dbdc5a5b8 in PMPI_Allreduce () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#7  0x00007f4dbdd2aafc in mca_io_romio_dist_MPI_File_set_view () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#8  0x00007f4dbdcfa9ab in mca_io_romio321_file_set_view () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#9  0x00007f4dbdc6ad68 in PMPI_File_set_view () from /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc8/openmpi/4.0.1/lib/libmpi.so.40
#10 0x000000000090aa05 in H5FD_mpio_write (_file=0x2aee1d0, type=H5FD_MEM_DRAW, dxpl_id=720575940379279368, addr=0, size=1, buf=0x2c52b20) at H5FDmpio.c:1806
#11 0x00000000005b6669 in H5FD_write (file=0x2aee1d0, type=H5FD_MEM_DRAW, addr=0, size=1, buf=0x2c52b20) at H5FDint.c:257
#12 0x0000000000933e3b in H5F__accum_write (f=0x2aee360, map_type=H5FD_MEM_DRAW, addr=0, size=1, buf=0x2c52b20) at H5Faccum.c:825
#13 0x000000000073df2c in H5PB_write (f=0x2aee360, type=H5FD_MEM_DRAW, addr=0, size=1, buf=0x2c52b20) at H5PB.c:1027
#14 0x00000000005885f4 in H5F_block_write (f=0x2aee360, type=H5FD_MEM_DRAW, addr=0, size=1, buf=0x2c52b20) at H5Fio.c:164
#15 0x00000000004dfa65 in H5D__chunk_collective_fill (dset=0x2b012e0, chunk_info=0x7ffc343d85a0, chunk_size=531, fill_buf=0x2c52b20) at H5Dchunk.c:4731
#16 0x00000000004de2f6 in H5D__chunk_allocate (io_info=0x7ffc343d8bf0, full_overwrite=false, old_dim=0x7ffc343d8a40) at H5Dchunk.c:4402
#17 0x000000000050fdaf in H5D__init_storage (io_info=0x7ffc343d8bf0, full_overwrite=false, old_dim=0x7ffc343d8a40) at H5Dint.c:2421
#18 0x000000000050f7ea in H5D__alloc_storage (io_info=0x7ffc343d8bf0, time_alloc=H5D_ALLOC_CREATE, full_overwrite=false, old_dim=0x0) at H5Dint.c:2334
#19 0x000000000051cb6b in H5D__layout_oh_create (file=0x2aee360, oh=0x2b4dac0, dset=0x2b012e0, dapl_id=720575940379279367) at H5Dlayout.c:507
#20 0x000000000050938b in H5D__update_oh_info (file=0x2aee360, dset=0x2b012e0, dapl_id=720575940379279367) at H5Dint.c:976
#21 0x000000000050aa0d in H5D__create (file=0x2aee360, type_id=216172782113783850, space=0x2af2790, dcpl_id=720575940379279377, dapl_id=720575940379279367) at H5Dint.c:1277
#22 0x000000000093059d in H5O__dset_create (f=0x2aee360, _crt_info=0x7ffc343d9940, obj_loc=0x7ffc343d8ff0) at H5Doh.c:299
#23 0x00000000006b6519 in H5O_obj_create (f=0x2aee360, obj_type=H5O_TYPE_DATASET, crt_info=0x7ffc343d9940, obj_loc=0x7ffc343d8ff0) at H5Oint.c:2452
#24 0x000000000065e722 in H5L__link_cb (grp_loc=0x7ffc343d96b0, name=0x7ffc343d9210 "dset2", lnk=0x0, obj_loc=0x0, _udata=0x7ffc343d9830, own_loc=0x7ffc343d961c) at H5L.c:1603
#25 0x00000000005f6a2e in H5G__traverse_real (_loc=0x7ffc343d99f0, name=0x97c9ee "dset2", target=0, op=0x65e406 <H5L__link_cb>, op_data=0x7ffc343d9830) at H5Gtraverse.c:626
#26 0x00000000005f78a1 in H5G_traverse (loc=0x7ffc343d99f0, name=0x97c9ee "dset2", target=0, op=0x65e406 <H5L__link_cb>, op_data=0x7ffc343d9830) at H5Gtraverse.c:850
#27 0x000000000065f30d in H5L__create_real (link_loc=0x7ffc343d99f0, link_name=0x97c9ee "dset2", obj_path=0x0, obj_file=0x0, lnk=0x7ffc343d98c0, ocrt_info=0x7ffc343d9960, lcpl_id=720575940379279374) at H5L.c:1797
#28 0x000000000065e3a8 in H5L_link_object (new_loc=0x7ffc343d99f0, new_name=0x97c9ee "dset2", ocrt_info=0x7ffc343d9960, lcpl_id=720575940379279374) at H5L.c:1556
#29 0x0000000000505f0d in H5D__create_named (loc=0x7ffc343d99f0, name=0x97c9ee "dset2", type_id=216172782113783850, space=0x2af2790, lcpl_id=720575940379279374, dcpl_id=720575940379279377, dapl_id=720575940379279367) at H5Dint.c:328
#30 0x00000000004bf947 in H5Dcreate2 (loc_id=72057594037927936, name=0x97c9ee "dset2", type_id=216172782113783850, space_id=288230376151711746, lcpl_id=720575940379279374, dcpl_id=720575940379279377, dapl_id=720575940379279367) at H5D.c:144
#31 0x0000000000404003 in main (argc=1, argv=0x7ffc343d9c08) at test_ph5.c:81

Hope that’s of some use. Let me know if there’s anything more I can help test out.

jrichardshaw · September 25, 2019, 4:58pm

Also, I clearly forgot to build in debug mode. I’ll rebuild now and reattach the stack traces if they have anything more useful in them.

jhenderson · September 25, 2019, 6:16pm

From those traces, it looks like rank 0 and 1 allocated chunks out of address order and got stuck in a sort loop due to a naive mistake I just spotted in the patch I uploaded. Ranks 2 and 3, however, seem to have allocated chunks in the correct address order and went on to try doing the collective chunk write. Here’s a new patch that hopefully fixes that issue.

H5Dchunk_assertion.patch (2.6 KB)

Meanwhile, I’m going to see if I can find some system we can test this on where I can get the same results. It can be a bit non-deterministic as it relies on the chunks being allocated out of order and I’m not yet certain on a way to force this condition. Thankfully, the issue seems reproducible on a wide variety of systems, so this shouldn’t be too difficult.
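For readers without access to the patch, the general idea of restoring address order before building the file datatype can be sketched as below. This is not the contents of H5Dchunk_assertion.patch: the chunk_block struct and both function names are hypothetical, and int64_t stands in for the file address type. It only shows sorting (address, length) pairs with qsort so that the lengths stay attached to their addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pairing of a chunk's file address with its length in bytes. */
typedef struct {
    int64_t addr; /* file offset of the chunk (stand-in for the real address type) */
    int32_t len;  /* number of bytes to write at that offset */
} chunk_block;

/* qsort comparator: order blocks by ascending file address. */
int cmp_chunk_addr(const void *a, const void *b)
{
    const chunk_block *x = (const chunk_block *)a;
    const chunk_block *y = (const chunk_block *)b;

    /* Avoids the overflow a plain subtraction could cause. */
    return (x->addr > y->addr) - (x->addr < y->addr);
}

/* Sort blocks in place so that addresses are nondecreasing. */
void sort_chunks_by_addr(chunk_block *blocks, size_t n)
{
    assert(blocks != NULL || n == 0);

    if (n > 1)
        qsort(blocks, n, sizeof *blocks, cmp_chunk_addr);
}
```

Sorting structs rather than a bare address array keeps each length paired with its address, which matters if the displacement and block-length arrays are later filled from the same sorted sequence.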

jrichardshaw · September 25, 2019, 7:06pm

Bingo. That patch works! Thanks @jhenderson

jrichardshaw · October 11, 2019, 5:58pm

@jhenderson as it sounded like a known bug, is there a JIRA issue for this one? I just want to keep track of it so I know when it has been merged. We’re currently using a patched version and I want to know when we can move back to a released version. Thanks again!

jhenderson · October 11, 2019, 7:22pm

Hi @jrichardshaw,

The JIRA issue we have for this is: HDFFV-10792.

I have a slightly revised set of changes that’s currently sitting in a PR for the HDF5 development branch, wherein we’re discussing coming up with a test/modifying what’s given here to be a test for the issue. However, I think it would be safe to assume that the fix for this issue will make it into the upcoming HDF5 release. Once that happens, we will probably need to work on assessing how much the newly-introduced sort call affects parallel performance and whether we should be trying to tackle the issue at different layers within the library so that we don’t get to the point where we need the sort.

jrichardshaw · January 25, 2020, 12:08am

@jhenderson I’ve just revisited this one as we’ve been trying to get our code running with your patch and I’ve found that with slight changes to the chunking parameters it still crashes. I’ve updated the test code I used above very slightly to change the chunk params (https://gist.github.com/jrs65/97e36592785d3db2729a8ed20521eaa6).

The various sets of parameters used and their behaviour in three versions of HDF5 (1.10.5, 1.10.5 with your final patch above, and 1.10.6) are documented in the comments in the gist. The salient points are that the original test case still failed on 1.10.6 (hanging rather than crashing), and that with a slight change to the chunking parameters (a smaller chunk size but more chunks, which increases the total axis size) all versions, including your patched version, crash.

I get slightly different messages depending on whether I enable compression or not, but it’s pretty much the same across all versions. The debug error messages I get without compression are:

HDF5-DIAG: Error detected in HDF5 (1.10.6) HDF5-DIAG: Error detMPI-process 2:
  #ected in HDF5 (1.10.6) 000: H5D.c line 151 in H5Dcreate2MPI-process 3:
(): unable to create dataset
    major: Dataset
    minor  #000: H5D.c: Unable to initialize object
  #001: H5Dint.c line 337 in H5D__create_named() line 151 in H5Dcreate2(): unable to create dataset
    major:: unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002 Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line: H5L.c line 1592 in H5L_link_object(): unable to create new link to object
    major: Links
     337 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minorminor: Unable to initialize object
  #003: H5L.c line 1833 in H5L__create_real(): : Unable to initialize object
  #002: H5L.c line 1592 in H5L_link_object(): unable to create new link to object
can't insert link
    major: Links
    minor: Unable to insert object
  #004:     major: Links
    minor: Unable to initialize object
  #003:H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    m H5L.c line 1833 in H5L__create_real(): can't insert link
    major: Links
inor: Object not found
  #005: H5Gtraverse.c line 582 in H5G__traverse_real(): can't look up component
    minor: Unable to insert object
  #004: H5Gtraverse.c line 851 in    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1126 in  H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007: line 582 in H5G__traverse_real(): can't look up component
    major: Symbol table
    miH5Gobj.c line 327 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Object not found
  #006: H5Gobj.c line 1126 in H5G__obj_lookup(): nor: Can't get value
  #008: H5Omessage.c line 883 in H5O_msg_exists(can't check for link info message
    major: Symbol table
    minor: Can't get value
  #007): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #009:: H5Gobj.c line 327 in H5G__obj_get_linfo(): unable to read object header
    major: H5Oint.c line 1066 in H5O_protect(): unable to load object header
    major: Object header
    m Symbol table
    minor: Can't get value
  #008: H5Omessage.c line 883 iinor: Unable to protect metadata
  #010: H5AC.c line 1352 in H5AC_protect(): H5C_protect() failedn H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata

    major: Object cache
    minor: Unable to protect metadata
  #011: H5C.c l  #009: H5Oint.c line 1066 in H5O_protect(): unable to load object header
    majine 2298 in H5C_protect(): MPI_Bcast failed
    major: Internal error (too specific to document in detail)
    minor: Object header
    minor: Unable to protect metadata
  #010: H5AC.c lineor: Some MPI function failed
  #012: H5C.c line 2298 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
 1352 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    mino    major: Internal error (too specific to document in detail)
    minor: MPI Error String
r: Unable to protect metadata
  #011: H5C.c line 2298 in H5C_protect(): MPI_Bcast failed
rank=2 writing dataset2
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #012: H5C.c line 2298 in H5C_protect(): MPI_ERR_TRUNCATE: message truncated
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
rank=3 writing dataset2
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5Dio.c line 314 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5D.c line 337 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=3 closing everything
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5Dio.c line 314 in H5Dwrite(): dset_id is not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5D.c line 337 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
rank=2 closing everything

Any ideas what’s going wrong here?

Thanks!

jhenderson · January 27, 2020, 8:26pm

Hi @jrichardshaw,

I can’t say that I quite know what’s going on from an initial glance, but since I can see that MPI_ERR_TRUNCATE is being returned by MPI, it’s possible that the MPI ranks might be getting out of sync with each other. I don’t have high hopes that this will fix the problem, but could you try making a call to MPI_Barrier(comm) between the writing of your first dataset and the creation of the next? I get the feeling that some of the ranks may have managed to rush ahead of the other ranks and then they try creating a dataset when the other ranks are still writing to the first dataset. If the ranks that rushed ahead manage to match up an MPI_Bcast() call with the other ranks, the call will be made with mismatched arguments and is a likely candidate for causing the MPI message truncation issue.

In related news, the patch for the first issue in this thread hasn’t been merged into the development branch yet, but should be soon for the upcoming release. It should also be easy to bring that fix back to the 1.10.x series. So, I would expect that your test case will still fail in version 1.10.6, although it is surprising to me that it changes to a hang rather than a crash. Perhaps there might be something about where the ranks are getting caught in this new hang that may point us in the right direction, but for now I think that a patched 1.10.5 will probably work better for you since 1.10.6 doesn’t yet have the fix.

In any case, I’ll try to look at this in the next few days as I get some time, but anything else you might discover about the problem will definitely be useful.

jrichardshaw · January 27, 2020, 9:34pm

Hi @jhenderson. Thanks for the reply. I’ve just quickly tried inserting an MPI_Barrier call into that test case between the two dataset writes, and it didn’t seem to help, I’m afraid. If I get a spare moment, I’ll try playing around with things a little more to see if I can get any more information. Thanks!
