hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view

I am having failures running the hdf5-1.10.0-patch1 parallel tests testphdf5. The t_mpi test passes with no issues.

Many of the failures occur with PMPI_File_set_view on the call stack, reached from H5FD_write. I am using gcc-4.7.2 and openmpi-1.6.4 on a RHEL6 system. I am also getting failures on OSX El Capitan with gcc-4.9.4 and openmpi.

On RHEL6, eidsetw2 is one of the failing tests. The backtrace is:

[...] *** Process received signal ***
[...] Signal: Segmentation fault (11)
[...] Signal code: Address not mapped (1)
[...] Failing at address: (nil)
[...] [ 0] /lib64/libpthread.so.0() [0x3481a0f710]
[...] [ 1] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(ADIOI_Flatten+0x450) [0x7f65d968e5e0]
[...] [ 2] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(ADIOI_Flatten_datatype+0xc5) [0x7f65d9690495]
[...] [ 3] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(ADIO_Set_view+0x1da) [0x7f65d96852ca]
[...] [ 4] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_set_view+0x172) [0x7f65d9695db2]
[...] [ 5] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/libmpi.so.1(MPI_File_set_view+0x107) [0x7f65e1ae3e77]
[...] [ 6] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x5f899f) [0x7f65e242599f]
[...] [ 7] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5FD_write+0x4e0) [0x7f65e203d232]
[...] [ 8] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5F__accum_write+0x184a) [0x7f65e1ffbd84]
[...] [ 9] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5F_block_write+0x40c) [0x7f65e20023dd]
[...] [10] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x118fcf) [0x7f65e1f45fcf]
[...] [11] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__chunk_allocate+0x1af6) [0x7f65e1f443f5]
[...] [12] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x152fd3) [0x7f65e1f7ffd3]
[...] [13] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__alloc_storage+0x665) [0x7f65e1f7f95f]
[...] [14] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__layout_oh_create+0x57a) [0x7f65e1f8dee0]
[...] [15] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x14bc99) [0x7f65e1f78c99]
[...] [16] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__create+0x1162) [0x7f65e1f7a5e8]
[...] [17] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x165662) [0x7f65e1f92662]
[...] [18] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5O_obj_create+0x2ec) [0x7f65e21438e2]
[...] [19] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x2f1930) [0x7f65e211e930]
[...] [20] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x27eba0) [0x7f65e20abba0]
[...] [21] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5G_traverse+0x4ff) [0x7f65e20acd6b]
[...] [22] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x2f259a) [0x7f65e211f59a]
[...] [23] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5L_link_object+0x1d3) [0x7f65e211e6ae]
[...] [24] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__create_named+0x3d1) [0x7f65e1f75f17]
[...] [25] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5Dcreate2+0x68f) [0x7f65e1f1e703]
[...] [26] hdf5-1.10.0-patch1/testpar/.libs/lt-testphdf5(extend_writeInd2+0x598) [0x416512]
[...] [27] hdf5-1.10.0-patch1/testpar/.libs/lt-testphdf5(PerformTests+0x1ab) [0x45addc]
[...] [28] hdf5-1.10.0-patch1/testpar/.libs/lt-testphdf5(main+0x94c) [0x408949]
[...] [29] /lib64/libc.so.6(__libc_start_main+0xfd) [0x348161ed5d]
[...] *** End of error message ***

I’m not really asking anyone to debug this for me; I’m just wondering whether anyone else is having issues running the parallel tests with hdf5-1.10.0-patch1.

Thanks,
..Greg

···

--
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems"

I think the issue is related to using an older openmpi (or maybe just using openmpi). In hdf5-1.8.16, H5Dchunk.c, there is a comment about working around a bug for MPI_Type_create_hindexed_block(). The comment says that “should not have a special case for blocks == 0, but ompi (as of 1.8.1) has a bug in file_set_view when a zero size datatype is create with hindexed or hvector.”

This workaround is not in hdf5-1.10.0-patch1. My cases are failing (with openmpi-1.6.4 and openmpi-1.8.1) on processors where blocks == 0, with MPI_File_set_view in the backtrace. If I pull the workaround from 1.8.16 in H5Dchunk.c into 1.10.0-patch1, then the code makes it past this point (but then fails an assert later in the test).
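The blocks == 0 case can be illustrated with a small sketch. It is plain C with no MPI dependency, and it rests on assumptions that are not from this thread: the helper names blocks_for_rank and needs_byte_fallback are hypothetical, and the even split with the remainder handed to the lowest ranks is only an assumed distribution, not a copy of the logic in H5Dchunk.c. The MPI calls are mentioned only in comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an HDF5 function): number of chunk blocks a
 * rank would write if num_chunks are split evenly across mpi_size ranks
 * and the remainder is handed out one block per rank, lowest rank first. */
size_t blocks_for_rank(size_t num_chunks, int mpi_size, int mpi_rank)
{
    assert(mpi_size > 0 && mpi_rank >= 0 && mpi_rank < mpi_size);

    size_t blocks   = num_chunks / (size_t)mpi_size;
    size_t leftover = num_chunks % (size_t)mpi_size;

    if ((size_t)mpi_rank < leftover)
        blocks++;
    return blocks;
}

/* Hypothetical guard: returns 1 when the rank has nothing to write.
 * A workaround of the kind described in the quoted comment might then
 * pass MPI_BYTE to MPI_File_set_view instead of a zero-size type built
 * with MPI_Type_create_hindexed_block(). */
int needs_byte_fallback(size_t blocks)
{
    return blocks == 0;
}
```

For example, with 4 chunks on 6 ranks, ranks 0 through 3 get one block each and ranks 4 and 5 get none, so under these assumptions only those two ranks would take the fallback path.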

..Greg

···

From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of "Sjaardema, Gregory D" <gdsjaar@sandia.gov>
Reply-To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Date: Tuesday, October 25, 2016 at 1:20 PM
To: "hdf-forum@lists.hdfgroup.org" <hdf-forum@lists.hdfgroup.org>
Subject: [EXTERNAL] [Hdf-forum] hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view


Good hunch about ompi. OpenMPI fixed this bug a couple years back.

==rob

···

On 10/25/2016 06:41 PM, Sjaardema, Gregory D wrote:

I think the issue is related to using an older openmpi (or maybe just
using openmpi). In hdf5-1.8.16, H5Dchunk.c, there is a comment about
working around a bug for MPI_Type_create_hindexed_block(). The comment
says that “should not have a special case for blocks == 0, but ompi (as
of 1.8.1) has a bug in file_set_view when a zero size datatype is create
with hindexed or hvector.”

This fix is not in hdf5-1.10.0-patch1. My cases are failing (with
openmpi-1.6.4 and openmpi-1.8.1) on processors where blocks == 0 and
they are failing with MPI_File_set_view in the backtrace. If I pull the
workaround from 1.8.16 in H5Dchunk.c into 1.10.0-patch1, then the code
makes it past this point (but then fails an assert at a later point in
the test).

    Good hunch about ompi. OpenMPI fixed this bug a couple years back.

    ==rob

Do you happen to know the version this was fixed in? I can look, but if you know off-hand, it would save me some searching.
..Greg

···

On 11/4/16, 2:38 PM, "Hdf-forum on behalf of Rob Latham" <hdf-forum-bounces@lists.hdfgroup.org on behalf of robl@mcs.anl.gov> wrote:

    _______________________________________________
    Hdf-forum is for HDF software users discussion.
    Hdf-forum@lists.hdfgroup.org
    http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
    Twitter: https://twitter.com/hdf5