Segfault when building HDF5 with MVAPICH2 2.1rc1 on SLES11 SP3

HDF Forum,

Since you helped with my last issue, I have another one that is weirdly specific. When I try to build either HDF5-1.8.12 or the latest stable HDF5-1.8.14 (with --enable-parallel) with MVAPICH2 2.1rc1 on SLES 11 SP3, the HDF5 build fails with a segfault:

libtool: link: mpicc -std=c99 -O3 -fPIC -o H5make_libsettings H5make_libsettings.o -L/discover/swdev/USER/Baselibs/TmpBaselibs/GMAO-Baselibs-4_0_6-FixHDF5/x86_64-unknown-linux-gnu/ifort/Linux/lib /discover/swdev/USER/Baselibs/TmpBaselibs/GMAO-Baselibs-4_0_6-FixHDF5/x86_64-unknown-linux-gnu/ifort/Linux/lib/libsz.a -lz -ldl -lm
LD_LIBRARY_PATH="$LD_LIBRARY_PATH`echo -lm | \
    sed -e 's/-L/:/g' -e 's/ //g'`" \
   ./H5make_libsettings > H5lib_settings.c || \
      (test $HDF5_Make_Ignore && echo "*** Error ignored") || \
      (rm -f H5lib_settings.c ; exit 1)
/bin/sh: line 4: 1838 Segmentation fault (core dumped) LD_LIBRARY_PATH="$LD_LIBRARY_PATH`echo -lm | sed -e 's/-L/:/g' -e 's/ //g'`" ./H5make_libsettings > H5lib_settings.c
make[3]: *** [H5lib_settings.c] Error 1
make[3]: Leaving directory `/gpfsm/dswdev/USER/Baselibs/TmpBaselibs/GMAO-Baselibs-4_0_6-FixHDF5/src/hdf5-1.8.14/src'

I can be fairly certain about that specificity because I've tried the following combinations (all with Intel 15.0.0.090):

   MVAPICH2 2.1rc1 on SLES 11 SP1: Works
   MVAPICH2 2.1rc1 on SLES 11 SP3: FAIL
   Intel MPI 5.0.1.135 on SLES 11 SP1: Works
   Intel MPI 5.0.1.135 on SLES 11 SP3: Works
   MPT 2.11 on SLES 11 SP3: Works

I've also tried without --enable-parallel:

   No Parallel HDF5 on SLES 11 SP3: Works

though in that case, the C compiler would be gcc not icc (since it's not calling mpicc which points to icc).

Other than that, everything else is the same in each environment.

I also tried compiling with -O0 -g -traceback and got the same failure. Looking at the core in gdb:

(gdb) backtrace
#0 0x00002aaaabfe0802 in _int_free () from /lib64/libc.so.6
#1 0x00002aaaabfe3b5c in free () from /lib64/libc.so.6
#2 0x00002aaaaf70c35d in ?? () from /lib64/libnss_sss.so.2
#3 0x00002aaaaf70c6f0 in ?? () from /lib64/libnss_sss.so.2
#4 0x00002aaaaf70a275 in _nss_sss_getpwuid_r () from /lib64/libnss_sss.so.2
#5 0x00002aaaac00fb2c in getpwuid_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6 0x00002aaaac00f37f in getpwuid () from /lib64/libc.so.6
#7 0x0000000000401993 in print_header () at H5make_libsettings.c:185
#8 0x0000000000401d3a in main () at H5make_libsettings.c:290

From this testing, it seems that the cause isn't the compiler, isn't the operating system alone, and isn't the MPI stack alone, but rather the combination of MVAPICH2 2.1rc1 and SLES 11 SP3. This cropped up because part of the supercomputer I work on has transitioned to SLES 11 SP3, and I hit it while rebuilding some libraries to diagnose other issues.

Now I have better computer engineers than I here trying to figure this out as well, but I was wondering if anyone here might know why one would fail while others succeed? That is, if you've seen something similar?

Matt

--
Matt Thompson SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246

Hi Matt,

I haven't seen this one before, so I might not be of much help.

Are you by any chance on a platform where you are cross compiling?
In that case, you need to set RUNSERIAL and RUNPARALLEL to something like "mpirun -np 1" and "mpirun -np 6" (or "aprun -n 1" and "aprun -n 6") at configure time, to force the program below to run on the compute nodes.

You can also cd into the src directory and try running this manually with:
mpirun -np 1 ./H5make_libsettings > H5lib_settings.c

Thanks,
Mohamad

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
Sent: Friday, February 06, 2015 7:53 AM
To: HDF Users Discussion List
Subject: [Hdf-forum] Segfault when building HDF5 with MVAPICH2 2.1rc1 on SLES11 SP3
