Parallel HDF5 1.10.2 make check fails


#1

Hi,

I’d like to build HDF5 with parallel support, using OpenMPI compiled with the Intel 2018 compilers.

When executing make check, some tests such as t_mpi run without problems, but the parallel test testphdf5 fails with this error:


===================================
Testing testphdf5
===================================
testphdf5 Test Log
===================================
PHDF5 TESTS START
===================================
MPI-process 1. hostname=piscopia

For help use: /home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 0. hostname=piscopia

For help use: /home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 4. hostname=piscopia

For help use: /home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 2. hostname=piscopia

For help use: /home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 3. hostname=piscopia

For help use: /home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 5. hostname=piscopia

For help use: /home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup)
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup)
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup)
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test files in a
different directory or to add file type prefix. E.g.,
HDF5_PARAPREFIX=pfs:/PFS/user/me
export HDF5_PARAPREFIX
*** End of Hint ***
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup)
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup)
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup)
Testing  -- dataset using split communicators (split)
Testing  -- dataset using split communicators (split)
Testing  -- dataset using split communicators (split)
Testing  -- dataset using split communicators (split)
Testing  -- dataset using split communicators (split)
Testing  -- dataset using split communicators (split)
Testing  -- Coll Metadata file property settings (props)
Testing  -- Coll Metadata file property settings (props)
Testing  -- Coll Metadata file property settings (props)
Testing  -- Coll Metadata file property settings (props)
Testing  -- Coll Metadata file property settings (props)
Testing  -- Coll Metadata file property settings (props)
Testing  -- dataset independent write (idsetw)
Testing  -- dataset independent write (idsetw)
Testing  -- dataset independent write (idsetw)
Testing  -- dataset independent write (idsetw)
Testing  -- dataset independent write (idsetw)
Testing  -- dataset independent write (idsetw)
Testing  -- dataset independent read (idsetr)
Testing  -- dataset independent read (idsetr)
Testing  -- dataset independent read (idsetr)
Testing  -- dataset independent read (idsetr)
Testing  -- dataset independent read (idsetr)
Testing  -- dataset independent read (idsetr)
Testing  -- dataset collective write (cdsetw)
Testing  -- dataset collective write (cdsetw)
Testing  -- dataset collective write (cdsetw)
Testing  -- dataset collective write (cdsetw)
Testing  -- dataset collective write (cdsetw)
Testing  -- dataset collective write (cdsetw)
Testing  -- dataset collective read (cdsetr)
Testing  -- dataset collective read (cdsetr)
Testing  -- dataset collective read (cdsetr)
Testing  -- dataset collective read (cdsetr)
Testing  -- dataset collective read (cdsetr)
Testing  -- dataset collective read (cdsetr)
Testing  -- extendible dataset independent write (eidsetw)
Testing  -- extendible dataset independent write (eidsetw)
Testing  -- extendible dataset independent write (eidsetw)
Testing  -- extendible dataset independent write (eidsetw)
Testing  -- extendible dataset independent write (eidsetw)
Testing  -- extendible dataset independent write (eidsetw)
Testing  -- extendible dataset independent read (eidsetr)
Testing  -- extendible dataset independent read (eidsetr)
Testing  -- extendible dataset independent read (eidsetr)
Testing  -- extendible dataset independent read (eidsetr)
Testing  -- extendible dataset independent read (eidsetr)
Testing  -- extendible dataset independent read (eidsetr)
Testing  -- extendible dataset collective write (ecdsetw)
Testing  -- extendible dataset collective write (ecdsetw)
Testing  -- extendible dataset collective write (ecdsetw)
Testing  -- extendible dataset collective write (ecdsetw)
Testing  -- extendible dataset collective write (ecdsetw)
Testing  -- extendible dataset collective write (ecdsetw)
Proc 1: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 4: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 2: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 0: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 3: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 5: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[piscopia:34187] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[piscopia:34187] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Makefile:1432: recipe for target ‘testphdf5.chkexe_’ failed
make[4]: *** [testphdf5.chkexe_] Error 1
make[4]: Leaving directory ‘/home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar’
Makefile:1541: recipe for target ‘build-check-p’ failed
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory ‘/home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar’
Makefile:1412: recipe for target ‘test’ failed
make[2]: *** [test] Error 2
make[2]: Leaving directory ‘/home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar’
Makefile:1213: recipe for target ‘check-am’ failed
make[1]: *** [check-am] Error 2
make[1]: Leaving directory ‘/home/magaldi/Softwares/Hdf5/parallel_ver/hdf5-1.10.2/testpar’
Makefile:652: recipe for target ‘check-recursive’ failed
make: *** [check-recursive] Error 1
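For context, the failing ecdsetw step is a collective H5Dwrite to a chunked, extendible dataset. Below is a minimal sketch of that pattern, not the actual t_dset.c code; the file name, dataset name, and sizes are illustrative. It would be run under mpirun against a parallel HDF5 build:

```c
/* Sketch: collective write to an extendible (chunked) dataset,
 * the pattern exercised by testphdf5's ecdsetw step.
 * Illustrative only -- names and sizes are made up. */
#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* File access property list: MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Extendible 1-D dataspace (unlimited max dims) forces chunked layout */
    hsize_t dims[1]    = {(hsize_t)nprocs * 10};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1]   = {10};
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset = H5Dcreate2(file, "ext_dset", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank writes its own 10-element hyperslab, collectively */
    int buf[10];
    for (int i = 0; i < 10; i++) buf[i] = rank;
    hsize_t start[1] = {(hsize_t)rank * 10}, count[1] = {10};
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    herr_t status = H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, buf);
    if (status < 0)
        fprintf(stderr, "rank %d: collective H5Dwrite failed\n", rank);

    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return status < 0 ? 1 : 0;
}
```

If a standalone program like this also fails, the problem is in the HDF5/MPI-IO stack rather than in the test harness.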


Could you please help me find out what’s going on?
Thanks,
m


#2

I also ran the test verbosely from testpar/.libs with testphdf5 -v, obtaining the following segmentation fault:


===================================
PHDF5 TESTS START
===================================
MPI-process 0. hostname=piscopia
Collective chunk IO optimization APIs needs at least 3 processes to participate
Collective chunk IO API tests will be skipped
rr_obj_hdr_flush_confusion test needs at least 3 processes.
rr_obj_hdr_flush_confusion test will be skipped
File Image Ops daisy chain test needs at least 2 processes.
File Image Ops daisy chain test will be skipped
Atomicity tests need at least 2 processes to participate
8 is more recommended.. Atomicity tests will be skipped

For help use: ./testphdf5 -help
Linked with hdf5 version 1.10 release 2
[piscopia:12894] *** Process received signal ***
[piscopia:12894] Signal: Segmentation fault (11)
[piscopia:12894] Signal code: Address not mapped (1)
[piscopia:12894] Failing at address: (nil)
[piscopia:12894] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f48af863390]
[piscopia:12894] [ 1] ./testphdf5[0x452e01]
[piscopia:12894] [ 2] ./testphdf5[0x452849]
[piscopia:12894] [ 3] ./testphdf5[0x4077bd]
[piscopia:12894] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f48af4a8830]
[piscopia:12894] [ 5] ./testphdf5[0x406ea9]
[piscopia:12894] *** End of error message ***
Segmentation fault (core dumped)



#3

Hello!

Could you please provide us with the version of OpenMPI and the system you are on?

Thank you!

Elena


#4

Hi Elena, thanks for replying.

Sure, sorry for not having done so before.

OpenMPI: ver 3.1.0 compiled with Intel compilers 2018 ver 2.199 under Ubuntu 16.04 LTS

The system is small, with 20 cores in total: 2 x 34078 Xeon 10-core E5-2640 v4, 2.4 GHz, 25 MB cache.

Thanks,
m.


#5

I cannot start a new topic, so may I ask my question here?

Do you know whether HDF5 1.10.2 in parallel mode works with SGI MPT?

Thanks,

Haiying


#6

I’m also having this error with OpenMPI 3.1.0, but compiling with GCC 8.1.0.

Log:

Testing  testphdf5 
============================
 testphdf5  Test Log
============================
===================================
PHDF5 TESTS START
===================================
MPI-process 1. hostname=archange
MPI-process 3. hostname=archange

For help use: /build/hdf5-openmpi/src/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2

For help use: /build/hdf5-openmpi/src/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 0. hostname=archange

For help use: /build/hdf5-openmpi/src/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
MPI-process 2. hostname=archange

For help use: /build/hdf5-openmpi/src/hdf5-1.10.2/testpar/.libs/testphdf5 -help
Linked with hdf5 version 1.10 release 2
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test files in a
different directory or to add file type prefix. E.g.,
   HDF5_PARAPREFIX=pfs:/PFS/user/me
   export HDF5_PARAPREFIX
*** End of Hint ***
Test filenames are:
    ParaTest.h5
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Test filenames are:
    ParaTest.h5
Testing  -- fapl_mpio duplicate (mpiodup) 
Testing  -- fapl_mpio duplicate (mpiodup) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- dataset using split communicators (split) 
Testing  -- Coll Metadata file property settings (props) 
Testing  -- Coll Metadata file property settings (props) 
Testing  -- Coll Metadata file property settings (props) 
Testing  -- Coll Metadata file property settings (props) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent write (idsetw) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset independent read (idsetr) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective write (cdsetw) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- dataset collective read (cdsetr) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent write (eidsetw) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset independent read (eidsetr) 
Testing  -- extendible dataset collective write (ecdsetw) 
Testing  -- extendible dataset collective write (ecdsetw) 
Testing  -- extendible dataset collective write (ecdsetw) 
Testing  -- extendible dataset collective write (ecdsetw) 
Proc 3: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 1: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 2: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
Proc 0: *** Parallel ERROR ***
    VRFY (H5Dwrite succeeded) failed at line 2211 in t_dset.c
aborting MPI processes
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[archange:17540] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[archange:17540] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Command exited with non-zero status 1
0.01user 0.01system 0:00.11elapsed 30%CPU (0avgtext+0avgdata 16672maxresident)k
0inputs+0outputs (0major+3548minor)pagefaults 0swaps
make[4]: *** [Makefile:1432: testphdf5.chkexe_] Error 1
make[4]: Leaving directory '/build/hdf5-openmpi/src/hdf5-1.10.2/testpar'
make[3]: *** [Makefile:1543: build-check-p] Error 1
make[3]: Leaving directory '/build/hdf5-openmpi/src/hdf5-1.10.2/testpar'
make[2]: *** [Makefile:1413: test] Error 2
make[2]: Leaving directory '/build/hdf5-openmpi/src/hdf5-1.10.2/testpar'
make[1]: *** [Makefile:1214: check-am] Error 2
make[1]: Leaving directory '/build/hdf5-openmpi/src/hdf5-1.10.2/testpar'
make: *** [Makefile:652: check-recursive] Error 1

#7

Actually, I think this is “expected”:

Three tests fail with OpenMPI 3.0.0/GCC-7.2.0-2.29:
        testphdf5 (ecdsetw, selnone, cchunk1, cchunk3, cchunk4, and actualio)
        t_shapesame (sscontig2)
        t_pflush1/fails on exit
The first two tests fail attempting collective writes.

(Source)


#8

Hello (mmagaldi)!

I entered a bug report for us to test Parallel HDF5-1.10.2 with OpenMPI 3.1.0 compiled with Intel compilers 2018 ver 2.199 under Ubuntu 16.04 LTS.

The bug report for your reference is: HDFFV-10507

I also entered a bug report for the SGI mpt question (haiying): HDFFV-10506

A while back we tried building Parallel HDF5 with an old version of SGI mpt and could not get
it to work. We do not have access to an SGI at the moment, but it would be good to test this if we get a chance.

Thanks!
-Barbara


#9

Hello! I have a very similar problem. Could you let me know whether this has been resolved?

Thanks,
Semsi


#10

I would like to report the same problem as noted above using the repository version of HDF5 downloaded from here:

https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5/browse

and compiling with:

Intel(R) MPI Library for Linux* OS, Version 2019 Build 20180829 (id: 15f5d6c0c)
Copyright 2003-2018, Intel Corporation.

on

Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-30-generic x86_64).

Furthermore, when running make check in the testpar directory, I receive the following additional errors (extract from stderr output):

...    
980 of 1224 subtests skipped to expedite testing.
    Testing  -- Cntg hslab, col IO, chnk dsets (sscontig4) 
    Testing  -- Cntg hslab, col IO, chnk dsets (sscontig4) 
    Testing  -- Cntg hslab, col IO, chnk dsets (sscontig4) 
    Testing  -- Cntg hslab, col IO, chnk dsets (sscontig4) 
    Testing  -- Cntg hslab, col IO, chnk dsets (sscontig4) 
    Testing  -- Cntg hslab, col IO, chnk dsets (sscontig4) 
    HDF5-DIAG: Error detected in HDF5 (1.11.4) MPI-process 0:
      #000: H5Dio.c line 319 in H5Dwrite(): can't write data
        major: Dataset
        minor: Write failed
      #001: H5VLcallback.c line 2103 in H5VL_dataset_write(): dataset write failed
        major: Virtual Object Layer
        minor: Write failed
      #002: H5VLcallback.c line 2069 in H5VL__dataset_write(): dataset write failed
        major: Virtual Object Layer
        minor: Write failed
      #003: H5VLnative_dataset.c line 222 in H5VL__native_dataset_write(): can't write data
        major: Dataset
        minor: Write failed
      #004: H5Dio.c line 790 in H5D__write(): can't write data
        major: Dataset
        minor: Write failed
      #005: H5Dmpio.c line 957 in H5D__chunk_collective_write(): write error
        major: Dataspace
        minor: Write failed
      #006: H5Dmpio.c line 880 in H5D__chunk_collective_io(): couldn't finish linked chunk MPI-IO
        major: Low-level I/O
        minor: Can't get value
      #007: H5Dmpio.c line 1231 in H5D__link_chunk_collective_io(): couldn't finish MPI-IO
        major: Low-level I/O
        minor: Can't get value
      #008: H5Dmpio.c line 2121 in H5D__final_collective_io(): optimized write failed
        major: Dataset
        minor: Write failed
      #009: H5Dmpio.c line 491 in H5D__mpio_select_write(): can't finish collective parallel write
        major: Low-level I/O
        minor: Write failed
      #010: H5Fio.c line 163 in H5F_block_write(): write through page buffer failed
        major: Low-level I/O
        minor: Write failed
      #011: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
        major: Page Buffering
        minor: Write failed
      #012: H5Faccum.c line 826 in H5F__accum_write(): file write failed
        major: Low-level I/O
        minor: Write failed
      #013: H5FDint.c line 249 in H5FD_write(): driver write request failed
        major: Virtual File Layer
        minor: Write failed
      #014: H5FDmpio.c line 1631 in H5FD__mpio_write(): MPI_File_set_view failed
        major: Internal error (too specific to document in detail)
        minor: Some MPI function failed
      #015: H5FDmpio.c line 1631 in H5FD__mpio_write(): Other I/O error , error stack:
    ADIO_Set_view(48):  **iobadoverlap displacements of filetype must be in a monotonically nondecreasing order
        major: Internal error (too specific to document in detail)
        minor: MPI Error String
    Proc 0: *** Parallel ERROR ***
        VRFY (H5Dwrite() large_dataset initial write succeeded) failed at line 623 in t_shapesame.c
    aborting MPI processes
    Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
    [cli_0]: readline failed
    116.30user 4.52system 0:28.04elapsed 430%CPU (0avgtext+0avgdata 118476maxresident)k
    0inputs+0outputs (0major+117939minor)pagefaults 0swaps
...
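For anyone reading the stack above: the ADIO_Set_view complaint means the flattened file view that HDF5 handed to MPI-IO has offsets that go backwards, which ROMIO-based implementations reject. The toy program below (illustrative only; the file name and derived type are made up, not what HDF5 actually builds) constructs that kind of invalid view directly, which ROMIO would typically reject with the same iobadoverlap message:

```c
/* Sketch: a file view whose filetype displacements DECREASE,
 * the condition ROMIO's ADIO_Set_view rejects.
 * Illustrative only -- not the view HDF5 constructs internally. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "view.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* Block 1 starts BEFORE block 0: displacements {4, 0} are not
     * monotonically nondecreasing, so this filetype is invalid for a view. */
    int blocklens[2] = {1, 1};
    int displs[2]    = {4, 0};
    MPI_Datatype ftype;
    MPI_Type_indexed(2, blocklens, displs, MPI_INT, &ftype);
    MPI_Type_commit(&ftype);

    /* With the default MPI_ERRORS_RETURN handler on files, the error
     * comes back here rather than aborting. */
    int err = MPI_File_set_view(fh, 0, MPI_INT, ftype, "native",
                                MPI_INFO_NULL);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_File_set_view failed: %s\n", msg);
    }

    MPI_Type_free(&ftype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

In the make check failure the invalid view is produced inside H5FD__mpio_write for a collective chunked write, so it points at an HDF5/ROMIO interaction rather than at user code.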

Thanks in advance for any help you may be able to provide!

Jed