HDF5 1.8.10 crash with Parallel MPB

Hi Everyone,

I am trying to help one of our researchers run MPB (http://ab-initio.mit.edu/wiki/index.php/MIT_Photonic_Bands), which uses HDF5, on our clusters. With OpenMPI-based builds the process just hangs; with the MVAPICH2 stack, HDF5 fails with the error message copied below.

This only happens when using multiple cores/nodes; the code works fine for a sequential run. I could not make sense of the error messages and blindly tested a few compiler/MPI combinations, but none of them worked. I would appreciate *any* suggestions for fixing or troubleshooting this problem!
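In case it helps narrow things down, below is a rough sketch of a standalone test I plan to try next (my own minimal example, not MPB code), just to check whether a collective parallel HDF5 create/close works at all on our MPI-IO stack. The file name ptest.h5 is arbitrary:

/* Minimal parallel HDF5 sanity check: all ranks collectively create
 * and close one file through the MPI-IO driver.
 * Build with the parallel wrapper, e.g.:  h5pcc ptest.c -o ptest
 * Run with e.g.:                          mpirun -np 2 ./ptest
 */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list selecting the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* H5Fcreate and H5Fclose are collective here: every rank must call both. */
    hid_t file = H5Fcreate("ptest.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    H5Fclose(file);

    MPI_Finalize();
    return 0;
}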

Thanks in advance,
-Mehmet

···

==========================
HDF5-DIAG: Error detected in HDF5 (1.8.10-patch1) MPI-process 0:
  #000: H5F.c line 2058 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: H5I.c line 1479 in H5I_dec_app_ref(): can't decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #002: H5F.c line 1835 in H5F_close(): can't close file
    major: File accessability
    minor: Unable to close file
  #003: H5F.c line 1997 in H5F_try_close(): problems closing file
    major: File accessability
    minor: Unable to close file
  #004: H5F.c line 1142 in H5F_dest(): low level truncate failed
    major: File accessability
    minor: Write failed
  #005: H5FD.c line 1897 in H5FD_truncate(): driver truncate request failed
    major: Virtual File Layer
    minor: Can't update object
  #006: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): MPI_File_set_size failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #007: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): Invalid argument, error stack:
MPI_FILE_SET_SIZE(74): Inconsistent arguments to collective routine
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
CHECK failure on line 400 of matrixio.c: error closing HDF file
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
HDF5-DIAG: Error detected in HDF5 (1.8.10-patch1) MPI-process 1:
  #000: H5F.c line 2058 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: H5I.c line 1479 in H5I_dec_app_ref(): can't decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #002: H5F.c line 1835 in H5F_close(): can't close file
    major: File accessability
    minor: Unable to close file
  #003: H5F.c line 1997 in H5F_try_close(): problems closing file
    major: File accessability
    minor: Unable to close file
  #004: H5F.c line 1142 in H5F_dest(): low level truncate failed
    major: File accessability
    minor: Write failed
  #005: H5FD.c line 1897 in H5FD_truncate(): driver truncate request failed
    major: Virtual File Layer
    minor: Can't update object
  #006: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): MPI_File_set_size failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #007: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): Invalid argument, error stack:
MPI_FILE_SET_SIZE(74): Inconsistent arguments to collective routine
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
CHECK failure on line 400 of matrixio.c: error closing HDF file
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
[iw-h43-29:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[iw-h43-29:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

Just a wild guess, but I have seen similar error messages when accidentally
calling H5Aclose on a dataset handle (which should be closed with H5Dclose),
so make sure you are calling the matching H5?close on each handle.
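For what it's worth, here is the pattern I mean as a tiny self-contained sketch (my own example, not MPB's actual code; the names demo.h5, data and units are made up):

#include <hdf5.h>

int main(void)
{
    hid_t file  = H5Fcreate("demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dim = 4;
    hid_t space = H5Screate_simple(1, &dim, NULL);
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t attr  = H5Acreate2(dset, "units", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT);

    /* Each handle type gets its matching close call. */
    H5Aclose(attr);    /* attribute handle -> H5Aclose                  */
    H5Dclose(dset);    /* dataset handle   -> H5Dclose, not H5Aclose    */
    H5Sclose(space);   /* dataspace handle -> H5Sclose                  */
    H5Fclose(file);    /* file handle last -> H5Fclose; note that with the
                          MPI-IO driver this call is collective, so every
                          rank must reach it with the same handles open. */
    return 0;
}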

Best, G.

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Mehmet Belgin
Sent: Friday, November 15, 2013 4:15 PM
To: Hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] HDF5 1.8.10 crash with Parallel MPB
