I am trying to help one of our researchers run MPB (http://ab-initio.mit.edu/wiki/index.php/MIT_Photonic_Bands), which uses HDF5, on our clusters. With Open MPI-based builds, the process just hangs. With the MVAPICH2 stack, HDF5 fails with the error message copied below.
This only happens when using multiple cores/nodes; the code works well in a sequential run. I could not make sense of the error messages and blindly tested a few compiler/MPI combinations, but none of them works. I would appreciate *any* suggestions for fixing or troubleshooting this problem!
Thanks in advance,
-Mehmet
==========================
HDF5-DIAG: Error detected in HDF5 (1.8.10-patch1) MPI-process 0:
  #000: H5F.c line 2058 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: H5I.c line 1479 in H5I_dec_app_ref(): can't decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #002: H5F.c line 1835 in H5F_close(): can't close file
    major: File accessability
    minor: Unable to close file
  #003: H5F.c line 1997 in H5F_try_close(): problems closing file
    major: File accessability
    minor: Unable to close file
  #004: H5F.c line 1142 in H5F_dest(): low level truncate failed
    major: File accessability
    minor: Write failed
  #005: H5FD.c line 1897 in H5FD_truncate(): driver truncate request failed
    major: Virtual File Layer
    minor: Can't update object
  #006: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): MPI_File_set_size failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #007: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): Invalid argument, error stack:
MPI_FILE_SET_SIZE(74): Inconsistent arguments to collective routine
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
CHECK failure on line 400 of matrixio.c: error closing HDF file
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
HDF5-DIAG: Error detected in HDF5 (1.8.10-patch1) MPI-process 1:
  #000: H5F.c line 2058 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: H5I.c line 1479 in H5I_dec_app_ref(): can't decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #002: H5F.c line 1835 in H5F_close(): can't close file
    major: File accessability
    minor: Unable to close file
  #003: H5F.c line 1997 in H5F_try_close(): problems closing file
    major: File accessability
    minor: Unable to close file
  #004: H5F.c line 1142 in H5F_dest(): low level truncate failed
    major: File accessability
    minor: Write failed
  #005: H5FD.c line 1897 in H5FD_truncate(): driver truncate request failed
    major: Virtual File Layer
    minor: Can't update object
  #006: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): MPI_File_set_size failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #007: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): Invalid argument, error stack:
MPI_FILE_SET_SIZE(74): Inconsistent arguments to collective routine
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
CHECK failure on line 400 of matrixio.c: error closing HDF file
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
[iw-h43-29:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[iw-h43-29:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
Just a wild guess, but I have seen similar error messages when accidentally
calling H5Aclose on a dataset handle (which should be closed with H5Dclose...),
i.e., make sure you are calling the right H5?close on handles.
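In case it helps with reading the trace itself: the frames run from the public API call (#000, H5Fclose) down to the lowest layer, so the highest-numbered frames are usually the most specific. Here they point at MPI_File_set_size failing inside H5FD_mpio_truncate. Below is a rough sketch of a small script that pulls the frames out of such a log and prints the innermost one. The names parse_frames and innermost are made up for illustration, the sample is a shortened copy of the trace above, and the pattern is only an assumption based on how this particular log is laid out.

```python
import re

# Shortened sample in the same shape as the trace above (frames #002-#005 omitted).
SAMPLE = """\
HDF5-DIAG: Error detected in HDF5 (1.8.10-patch1) MPI-process 0:
  #000: H5F.c line 2058 in H5Fclose(): decrementing file ID failed
    major: Object atom
    minor: Unable to close file
  #001: H5I.c line 1479 in H5I_dec_app_ref(): can't decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
  #006: H5FDmpio.c line 1984 in H5FD_mpio_truncate(): MPI_File_set_size failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
"""

# One frame looks like: "#NNN: <file> line <n> in <function>(): <message>".
# This pattern is inferred from the log above, not taken from HDF5 documentation.
FRAME = re.compile(r"#(\d{3}): (\S+) line (\d+) in (\w+)\(\): (.*)")


def parse_frames(text):
    """Return (index, source_file, line, function, message) tuples in order of appearance."""
    return [
        (int(idx), src, int(line), func, msg.strip())
        for idx, src, line, func, msg in FRAME.findall(text)
    ]


def innermost(text):
    """Return the frame with the highest index, or None if no frame was found."""
    frames = parse_frames(text)
    return max(frames, key=lambda frame: frame[0]) if frames else None


if __name__ == "__main__":
    frame = innermost(SAMPLE)
    if frame is not None:
        idx, src, line, func, msg = frame
        print(f"#{idx:03d} {func} ({src}:{line}): {msg}")
```

When several ranks write into the same log, each rank repeats the same frame numbers, so it is probably easier to split the log per "MPI-process N" header first and run this on each part.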