I am new to MPI and am trying to run it with parallel HDF5. I set up a
cluster of 2 nodes with GFS2 and DRBD, and in my shared folder I compiled
the example provided at:
http://www.hdfgroup.org/HDF5/Tutor/pcrtaccd.html
When I run it from the shared folder, the following error occurs:

Fatal error in PMPI_Comm_dup: Other MPI error, error stack:
PMPI_Comm_dup(176)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0xff9a56f8) failed
PMPI_Comm_dup(161)............:
MPIR_Comm_dup_impl(55)........:
MPIR_Comm_copy(967)...........:
MPIR_Get_contextid(521).......:
MPIR_Get_contextid_sparse(683):
MPIR_Allreduce_impl(712)......:
MPIR_Allreduce_intra(357).....:
dequeue_and_set_error(596)....: Communication error with rank 0
HDF5: infinite loop closing library
D,T,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
Fatal error in PMPI_Comm_dup: Other MPI error, error stack:
PMPI_Comm_dup(176)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0xbf940f28) failed
PMPI_Comm_dup(161)............:
MPIR_Comm_dup_impl(55)........:
MPIR_Comm_copy(967)...........:
MPIR_Get_contextid(521).......:
MPIR_Get_contextid_sparse(683):
MPIR_Allreduce_impl(712)......:
MPIR_Allreduce_intra(357).....:
dequeue_and_set_error(596)....: Communication error with rank 1
HDF5: infinite loop closing library
D,T,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
When I run the code on a single node it works well, and the Hello World
example also runs correctly on my cluster.
Could anyone help me figure out the origin of the problem?
Thank you.