HDF5 MPI for Xeon Phi

Dear User Group,

I'm currently trying to run a simple MPI program using NetCDF 4.3.3 and HDF5 1.8.16 on multiple Xeon Phi nodes.
Each node consists of a host processor (Ivy Bridge) and two Intel Xeon Phi coprocessors.

The relevant part of the source code is shown below (the complete code is attached):

// ...

   /* Open the mesh file for parallel access via MPI-IO */
   checkNcError(nc_open_par(meshFile, NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD, MPI_INFO_NULL, &ncFile));

   /* Look up the per-rank "element_size" variable */
   checkNcError(nc_inq_varid(ncFile, "element_size", &ncVarElemSize));

   /* Switch the variable to collective parallel access */
   checkNcError(nc_var_par_access(ncFile, ncVarElemSize, NC_COLLECTIVE));

   /* Each rank reads the single value at the index equal to its rank */
   size_t start[1] = {rank};
   int elem_size;
   checkNcError(nc_get_var1_int(ncFile, ncVarElemSize, start, &elem_size));

   printf("Rank %d> Number of elements: %d\n", rank, elem_size);

// ...
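
For context, ncFile and ncVarElemSize are plain int handles declared earlier, and rank comes from the usual MPI setup before this block, roughly:

   int rank;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);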

Here, meshFile is an unstructured grid in NetCDF format that contains one
element_size value per rank.
(checkNcError checks the return value of a NetCDF call and aborts on error.)
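
A minimal sketch of such a helper, which may differ slightly from the attached version, is:

   /* requires <stdio.h>, <netcdf.h>, <mpi.h> */
   static void checkNcError(int err)
   {
      if (err != NC_NOERR) {
         fprintf(stderr, "NetCDF error: %s\n", nc_strerror(err));
         MPI_Abort(MPI_COMM_WORLD, 1);
      }
   }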

Running the program on a single node (i.e. one host and two coprocessors, 3 ranks) succeeds:

mpiexec -host host -n 1 prog.host cube_36_10_10_3_1_1.nc : -host host-mic1 -n 1 prog.mic cube_36_10_10_3_1_1.nc : -host host-mic0 -n 1 prog.mic cube_36_10_10_3_1_1.nc

Reading file: cube_36_10_10_3_1_1.nc
Rank 0> Number of elements: 6000
Rank 2> Number of elements: 6000
Rank 1> Number of elements: 6000

However, executing on two nodes (i.e. 6 ranks with an adjusted mesh file) fails:

mpiexec -host host1 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host1-mic1 -n 1 prog.mic cube_36_10_10_6_1_1.nc : -host host1-mic0 -n 1 prog.mic cube_36_10_10_6_1_1.nc : -host host2 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host2-mic1 -n 1 prog.mic cube_36_10_10_6_1_1.nc : -host host2-mic0 -n 1 prog.mic cube_36_10_10_6_1_1.nc

Reading file: cube_36_10_10_6_1_1.nc

Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(2434)..................: MPI_Bcast(buf=0x10b3ffc, count=1, MPI_INT, root=0, comm=0x84000000) failed
MPIR_Bcast_impl(1807).............:
MPIR_Bcast(1835)..................:
I_MPIR_Bcast_intra(2016)..........: Failure during collective
MPIR_Bcast_intra(1665)............: Failure during collective
MPIR_Bcast_intra(1634)............:
MPIR_Bcast_binomial(245)..........:
MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 2 truncated; 24 bytes received but buffer size is 4
Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(2434)..................: MPI_Bcast(buf=0x208bfdc, count=1, MPI_INT, root=0, comm=0x84000000) failed
MPIR_Bcast_impl(1807).............:
MPIR_Bcast(1835)..................:
I_MPIR_Bcast_intra(2016)..........: Failure during collective
MPIR_Bcast_intra(1665)............: Failure during collective
MPIR_Bcast_intra(1634)............:
MPIR_Bcast_binomial(245)..........:
MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 2 truncated; 24 bytes received but buffer size is 4

To test the environment with 6 ranks, I also executed the program on 6 host nodes only (no coprocessors), which again succeeded:

mpiexec -host host1 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host2 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host3 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host4 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host5 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host6 -n 1 prog.host cube_36_10_10_6_1_1.nc

Reading file: cube_36_10_10_6_1_1.nc

Rank 1> Number of elements: 3000
Rank 4> Number of elements: 3000
Rank 3> Number of elements: 3000
Rank 2> Number of elements: 3000
Rank 5> Number of elements: 3000
Rank 0> Number of elements: 3000

Using DDT, I was able to determine that the error occurs somewhere inside H5Fopen().
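
To narrow this down further, a next step could be to open the file directly through the parallel HDF5 API and bypass NetCDF entirely. A minimal sketch of such a test (the file name is taken from the command line, and a parallel HDF5 build with the MPI-IO driver is assumed) looks like this:

   #include <stdio.h>
   #include <mpi.h>
   #include <hdf5.h>

   int main(int argc, char **argv)
   {
      MPI_Init(&argc, &argv);

      /* File access property list using the MPI-IO driver */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

      /* Open the mesh file in parallel, as NetCDF does internally */
      hid_t file = H5Fopen(argv[1], H5F_ACC_RDONLY, fapl);
      if (file < 0)
         fprintf(stderr, "H5Fopen failed\n");
      else
         H5Fclose(file);

      H5Pclose(fapl);
      MPI_Finalize();
      return 0;
   }

If this already fails across two nodes, the problem would be independent of NetCDF.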

I would be very thankful for any help. Kind regards,

Leo

prog.c (1.13 KB)