Hi Wolf,
I found the problem in your program. Note that the hang versus the error stack (from Tim's email) is just a difference in behavior between MPI implementations or versions. One implementation hangs when MPI_File_set_size() is called from inside HDF5 with different arguments on different processes, and the other actually reports the error.
On to the mistake in your program now. HDF5 requires the call to H5Dcreate to be collective. That means not only that all processes have to call it, but also that they all have to call it with the same arguments. You are creating a chunked dataset with the same chunk dimensions on every process except the last one, where you modify the first dimension (nxLocal). This happens here:
    if ((nx%iNumOfProc) != 0) {
        nxLocal += 1;
        ixStart = myID*nxLocal;
        if (myID == iNumOfProc-1)
            nxLocal -= (nxLocal*iNumOfProc-nx); // last proc has less elements
    }
You pass nxLocal to the chunk dimensions here:
chunk_dims[0] = nxLocal;
As long as 32 % numofprocesses is 0, you don't modify nxLocal on the last process, which explains why it works in those situations.
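To make this concrete, here is a small sketch that replays the snippet above for each rank and checks whether all ranks end up with the same value for chunk_dims[0]. It is only an illustration: the function names are made up, and it assumes nxLocal starts out as nx / iNumOfProc, which is not shown in the snippet.

```c
#include <assert.h>

/* Replays the decomposition logic quoted above for a single rank.
   ASSUMPTION: nxLocal is initialized to nx / iNumOfProc (integer
   division); that line is not part of the quoted snippet. */
int chunk_rows_seen_by_rank(int nx, int iNumOfProc, int myID)
{
    assert(iNumOfProc > 0 && myID >= 0 && myID < iNumOfProc);

    int nxLocal = nx / iNumOfProc;   /* assumed starting value */
    if ((nx % iNumOfProc) != 0) {
        nxLocal += 1;
        if (myID == iNumOfProc - 1)
            nxLocal -= (nxLocal * iNumOfProc - nx);   /* last rank is trimmed */
    }
    return nxLocal;   /* the value that ends up in chunk_dims[0] */
}

/* Returns 1 if every rank would pass the same chunk_dims[0], else 0. */
int chunk_rows_agree(int nx, int iNumOfProc)
{
    int first = chunk_rows_seen_by_rank(nx, iNumOfProc, 0);
    for (int r = 1; r < iNumOfProc; ++r) {
        if (chunk_rows_seen_by_rank(nx, iNumOfProc, r) != first)
            return 0;
    }
    return 1;
}
```

With nx = 32, four processes all get 8, so the call is consistent. With three processes, ranks 0 and 1 get 11 while rank 2 gets 10, so the ranks disagree on the chunk dimensions.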
Note that it is OK to read and write datasets collectively with different arguments, but you have to create the dataset with the same arguments on every process, including the same chunk dimensions. What you do above causes one process to see a dataset with different chunk sizes in its metadata cache. At file close time, when the processes flush their metadata caches, that process has a different idea of the file size than the others, and this is what causes the problem.
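One possible way to restructure this, as a rough sketch rather than a drop-in patch: keep the chunk extent and the per-rank row count as two separate quantities. The chunk extent depends only on nx and the number of processes, so it is identical everywhere; the row count may differ per rank and is only used for the read/write selection. The helper names below are hypothetical and not part of your program or of HDF5.

```c
#include <assert.h>

/* Chunk extent along dimension 0. It depends only on nx and nproc
   (ceiling division), so every rank computes the same value. */
int uniform_chunk_rows(int nx, int nproc)
{
    assert(nx > 0 && nproc > 0);
    return (nx + nproc - 1) / nproc;
}

/* First row owned by this rank, clamped to nx so that trailing
   ranks never start past the end of the array. */
int local_start(int nx, int nproc, int rank)
{
    int start = rank * uniform_chunk_rows(nx, nproc);
    return start < nx ? start : nx;
}

/* Number of rows this rank reads or writes. Trailing ranks may get
   fewer rows than the chunk extent, possibly zero. */
int local_rows(int nx, int nproc, int rank)
{
    int start = local_start(nx, nproc, rank);
    int end = start + uniform_chunk_rows(nx, nproc);
    if (end > nx)
        end = nx;
    return end - start;
}
```

In this arrangement, every rank would set chunk_dims[0] from uniform_chunk_rows() before the collective H5Dcreate, and use local_start() and local_rows() only for the offset and count of its own hyperslab selection. A rank that ends up with zero rows would still have to take part in the collective calls, for example with an empty selection.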
Makes sense?
Thanks,
Mohamad
-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Wolf Dapp
Sent: Tuesday, April 07, 2015 11:30 AM
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] parallel HDF5: H5Fclose hangs when not using a power of 2 number of processes
Dear hdf-forum members,
I have a problem I am hoping someone can help me with. I have a program that outputs a 2D-array (contiguous, indexed linearly) using parallel HDF5. When I choose a number of processors that is not a power of 2 (1,2,4,8,...) H5Fclose() hangs, inexplicably. I'm using HDF5 v.1.8.14, and OpenMPI 1.7.2, on top of GCC 4.8 with Linux.
Can someone help me pinpoint my mistake?
I have searched the forum, and the first hit [searching for "h5fclose hangs"] was a user mistake that I didn't make (to the best of my knowledge). The second didn't go on beyond the initial problem description, and didn't offer a solution.
Attached is a (maybe insufficiently bare-boned, apologies) demonstrator program. Strangely, the hang only happens if nx >= 32. The code is adapted from an HDF5 example program.
The demonstrator is compiled with
h5pcc test.hangs.cpp -DVERBOSE -lstdc++
(On my system, for some strange reason, MPI has been compiled with the deprecated C++ bindings, so I also need to add -lmpi_cxx, but that shouldn't be necessary for anyone else. I hope that's not the reason for the hang-ups.)
Thanks in advance for your help!
Wolf Dapp
--