I am having problems running parallel HDF5 across more than one computer in a small cluster. I compiled HDF5 with the Intel compilers and Intel MPI. The compilation went fine, HDF5 passes all of its tests (including the parallel tests), and my code runs normally when I start an MPI job that only uses processes on a single machine.
When I try to run across more than one host, the HDF5 files come out corrupted. This happens regardless of the file contents; even writing simple attributes seems to trigger it. For example, I run the binary testpar/testphdf5 under the HDF5 tree, and if I do:
mpiexec -n 4 ./testphdf5
It passes all tests. But as soon as I force it to use more than one host (e.g. via --hostfile), it fails, for example:
(…)
Testing -- dataset independent read (idsetr)
Dataset Verify failed at [0][0](row 64, col 0): expect 6401, got 0
(…)
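For reference, this is the kind of multi-host launch I am doing (the file name and hostnames below are placeholders for my actual machines; this is a sketch of the setup, not a copy of my exact files):

```shell
# hostfile listing two machines, two slots each (names are placeholders)
cat > hosts <<EOF
node01 slots=2
node02 slots=2
EOF

# multi-host launch of the HDF5 parallel test -- this is the run that fails
mpiexec --hostfile hosts -n 4 ./testphdf5
```

(Open MPI spells the option --hostfile; with Intel MPI I pass the machine list via -f or -machinefile instead, with the same result.)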
I have run other (non-HDF5) MPI jobs across several hosts with no problems. I have also tried both Intel MPI and Open MPI and get the same error. It happens with HDF5 from both the 1.8.x and 1.10.x branches.
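As a sanity check of the launcher and network (not of HDF5), jobs along these lines run cleanly across both machines, using the same --hostfile setup as above (binary and file names are placeholders):

```shell
# confirm that ranks really land on both machines: the output should show
# a mix of both hostnames
mpiexec --hostfile hosts -n 4 hostname

# a plain MPI program (no MPI-IO, no HDF5) also runs fine across hosts
mpiexec --hostfile hosts -n 4 ./my_mpi_hello   # placeholder for my own test binary
```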
Any ideas on how to fix this problem?