Error running parallel HDF5 on more than one host

I am having problems running parallel HDF5 across more than one computer in a small cluster. I compiled HDF5 with the Intel compilers and Intel MPI. The compilation went fine, HDF5 passes all the tests (including the parallel tests), and my code runs normally when I start an MPI job that only uses processes from one machine.

When I try to run across more than one host, the HDF5 files are corrupted. This seems to happen regardless of the file contents (even writing simple attributes triggers the same problem). For example, if I run the binary testpar/testphdf5 from the HDF5 tree like this:

mpiexec -n 4 ./testphdf5

It passes all tests. But if I force it to use more than one host (e.g. via --hostfile), it fails, e.g.:
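For reference, forcing the job onto multiple hosts can look like the sketch below. The node names are placeholders, and the `--hostfile` syntax shown is Open MPI's; Intel MPI takes `-f hosts.txt` or `-machinefile` with `hostname:slots` entries instead.

```shell
# Hypothetical hostfile naming two machines in the cluster
# (node01/node02 are placeholders for the actual hostnames)
cat > hosts.txt <<'EOF'
node01 slots=2
node02 slots=2
EOF

# Open MPI syntax; with Intel MPI use: mpiexec -n 4 -f hosts.txt ./testphdf5
#   mpiexec -n 4 --hostfile hosts.txt ./testphdf5
```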

Testing – dataset independent read (idsetr)
Dataset Verify failed at [0][0](row 64, col 0): expect 6401, got 0

I have run other MPI jobs (non-HDF5) across several hosts with no problems. I have also tried both Intel MPI and OpenMPI, and get the same error. This happens with HDF5 from both the 1.8.x branch and 1.10.x branch.

Any ideas on how to fix this problem?

Can you post what parallel file system you have deployed? OrangeFS, …?

This is a small cluster and right now we have only a SAN with StorNext, no distributed file system. I’ve also tried on a different directory that was a simple NFS share, and got the same problem.

I am still confused about what you’re trying to do. Is it running pHDF5 on a cluster? If so, please note the requirements, which I copied and pasted here:

1.1. Requirements

The latter part of your answer seems to have been truncated. Are you saying that it is not possible to run pHDF5 on a SAN?

I am saying that a parallel file system is a requirement for pHDF5, as described in that document. The collective calls are built on top of an MPI-IO implementation such as ROMIO. OrangeFS is a free implementation of a parallel file system. NFS and similar file systems may not be POSIX compliant; see the issues with flock on NFS.