Large HDF5 read operation freezes without error message


The attached small (~300 line) Fortran program appears to hang
indefinitely without reporting any error messages. Can anyone here
reproduce the behavior? Try running with 128 or 256 MPI processes. Be
warned: it may write about 350 GB of data to your filesystem.

Some background:

This code is excerpted and sanitized from a larger code. Domain
decomposition is controlled by the coords(1) and coords(2) variables. Each
process has a unique pair of values obtained from the call on line #35.
Only processes with coords(1) == 0 write to the file (on lines #202 and
#273). The actual values being written to the file in this example are
meaningless. I'm using:

        OpenMPI Version 1.4.3
        HDF5 Version 1.8.7
        SLES 10.2 (Linux susedev1 #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux)
        Intel Fortran/C/C++ Version 11.1.046 (Build 20090630)

This problem doesn't occur with a smaller data set, but I don't understand
how I could be running out of memory in this situation. I'm running this
test on a set of machines where each CPU core has 8 GB of available memory.
(When I run with 128 processes, there are 16 blades running the test.
Each blade has two quad-core processors and 64 GB of memory.)

By my calculation, for each process, the two arrays of significant size are:

  phim = (16*443*440*13*10 real numbers) * (4 bytes/real) / (1e9 bytes/GB)
       = 1.62 GB

  phim_loc = (16*443*440*13 real numbers) * (4 bytes/real) / (1e9 bytes/GB)
           = 0.162 GB
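
As a sanity check, the arithmetic above can be reproduced with a short standalone program. This is only a sketch: the program, variable, and function names are made up for illustration and are not taken from the attached code, and the figure of 8 processes per blade assumes the 128-process run is spread evenly over the 16 blades.

```fortran
program array_footprint
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none

  ! Array extents are the ones quoted in the post; all names are illustrative.
  integer(int64), parameter :: bytes_per_real  = 4_int64
  ! Assumption: 128 processes spread evenly over 16 blades.
  integer(int64), parameter :: procs_per_blade = 8_int64
  integer(int64) :: n_phim_loc, n_phim, bytes_per_proc

  ! 64-bit integers are used because the per-blade byte count
  ! exceeds the range of a default 32-bit integer.
  n_phim_loc     = 16_int64 * 443_int64 * 440_int64 * 13_int64
  n_phim         = n_phim_loc * 10_int64
  bytes_per_proc = (n_phim + n_phim_loc) * bytes_per_real

  print '(a,f7.3,a)', 'phim        = ', gb(n_phim * bytes_per_real), ' GB'
  print '(a,f7.3,a)', 'phim_loc    = ', gb(n_phim_loc * bytes_per_real), ' GB'
  print '(a,f7.3,a)', 'per process = ', gb(bytes_per_proc), ' GB'
  print '(a,f7.3,a)', 'per blade   = ', gb(bytes_per_proc * procs_per_blade), ' GB'

contains

  ! Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes).
  pure function gb(nbytes) result(x)
    integer(int64), intent(in) :: nbytes
    real(real64) :: x
    x = real(nbytes, real64) / 1.0e9_real64
  end function gb

end program array_footprint
```

The per-process and per-blade totals only count these two arrays; any buffers allocated internally by MPI or HDF5 are not included.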

These should easily fit within an 8 GB envelope. Also note, I'm attempting
to run this on a Terascala Lustre filesystem.
Any ideas as to what may be going wrong here?


hdf5_mpi_freeze.f90 (9.37 KB)