We have changed the Lustre mount options (added the flock option) appropriately, and
that has alleviated the previous errors we were seeing. Many thanks to Rob
Latham for confirming that this was the appropriate course of action. Now,
however, I am seeing some new errors that are still very cryptic and less
than informative. The errors, attached here, mention memory corruption and
memory-mapping issues, which is rather ominous.
As Rob pointed out, however, we are using an old version of MVAPICH,
mvapich-1.1.0-qlc. (We need this to support legacy codes.) I also recall
some discussion in the HDF5 documentation about the prevalence of bugs in
MPI implementations. Could this be an issue with our MPI implementation?
Here are some more details about when the bug shows up: the error occurs
only during collective h5dwrite_f operations in which some of the processes
have selected null hyperslabs (via h5sselect_none_f), and even then only
some of the time (it appears to depend on which MPI ranks have the null
selections). If the data transfer property is instead set to independent,
and only the MPI ranks with data to write call h5dwrite_f, then the data is
written successfully. Independent I/O, however, is prohibitively slow: more
than two orders of magnitude slower than collective I/O, which would make
this portion of the I/O take as long as, or longer than, the computation
portion of the simulation. A minimal sketch of the collective write pattern
is included below.
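In case it helps anyone reproduce the problem or spot a misuse on my part, here is a minimal sketch of the write pattern. The file name, dataset name, sizes, and the even/odd rule deciding which ranks have empty selections are made up for illustration; our production code is considerably more involved.

```fortran
program collective_null_selection
  use mpi
  use hdf5
  implicit none

  integer :: mpierr, hdferr, rank, nprocs
  integer(hid_t)   :: fapl, dxpl, file_id, dset_id, filespace, memspace
  integer(hsize_t) :: dims(2), count(2), offset(2), memdims(1)
  real(kind=8)     :: buf(8)
  logical :: has_data

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpierr)
  call h5open_f(hdferr)

  ! One column of 8 doubles per rank in an (8 x nprocs) dataset.
  dims(1) = 8
  dims(2) = nprocs
  memdims(1) = 8
  buf = real(rank, 8)

  ! Pretend the odd ranks have nothing to write this step.
  has_data = (mod(rank, 2) == 0)

  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
  call h5fcreate_f('null_select_test.h5', H5F_ACC_TRUNC_F, file_id, hdferr, &
                   access_prp=fapl)

  call h5screate_simple_f(2, dims, filespace, hdferr)
  call h5dcreate_f(file_id, 'data', H5T_NATIVE_DOUBLE, filespace, dset_id, hdferr)
  call h5screate_simple_f(1, memdims, memspace, hdferr)

  if (has_data) then
     ! Select this rank's column of the file dataspace.
     count(1) = 8;  count(2) = 1
     offset(1) = 0; offset(2) = rank
     call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, hdferr)
  else
     ! Empty selection in both file and memory dataspaces: this rank
     ! participates in the collective call but transfers no data.
     call h5sselect_none_f(filespace, hdferr)
     call h5sselect_none_f(memspace, hdferr)
  end if

  ! Collective transfer: every rank, including those with empty
  ! selections, makes the same h5dwrite_f call.
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, hdferr)
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, hdferr)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, memdims, hdferr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl)

  call h5pclose_f(dxpl, hdferr)
  call h5pclose_f(fapl, hdferr)
  call h5sclose_f(memspace, hdferr)
  call h5sclose_f(filespace, hdferr)
  call h5dclose_f(dset_id, hdferr)
  call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program collective_null_selection
```

In the failing case all ranks make the collective call exactly as above; in the working (but slow) fallback, only the ranks with data call h5dwrite_f, with the transfer property set to H5FD_MPIO_INDEPENDENT_F.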
Lastly, I want to thank everyone on the list for their patience with
me--I have been asking for a lot of help recently, and your responses have
been incredibly helpful.
Thank you all so much,
Izaak Beekman
error (13.2 KB)
===================================
(301)244-9367
Princeton University Doctoral Candidate
Mechanical and Aerospace Engineering
ibeekman@princeton.edu
UMD-CP Visiting Graduate Student
Aerospace Engineering
ibeekman@umiacs.umd.edu
ibeekman@umd.edu