possible MPI implementation error?

We have changed the Lustre mount options (added the flock option)
appropriately, and that has alleviated the previous errors we were seeing.
Many thanks to Rob Latham for confirming that this was the appropriate
course of action. Now, however, I am seeing some new errors, still very
cryptic and less than informative. The errors, attached here, mention
memory corruption and memory-mapping issues. This is rather ominous.

As Rob pointed out, however, we are using an old version of mvapich,
mvapich-1.1.0-qlc. (We need this to support legacy codes.) I also recall
some discussion in the HDF5 documentation about the prevalence of bugs in
MPI implementations. Could this be an issue with our MPI implementation?

Here are some more details about when the bug shows up: The bug/error
only occurs when performing collective h5dwrite_f operations in which some
of the processes have selected null hyperslabs (h5sselect_none_f), and
only some of the time (it appears to depend on the topology of the MPI
ranks with null selections). If the data transfer property is set to
independent, and only the MPI ranks with data to write call h5dwrite_f,
then the data is written successfully. Independent IO is prohibitively
slow, however: it is more than two orders of magnitude slower than
collective IO, and it would cause this portion of the IO to take as long
as, or longer than, the computation portion of the simulation.
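
In case it helps, here is a condensed, untested sketch of the pattern I am
describing (the names -- write_slab_collective, i_have_data, and so on --
are placeholders, not our actual code):

subroutine write_slab_collective(dset_id, buf, offset, cnt, i_have_data)
  use hdf5
  implicit none
  integer(hid_t),   intent(in) :: dset_id
  double precision, intent(in) :: buf(*)
  integer(hsize_t), intent(in) :: offset(3), cnt(3)
  logical,          intent(in) :: i_have_data

  integer(hid_t)   :: filespace, memspace, xfer_plist
  integer(hsize_t) :: zero(1)
  integer          :: hdferr

  zero = 0
  ! Request collective transfer on the dataset-transfer property list.
  call h5pcreate_f(H5P_DATASET_XFER_F, xfer_plist, hdferr)
  call h5pset_dxpl_mpio_f(xfer_plist, H5FD_MPIO_COLLECTIVE_F, hdferr)

  call h5dget_space_f(dset_id, filespace, hdferr)
  if (i_have_data) then
     call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, cnt, hdferr)
     call h5screate_simple_f(3, cnt, memspace, hdferr)
  else
     call h5sselect_none_f(filespace, hdferr)            ! null file-space selection
     call h5screate_simple_f(1, zero, memspace, hdferr)  ! zero-sized memory space
  end if

  ! Collective write: ALL ranks call h5dwrite_f, including those that selected none.
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, cnt, hdferr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=xfer_plist)

  call h5sclose_f(memspace, hdferr)
  call h5sclose_f(filespace, hdferr)
  call h5pclose_f(xfer_plist, hdferr)
end subroutine write_slab_collective

The errors show up (only some of the time) during the h5dwrite_f call when
at least one rank has taken the h5sselect_none_f branch.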

Lastly, I want to thank everyone on the list for their patience with
me--I have been asking for a lot of help recently, and your responses have
been incredibly helpful.

Thank you all so much,
Izaak Beekman

error (13.2 KB)

···

===================================
(301)244-9367
Princeton University Doctoral Candidate
Mechanical and Aerospace Engineering
ibeekman@princeton.edu

UMD-CP Visiting Graduate Student
Aerospace Engineering
ibeekman@umiacs.umd.edu
ibeekman@umd.edu

MVAPICH-1.1 is based on mpich-1.2.7 (that's mpich, not mpich2: it's 7
years old).

You should probably check with valgrind just to make sure you are not
doing anything bad with memory. You are probably OK in that regard, but
valgrind will tell you for sure (mpiexec -np whatever valgrind
--log-file=myprogram.%p.vg myprogram).

Since you are stuck with mvapich-1.1 you will have to go out of your
way a bit to make collective writes work:

- you know which processors have data and which ones do not (or else
  you would not be able to call h5sselect_none_f).

- With this information you can call MPI_COMM_SPLIT

MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

where "color" would be either "have data" or "don't have data".

Processors that don't have data get to sit out this iteration.

Processors that have data participate in collective I/O: instead of
passing in MPI_COMM_WORLD, pass in the NEWCOMM from MPI_COMM_SPLIT.
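
In Fortran that might look roughly like this (an untested sketch;
i_have_data, the file name, and the dataset calls are placeholders for
whatever your code actually does):

! assumes: use hdf5; use mpi
integer        :: color, io_comm, ierr, myrank
integer(hid_t) :: fapl_id, file_id
integer        :: hdferr

call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
color = merge(1, 2, i_have_data)            ! 1 = have data, 2 = no data
call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myrank, io_comm, ierr)

if (i_have_data) then
   call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
   call h5pset_fapl_mpio_f(fapl_id, io_comm, MPI_INFO_NULL, hdferr)
   call h5fopen_f('planes.h5', H5F_ACC_RDWR_F, file_id, hdferr, access_prp=fapl_id)
   ! ... collective h5dwrite_f calls, now collective only over io_comm ...
   call h5fclose_f(file_id, hdferr)
   call h5pclose_f(fapl_id, hdferr)
end if
! Ranks without data simply sit this write out.
call MPI_COMM_FREE(io_comm, ierr)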

I suspect the "don't have data" processors change from iteration to
iteration. I guess you'll have to benchmark to see if "open, write
collectively, close" is still faster than "write independently".

==rob

···

On Wed, Nov 09, 2011 at 04:36:30PM -0500, Zaak Beekman wrote:

Here are some more details about when the bug shows up: The bug/error
only occurs when performing collective h5dwrite_f operations in which some
of the processes have selected null hyperslabs (h5sselect_none_f), and
only some of the time (it appears to depend on the topology of the MPI
ranks with null selections). If the data transfer property is set to
independent, and only the MPI ranks with data to write call h5dwrite_f,
then the data is written successfully. Independent IO is prohibitively
slow, however: it is more than two orders of magnitude slower than
collective IO, and it would cause this portion of the IO to take as long
as, or longer than, the computation portion of the simulation.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

MVAPICH-1.1 is based on mpich-1.2.7 (that's mpich, not mpich2: it's 7
years old).

You should probably check with valgrind just to make sure you are not
doing anything bad with memory. You are probably OK in that regard, but
valgrind will tell you for sure (mpiexec -np whatever valgrind
--log-file=myprogram.%p.vg myprogram).

Thanks for this tip... I'll double-check this at some point, but I have
laboriously quadruple-checked the code for issues like these, so I doubt
it will turn up anything. Still, it can't hurt to have a look.

Since you are stuck with mvapich-1.1 you will have to go out of your
way a bit to make collective writes work:

- you know which processors have data and which ones do not (or else
you would not be able to call h5sselect_none_f).

- With this information you can call MPI_COMM_SPLIT

MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

where "color" would be either "have data" or "don't have data".

Processors that don't have data get to sit out this iteration.

Processors that have data participate in collective I/O: instead of
passing in MPI_COMM_WORLD, pass in the NEWCOMM from MPI_COMM_SPLIT.

I suspect the "don't have data" processors change from iteration to
iteration. I guess you'll have to benchmark to see if "open, write
collectively, close" is still faster than "write independently".

Actually, the set of processors that have data to write is known once the
simulation reads in the input file. The problem is, we are outputting
planar slices of our 3D domain at high frequency, and the number of
slices and their orientation, position, extent, and subsampling are
specified in the input file. So the number of planes we write might be as
many as ten, and, taken together, they could involve every single MPI
rank. These planes are written every 10 to 40 iterations (restart dumps
are typically produced only after at least 2000 iterations). So we could
use MPI_COMM_SPLIT to generate communicators for each slice at the
beginning of program execution (roughly as sketched below), but closing
and reopening the file with the new communicator O(10) times every 10-40
iterations sounds like it could get costly. We need to be able to write
all the planes in ~10 seconds in order to avoid serious load-balancing
issues.
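
For reference, building the per-slice communicators once at startup would
look something like this (untested sketch; nplanes and plane_has_data are
placeholders for information that actually comes from our input file):

! assumes: use mpi
integer              :: nplanes            ! read from the input file
integer, allocatable :: plane_comm(:)
logical, allocatable :: plane_has_data(:)  ! set while parsing the input file
integer              :: p, color, ierr, myrank

call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
allocate(plane_comm(nplanes))
do p = 1, nplanes
   ! Ranks that do not intersect plane p get MPI_COMM_NULL back.
   color = merge(1, MPI_UNDEFINED, plane_has_data(p))
   call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myrank, plane_comm(p), ierr)
end do
! Every 10-40 iterations: for each plane p, the ranks with
! plane_comm(p) /= MPI_COMM_NULL open the file with that communicator,
! write collectively, and close it again -- which is the part I worry
! could get expensive.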

I think that for the time being I should just pick planes that don't
cause this error to pop up (so far the data that we REALLY care about
doesn't cause this issue to arise... I can do collective IO on some slices
even if they include ranks with null selections), and then at a later date
work on migrating the code base to be compatible with more modern
mpich/mvapich releases.


Izaak Beekman
