MPI, HDF, NetApp/NFS and locking

Hello HDFers--
We run a medium-scale MPI application that uses HDF for both input and
output on anywhere from 1 to 32 nodes. We run on Linux RHEL 4u7 against an
NFSv3 share on a NetApp filer (though the exact version there is unknown to
us). One application ("Temperature") writes a set of HDF files; when it
finishes, another application ("Pressure") picks those up as inputs for the
next stage. The applications prefer to imagine their inputs are all in the
same directory. They do not run simultaneously. If a user wants to run two
different Pressure scenarios off the outputs of one Temperature scenario, we
create links (hard or soft) from the original "Temperature" directory into
the two new "Pressure" directories. This worked fine for some time, but in
the last six weeks something somewhere has changed such that we now get file
locking errors when "Pressure" tries to read a file that is a link. When we
create a copy of the file instead, it works fine. This could mean that the
"Temperature" application is somehow leaking locks, or it could mean that
HDF or MPI (or the underlying filer, a NetApp device) doesn't like links.
We're not happy with either theory, as this used to work.

We are hunting for ideas on how to determine what has changed. The code has
not. A new lockd implementation? Some sort of NFS toggle? Any tools people
can suggest to see why we are getting these errors? What are the sorts of
things that happen to make ADIOI_Set_lock() (ultimately fcntl) unhappy?

A typical error, emitted on one or two of the nodes, looks like:

Error ADIOI_Set_lock() - rank = 2, err = -1, errno = 37, type = 1, offset =
2050, whence = 0, len = 2

Errno 37 on Linux is ENOLCK ("no locks available"), but a system-wide
lock-starvation scenario seems unlikely: the very same node, against the
very same NFS mount, will happily run against a copy of the original,
whereas a link to the original fails.
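
For what it's worth, here is a tiny standalone reproducer (a sketch: the
offset and length come from the error above) that lets us test the link and
the copy with MPI and HDF out of the loop:

    /* locktest.c -- try the same fcntl byte-range lock by hand.
       Build: cc -o locktest locktest.c
       Run:   ./locktest /path/to/file
       Note: on Linux the "type = 1" in the error is F_WRLCK; to mirror
       that exactly, switch to F_WRLCK and open with O_RDWR. */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct flock fl;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 2;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        memset(&fl, 0, sizeof fl);
        fl.l_type   = F_RDLCK;    /* read lock on a read-only fd */
        fl.l_whence = SEEK_SET;   /* whence = 0 in the error */
        fl.l_start  = 2050;       /* offset from the error */
        fl.l_len    = 2;          /* len from the error */

        if (fcntl(fd, F_SETLK, &fl) == -1)
            fprintf(stderr, "fcntl(F_SETLK): %s (errno=%d)\n",
                    strerror(errno), errno);
        else
            puts("lock acquired");

        close(fd);
        return 0;
    }

If this fails with ENOLCK on the link but succeeds on the copy, the problem
sits between the kernel's NFS lock client and the filer rather than in HDF
or MPI.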

Any ideas would be very welcome.

Cheers,

Sebastian

So, my questions are probably silly but I'll at least throw them out
there...

You say the 'code has not changed.' Is it statically or dynamically
linked? If the latter, are you certain that all the shared libs it's
loading are the same? Has the code been recompiled or relinked recently?
Based on your email, I'd assume not, but I had to ask.
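
One quick way to check, assuming the binary is dynamically linked (the
binary name below is made up; substitute your own): list the libraries the
loader resolves and checksum them, so two nodes, or two dates, can be
diffed:

    # list the shared libraries the binary actually resolves to
    ldd ./pressure | sort > libs.txt
    # checksum the resolved library paths for comparison
    awk '/=> \//{print $3}' libs.txt | xargs md5sum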

Probably pretty useless questions but...

Mark

···

On Wed, 2009-08-26 at 16:08, Sebastian Good wrote:

> We run a medium-scale MPI application that uses HDF for both input and
> output on anywhere from 1 to 32 nodes. [...]

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

At this point no questions are useless! It's the same binaries, not even
recompiled or relinked.

···

On Wed, Aug 26, 2009 at 6:19 PM, Mark Miller <miller86@llnl.gov> wrote:

> You say the 'code has not changed.' Is it statically or dynamically
> linked? [...]

I've run into a similar problem in the past.

In my case I had no control over the filer. What had happened was that
someone in IT ops had changed its underlying configuration: while it was
still exposing an NFS share, under the hood it was actually using NTFS (or
something along those lines), where even having a file open locked access
to it. The IT ops group tried to tell us there was nothing wrong with that
until we demonstrated otherwise. If you're running on *NIX boxes, you can
use strace/ltrace to find the exact file operation that is failing (open,
seek, read, etc.); a sketch follows.
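
For example (the binary and input names here are made up; strace's -ff
switch writes one trace file per process):

    # trace file-related syscalls on every rank, one output file per pid
    mpirun -np 4 strace -ff -e trace=open,read,fcntl,flock \
        -o /tmp/pressure.trace ./pressure run.h5
    # then look for the call that comes back with the lock error
    grep ENOLCK /tmp/pressure.trace.*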

Hope this gives you some ideas.

···

On Aug 26, 2009, at 6:19 PM, Mark Miller wrote:

> You say the 'code has not changed.' Is it statically or dynamically
> linked? [...]

On Wed, Aug 26, 2009 at 06:08:20PM -0500, Sebastian Good wrote:

> We are hunting for ideas on how to determine what has changed. The code
> has not. A new lockd implementation? Some sort of NFS toggle? Any tools
> people can suggest to see why we are getting these errors? What are the
> sorts of things that happen to make ADIOI_Set_lock() (ultimately fcntl)
> unhappy?
>
> A typical error, emitted on one or two of the nodes, looks like:
>
> Error ADIOI_Set_lock() - rank = 2, err = -1, errno = 37, type = 1,
> offset = 2050, whence = 0, len = 2

I'm not sure what's going on, but this error came from your MPI library.

> Errno 37 on Linux is ENOLCK ("no locks available"), but a system-wide
> lock-starvation scenario seems unlikely: the very same node, against the
> very same NFS mount, will happily run against a copy of the original,
> whereas a link to the original fails.

Your MPI library has to try very, very hard to get correct results out
of NFS: NFS consistency semantics are evil with respect to parallel
I/O. I can imagine how symlinks might make the locking routines a
little cranky. Your MPI library is making fcntl()-style lock calls,
if that helps.
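
If your stack is ROMIO-based, one thing that may be worth trying (a
sketch, not a guaranteed fix: the hint names are ROMIO's, whether your
MPI honors them depends on the build, and ROMIO's NFS driver may still
lock for cache consistency regardless): disable data sieving, one common
source of fcntl byte-range locks, and hand the hints to HDF5 through the
MPI-IO file-access property list:

    /* Assumes a parallel HDF5 build (H5Pset_fapl_mpio available).
       open_with_hints is a made-up helper name for illustration. */
    #include <hdf5.h>
    #include <mpi.h>

    hid_t open_with_hints(const char *name)
    {
        MPI_Info info;
        hid_t    fapl, file;

        MPI_Info_create(&info);
        /* ROMIO hints: turn off data sieving for reads and writes */
        MPI_Info_set(info, "romio_ds_read",  "disable");
        MPI_Info_set(info, "romio_ds_write", "disable");

        fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
        file = H5Fopen(name, H5F_ACC_RDONLY, fapl);

        H5Pclose(fapl);
        MPI_Info_free(&info);
        return file;
    }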

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA