HDF5 (?) crashes computer when interrupting a write

Dear forum members,

I have observed an annoying occurrence many times now. I'm running
parallel HDF5 (1.8.14) on top of OpenMPI (1.7.2) with gcc (4.8.1) on
openSUSE Linux (13.1). The storage is located on an NFS server.

Running on typically 4 cores, I'm writing relatively large files (at
least several hundred MB, sometimes many GB) in parallel with HDF5.
Sometimes I have to interrupt the code with Ctrl+C during such
a write operation (often because of user error). Occasionally, this
causes a catastrophic hang, and I get the error message:
kernel BUG: soft lockup - CPU stuck for 23s!

This invariably causes a violent system crash after a very short
time. I have observed this on at least 5 different machines (same
software stack), so I don't believe it is a hardware problem. Since
these lockups only happen during interrupted write operations, I suspect
that the HDF5 library causes them in some way, possibly by not freeing
some resources.
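
One way to avoid interrupting a write midway is to record Ctrl+C and only act on it between writes, so that files can still be closed normally. The following is a minimal sketch of that idea, not something taken from HDF5 or OpenMPI: `DeferredInterrupt`, `write_all` and `write_chunk` are hypothetical names, and `write_chunk` merely stands in for whatever call performs the real write.

```python
import signal


class DeferredInterrupt:
    """Record SIGINT (Ctrl+C) in a flag instead of raising KeyboardInterrupt."""

    def __init__(self):
        self.requested = False
        self._previous = None

    def _handler(self, signum, frame):
        # Only set a flag; the write loop decides when it is safe to stop.
        self.requested = True

    def __enter__(self):
        # signal.signal() must be called from the main thread.
        self._previous = signal.signal(signal.SIGINT, self._handler)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the handler that was active before, if Python knows it.
        if self._previous is not None:
            signal.signal(signal.SIGINT, self._previous)
        return False


def write_all(chunks, write_chunk):
    """Write chunks one at a time; stop between chunks after Ctrl+C.

    write_chunk is a hypothetical stand-in for the real write call.
    Returns the number of chunks that were written completely.
    """
    written = 0
    with DeferredInterrupt() as interrupt:
        for chunk in chunks:
            if interrupt.requested:
                break  # stop cleanly; no write is cut off halfway
            write_chunk(chunk)
            written += 1
    return written
```

This only covers a single process. With collective parallel writes, all ranks would presumably have to agree on stopping at the same point, which is not shown here, and the sketch does nothing about the kernel-side lockup itself.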

Of course, it could also be caused by OpenMPI. Because the problem is so
disruptive, I am not keen to trigger it too often. I
cannot easily try a different (or newer) MPI implementation. It might
also be caused by the fact that I'm not writing to a physical drive, but
to an NFS drive.

Hence a general question, without appending example code: Has anyone
observed this behavior before, and if so, is there a fix? Am I blaming
HDF5 unfairly, and is another cause more likely? If this error is
unheard of, it's most likely caused by my setup...

Thanks,
Wolf

--

Have you seen
<http://lists.opensuse.org/archive/opensuse-bugs/2014-06/msg01135.html>?
What kernel version are you running?

A "kernel BUG" needs to be fixed in the kernel, but of course some
kernel bugs are triggered by buggy user/library code.

On Fri, May 22, 2015 at 9:08 AM, Wolf Dapp <wolf.dapp@gmail.com> wrote:

Of course, it could also be caused by OpenMPI. Due to the highly
disruptive nature of the problem, I am not keen to try it too often. I
cannot easily try a different (or newer) MPI implementation. It might
also be caused by the fact that I'm not writing to a physical drive, but
a NFS drive.

Hence a general question, without appending example code: Has anyone
observed this behavior before, and if so, is there a fix? Am I blaming
HDF5 unfairly, and another cause is more likely? If this error is
unheard of, it's most likely caused by my setup...

Thanks,
Wolf

--

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

> Have you seen
> <http://lists.opensuse.org/archive/opensuse-bugs/2014-06/msg01135.html>?
> What kernel version are you running?

Hello George,

Thanks for your response. I had indeed not seen this particular bug
report. My searches had revealed plenty of such "soft lockup" problems,
but most of them were apparently related to hardware. Alas, as in other
bug reports, no actual cause or patch is mentioned in this one, except
that the error no longer appears in kernel 3.15.

I am indeed running kernel 3.11.10-29-desktop. Unfortunately, I won't be
able to try 3.15 for a while...
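
To see quickly which machines run a kernel older than 3.15, the release string can be compared against that version. This is only a rough sketch: the 3.15 threshold is simply the version named in the linked bug report, the function names are made up for illustration, and a distribution kernel may carry a backported fix that the version number does not reveal.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string like '3.11.10-29-desktop'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def predates(release, fixed_in=(3, 15)):
    """Return True if the kernel release is older than fixed_in.

    fixed_in defaults to (3, 15), the version the bug report says no
    longer shows the error; treat it as an assumption, not a guarantee.
    """
    return kernel_version(release) < fixed_in


if __name__ == "__main__":
    release = platform.release()  # release string of the running kernel
    status = "older than" if predates(release) else "at least"
    print(f"{release}: {status} 3.15")
```

The result is a hint about which machines to look at first, not proof that a given kernel is affected.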

> A "kernel BUG" needs to be fixed in the kernel, but of course some
> kernel bugs are triggered by buggy user/library code.

Since the error is quite reproducible by interrupting a parallel HDF5
write (and has not happened to me under any other circumstance), I was
thinking that it may have to do with HDF5, and that other HDF5 users may
have encountered it.

Cheers,
Wolf

On 22.05.2015 at 16:29, George N. White III wrote:

--

The patch looks simple:

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/fs/nfs/pagelist.c?id=92a56555bd576c61b27a5cab9f38a33a1e9a1df5

You could prepare a kernel with the patch and, the next time a system
crashes, boot it with the patched kernel.

On Fri, May 22, 2015 at 12:08 PM, Wolf Dapp <wolf.dapp@gmail.com> wrote:

On 22.05.2015 at 16:29, George N. White III wrote:

> A "kernel BUG" needs to be fixed in the kernel, but of course some
> kernel bugs are triggered by buggy user/library code.

Since the error is quite reproducible by interrupting a parallel HDF5
write (and has not happened to me under any other circumstance), I was
thinking that it may have to do with HDF5, and that other HDF5 users may
have encountered it.

Cheers,
Wolf

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

Hello George,

I'll give that a try. It sure sounds as if that could be the issue
troubling me. Thanks for your help and advice!

Cheers,
Wolf

On 22.05.2015 at 17:41, George N. White III wrote:

The patch looks simple:

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/fs/nfs/pagelist.c?id=92a56555bd576c61b27a5cab9f38a33a1e9a1df5

You could prepare a kernel with the patch and next time a system crashes
boot it with the patched kernel.

--