Problems with parallel I/O

We use HDF5 for parallel I/O in VORPAL, our laser-plasma simulation code. For the most part it works fine, but on certain machines (e.g., early Cray and BG/P) and certain types of filesystems we've noticed that parallel I/O hangs. We therefore added a -id (individual dump) option, which causes each MPI rank to dump its own HDF5 file; once the simulation is complete, we merge the individual dump files.
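
(For context, the -id fallback amounts to roughly the following sketch; the file name pattern and dataset here are hypothetical, not VORPAL's actual scheme. Each rank writes an ordinary serial HDF5 file and the files are merged after the run.)

    #include <hdf5.h>
    #include <mpi.h>
    #include <stdio.h>

    /* Sketch of a per-rank dump: no parallel HDF5 involved at all. */
    void dump_per_rank(const double *data, hsize_t n)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char fname[64];
        snprintf(fname, sizeof fname, "dump_rank%04d.h5", rank);  /* hypothetical naming */

        hid_t file  = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, &n, NULL);
        hid_t dset  = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
        H5Dclose(dset); H5Sclose(space); H5Fclose(file);
    }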

We have a customer for whom parallel I/O is hanging, and they are using -id as described above. We're trying to pinpoint why parallel I/O is not working on their system, which is a CentOS 5.5 cluster.

In the past we ourselves have had problems with parallel I/O failing on ext3 filesystems, so we reformatted as XFS and the problem went away. Our customer did this, but the problem still persists.

Anyone have any words of wisdom as to what other things could cause parallel I/O to hang?

Thanks for any help!
Dave

Hi Dave,

One common hang with collective-mode parallel I/O in HDF5 is when only
a subset of processes are participating in the I/O, but the other
processes haven't made an empty selection (to say that they are not
participating) using H5Sselect_none(). Also, have you tried
experimenting with collective vs. independent mode?
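
Roughly, every rank has to enter the collective H5Dwrite even if it has nothing to write. A minimal sketch (the dataset and dataspace setup is assumed to come from elsewhere):

    #include <hdf5.h>

    /* All ranks call this; ranks with no data pass have_data = 0 and buf may
     * point to a dummy buffer. */
    void write_collective(hid_t dset, hid_t filespace, hid_t memspace,
                          const double *buf, int have_data)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* or H5FD_MPIO_INDEPENDENT */

        if (!have_data) {
            /* Non-participating ranks still make the call, with empty selections. */
            H5Sselect_none(filespace);
            H5Sselect_none(memspace);
        }
        /* Every rank must reach this line, or the collective can hang. */
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
        H5Pclose(dxpl);
    }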

Mark

It would be really helpful to see the state of these processes when a
hang occurs. Are they stuck in an I/O call? Stuck in a collective
because not everyone participated? If they are stuck in a collective,
is it an I/O collective or a messaging collective?

How parallel is this program? If we're talking 4-way or 8-way
parallelism, then maybe one can run it in gdb and collect a backtrace
of all the processes? (mpiexec -np 8 xterm -e gdb ...)

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Mark,

The same code hangs on the customer's machine but works fine on our clusters. Would that be possible if some subset of processes weren't participating in the I/O?

Thanks,
Dave

I guess it could depend on the MPI library, but most likely not. What
parallel file system is used on the customer's machine?

Mark

As to MPI, we're both using Open MPI 1.4.1.

We're both using NFS filesystems, which are formatted as XFS. As I mentioned, we had problems with ext3 filesystems, which were alleviated when we reformatted as XFS. Unfortunately, that didn't work for the customer.

Thanks,
Dave

Have you tried using a benchmark like IOR to stress the NFS file
system? Maybe it is a problem with NFS and not the underlying file
system or HDF5.

Mark

As the guy responsible for the MPI-IO library underneath HDF5, I can
tell you that NFS is an awful, awful choice for parallel I/O. The
MPI-IO library will make a best effort to ensure correct behavior, but
NFS consistency semantics are such that you really cannot guarantee
correct behavior.

The MPI-IO library (ROMIO) wraps each I/O operation in an fcntl lock
in an effort to ensure that client-side data gets flushed. Those
fcntl locks are advisory, but even so, sometimes servicing those lock
calls can take an inordinately long time.
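
For illustration only (this is not ROMIO's actual code), the pattern is roughly an advisory byte-range lock taken around each read/write, which is what forces the NFS client to flush and revalidate its cache:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Take or release an advisory byte-range lock. F_SETLKW blocks until the
     * lock is granted, and that blocking is where an NFS server can stall
     * for a very long time. */
    static int lock_range(int fd, short type, off_t start, off_t len)
    {
        struct flock fl;
        fl.l_type   = type;      /* F_WRLCK to acquire, F_UNLCK to release */
        fl.l_whence = SEEK_SET;
        fl.l_start  = start;
        fl.l_len    = len;
        return fcntl(fd, F_SETLKW, &fl);
    }

    ssize_t locked_write(int fd, const void *buf, size_t count, off_t offset)
    {
        lock_range(fd, F_WRLCK, offset, (off_t)count);
        ssize_t n = pwrite(fd, buf, count, offset);
        lock_range(fd, F_UNLCK, offset, (off_t)count);
        return n;
    }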

As a disclaimer, I'm closely affiliated with the PVFS project.

I'd suggest setting up PVFS:
- it is both no-cost and open source,
- it is fairly straightforward to build, install, and configure,
- it requires only a small kernel module (and in fact you don't
  strictly need that for MPI-IO),
- the MPI-IO library contains PVFS-specific optimizations.

You could run a one-server PVFS instance on your NFS server.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Wasn't aware of IOR, thanks for the tip. We'll give that a try.

Dave
