pHDF5 independent vs collective IO

Hello All,

I'm having some difficulty understanding how performance should differ between independent and collective IO.

At the moment, I'm trying to write regular hyperslabs that span an entire 40 GB dataset (writing to Lustre, Intel MPI). Independent IO seems to be quite a bit faster (a 30-second difference on 64 machines). What factors might be contributing to this difference in performance?
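
For reference, the write path looks roughly like this; the only HDF5-side difference between the two runs is the transfer property list (dset, memspace, filespace and buf are placeholders, not our actual code):

#include <hdf5.h>

static herr_t write_slab(hid_t dset, hid_t memspace, hid_t filespace,
                         const double *buf, int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

    /* The only difference between the two modes is this transfer setting. */
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);

    /* In collective mode every rank must take part in the H5Dwrite call,
       and the MPI-IO layer may aggregate the requests (two-phase IO). */
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                             dxpl, buf);
    H5Pclose(dxpl);
    return status;
}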

Also, in both cases I seem to be getting a strange slowdown at 32 machines. In almost all my tests, 16 and 64 machines both perform better than 32.

Thanks! David

Hello All,

I'm having some difficulty understanding how performance should differ
between independent and collective IO.

At the moment, I'm trying to write regular hyperslabs that span an
entire 40 GB dataset (writing to Lustre, Intel MPI). Independent IO seems
to be quite a bit faster (a 30-second difference on 64 machines). What
factors might be contributing to this difference in performance?

While much of Intel MPI is based on MPICH, I cannot say for certain what Lustre optimizations they have enabled -- if any.

First, ensure the stripe count for your Lustre file is larger than the default of 4. For parallel file access, you should stripe across all OSTs.
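
If you want to set the striping from the application rather than with lfs setstripe on the directory, something along these lines is the usual route. "striping_factor" and "striping_unit" are the standard ROMIO hint names, the values below are only illustrative, the hints only take effect when the file is first created, and I can't promise Intel MPI's Lustre driver honors them:

#include <hdf5.h>
#include <mpi.h>

static hid_t create_striped_file(const char *path, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");     /* e.g. one stripe per OST */
    MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MiB stripes */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);              /* hints travel with the FAPL */

    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}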

It looks like Intel MPI requires additional environment variables to enable file-system-specific optimizations: section 3.5.8 of the Intel MPI Reference Manual suggests you do the following:

* Set the I_MPI_EXTRA_FILESYSTEM environment variable to on to enable parallel file system support

* Set the I_MPI_EXTRA_FILESYSTEM_LIST environment variable to "lustre" for the lustre-optimized driver

https://software.intel.com/sites/products/documentation/hpc/ics/icsxe2013sp1/lin/icsxe_gsg_files/How_to_Use_the_Environment_Variables_I_MPI_EXTRA_FILESYSTEM_and_I_MPI_EXTRA_FILESYSTEM_LIST.htm
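
Those two variables are normally exported in the job script before mpirun. If you'd rather set them from the program itself, calling setenv before MPI_Init is the usual trick, though whether the library picks the values up at that point is implementation-specific, so treat this as a sketch rather than a guarantee:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Same settings as above, just set programmatically before MPI_Init. */
    setenv("I_MPI_EXTRA_FILESYSTEM", "on", 1);          /* parallel FS support */
    setenv("I_MPI_EXTRA_FILESYSTEM_LIST", "lustre", 1); /* Lustre-optimized driver */

    MPI_Init(&argc, &argv);
    /* ... parallel HDF5 work ... */
    MPI_Finalize();
    return 0;
}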

If you use the "tuned for lustre" option, do you see better performance?

thanks
==rob

···

On 06/06/2014 08:51 AM, Zmick, David wrote:

Also, in both cases I seem to be getting a strange slowdown at 32
machines. In almost all my tests, 16 and 64 machines both perform better
than 32.

Thanks! David

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Thanks for your reply Rob,

We do have the stripe count set to use all available OSTs, as well as the Lustre flags for Intel MPI.

With the Lustre optimizations turned on, we still see collective IO top out at 1 GB/sec regardless of the number of machines. Independent IO scales and performs as we would expect. We have also noticed that the code seems to spend a lot of time in MPI_Allreduce.

I am writing to an X by Y by T dataset. Each node writes X/nodes slices, each Y by T, to the dataset. These slices are sequential, so essentially each node is doing large, sequential IO to a different part of the file.
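
Roughly, each rank's selection looks like the following (names are illustrative, not our actual code; assume X divides evenly by the number of ranks):

#include <hdf5.h>

static void select_rank_block(hid_t filespace, hsize_t X, hsize_t Y, hsize_t T,
                              int rank, int nranks)
{
    hsize_t slices = X / (hsize_t)nranks;

    hsize_t start[3] = { (hsize_t)rank * slices, 0, 0 };
    hsize_t count[3] = { slices, Y, T };

    /* One regular hyperslab per rank: a large, contiguous region of the
       file, written sequentially. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
}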

I know that using collective IO should hurt this access pattern somewhat, but it should not hurt to the degree we are seeing.

We are not likely to use collective IO, but we would like to find a resource that explains how collective IO actually works today.

Thanks!
