Performance anomalies

Hello,

I wanted some expert opinion on some HDF5 work that I am doing. To simulate
the I/O of our nuclear physics code, I wrote a test program. I now have a
(quite large) HDF5 file. The program reads this HDF5 file (here I measure the
read performance) and then writes the data out into another HDF5 file (here I
measure the write performance).

On measuring the timings, I am getting some odd results. For a ~2 GB file,
the read time is usually quite fast (with some outliers), but the write times
are quite large: for example, read time = 4 s and write time = 40 s (approximately).
The wrappers I have written for the reads and writes are very similar, and I
don't see much implementation difference between them. Could someone suggest
what I may be doing wrong?

One more thing: although the file size is around 2 GB, a single read/write
operation only involves around 9 MB of data. For example, 15 processors each
write 9 MB, and this runs in a loop 15 times (15 * 15 * 9 MB ≈ 2 GB). Could
this small transfer size be causing the slow speed?
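
To give an idea of the access pattern, here is a simplified sketch of what each
process does (the names below are made up for illustration and are not the
actual attached code):

SUBROUTINE write_pattern(dset_id, filespace, xfer_prp, myrank, nprocs, nsteps, buf)
  ! Sketch only: each of nprocs ranks writes one contiguous ~9 MB slab
  ! of a large 1-D double-precision dataset, repeated nsteps times.
  USE HDF5
  IMPLICIT NONE
  INTEGER(HID_T), INTENT(IN)   :: dset_id, filespace, xfer_prp
  INTEGER, INTENT(IN)          :: myrank, nprocs, nsteps
  DOUBLE PRECISION, INTENT(IN) :: buf(:)          ! ~9 MB worth of values per call
  INTEGER(HID_T)   :: memspace
  INTEGER(HSIZE_T) :: count(1), offset(1)
  INTEGER          :: step, ierr

  count(1) = INT(SIZE(buf), HSIZE_T)
  CALL h5screate_simple_f(1, count, memspace, ierr)
  DO step = 0, nsteps - 1
     ! Contiguous slabs: rank "myrank" writes its 9 MB range for this iteration.
     offset(1) = (INT(step, HSIZE_T) * nprocs + myrank) * count(1)
     CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, ierr)
     CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, count, ierr, &
                     mem_space_id=memspace, file_space_id=filespace, &
                     xfer_prp=xfer_prp)
  END DO
  CALL h5sclose_f(memspace, ierr)
END SUBROUTINE write_pattern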

I am attaching the code for this program and a small HDF5 file to illustrate
how it works. I would appreciate it if someone could point out what I am doing wrong.

HDFIO.zip (976 KB)

···

--
Regards,
Nikhil

9 MB of data per process should be just fine (well, not knowing
anything about your parallel file system or network, but generally
anything over 1MB is good -- more is always better, though :> )

One thing that immediately jumps out at me is that you have explicitly
enabled independent I/O in both the read and write cases
(H5FD_MPIO_INDEPENDENT_F). Wouldn't it be better to use collective I/O?
I only looked at your code quickly... am I missing some aspect of your
hyperslab layout that would make independent I/O better than collective?
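
For reference, switching the transfer property list over to collective would
look roughly like this (just a sketch with assumed names, not lifted from your code):

SUBROUTINE make_collective_xfer(xfer_prp)
  USE HDF5
  IMPLICIT NONE
  INTEGER(HID_T), INTENT(OUT) :: xfer_prp
  INTEGER :: ierr

  CALL h5pcreate_f(H5P_DATASET_XFER_F, xfer_prp, ierr)
  ! Use H5FD_MPIO_COLLECTIVE_F here instead of H5FD_MPIO_INDEPENDENT_F,
  ! then pass xfer_prp to h5dread_f / h5dwrite_f as before.
  CALL h5pset_dxpl_mpio_f(xfer_prp, H5FD_MPIO_COLLECTIVE_F, ierr)
END SUBROUTINE make_collective_xfer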

Is that enough to explain a 10x performance difference between reads
and writes? Maybe... depends on your file system. What FS are you
using? Also, what MPI implementation is this?

==rob

···

On Mon, Nov 03, 2008 at 02:05:50PM -0600, Nikhil Laghave wrote:

On measuring the timings, I am getting some odd results. For a ~2 GB
file, the read time is usually quite fast (with some outliers), but the
write times are quite large: for example, read time = 4 s and write
time = 40 s (approximately). The wrappers I have written for the reads
and writes are very similar, and I don't see much implementation
difference between them. Could someone suggest what I may be doing wrong?

One more thing: although the file size is around 2 GB, a single
read/write operation only involves around 9 MB of data. For example,
15 processors each write 9 MB, and this runs in a loop 15 times
(15 * 15 * 9 MB ≈ 2 GB). Could this small transfer size be causing the
slow speed?

--
Rob Latham
Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA B29D F333 664A 4280 315B


Hi,

Well, the only reason I am using independent I/O is that collective I/O was
performing slower than independent I/O. I don't think there is anything
specific to the data layout that should make independent I/O faster, but
still, to my surprise, it was.

I think the implementation is quite simple, and I expected better results with
collective I/O. (I am basically writing out a big 1-D array of eigenvectors, so
the hyperslab selection was straightforward.) The timings were measured on the
Franklin cluster at NERSC.

http://www.nersc.gov/nusers/resources/franklin/

The file system on Franklin is Lustre, and the MPI implementation is MPICH2.

Thank you.

Nikhil


--
Regards,
Nikhil



Hi,

Well, the only reason I am using independent I/O is that collective I/O was
performing slower than independent I/O. I don't think there is anything
specific to the data layout that should make independent I/O faster, but
still, to my surprise, it was.

I think the implementation is quite simple, and I expected better results with
collective I/O. (I am basically writing out a big 1-D array of eigenvectors, so
the hyperslab selection was straightforward.) The timings were measured on the
Franklin cluster at NERSC.

Oh! I didn't notice it was 1D. Then yes, at this scale (15
processors) and this access pattern (essentially contiguous),
collective I/O would only introduce overhead with no benefit. You
could set some hints to tune this, but they won't get to the real
issue here:

http://www.nersc.gov/nusers/resources/franklin/

The file system on Franklin is Lustre, and the MPI implementation is MPICH2.

Ok, the somewhat-detailed technical answer follows, but the short
answer is that parallel reads from a single Lustre file are much
faster than parallel writes. (I consider this a defect in the
MPI-IO/Lustre interface -- one which groups are working to address,
fortunately, but it will take some time for those efforts to make
their way onto Franklin, I'm afraid.)

Here's one workaround that you can do in your application. Do you know
how to set MPI-IO hints through HDF5? One thing you can do to speed
up writes is to turn on collective I/O but then force all I/O through
a single processor. Do so by setting "cb_nodes" to "1" (the string
"1").

So, what's going on with your code? Here's that more-detailed
answer:

In the read case, the data does not change, so all 15 processes can
read at the same time and Lustre will not attempt to serialize those
operations.

In the write case, however, the first process to reach its write will
acquire a lock on the entire file.

Then, when the second process hits its write, it forces the first
process to relinquish most of its lock, and process 2 then takes *its*
lock.

This goes on and on across all N processes: a writer comes in, forces a
lock revocation, and then acquires a lock of its own -- all very costly
operations.

There's not much the HDF5 library can do in this case. This is a file
system defect -- one that the MPI-IO library can address, but not one
that HDF5 can fix very well.

I would suggest contacting the NERSC support staff about this issue.
They are good people and know more about how to coax performance out
of Lustre than I do.

Sorry I don't have better news for you, but I bet the HDF5 guys are
happy I'm giving them a pass on this :>

==rob

···

On Mon, Nov 03, 2008 at 05:11:38PM -0600, Nikhil Laghave wrote:

--
Rob Latham
Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA B29D F333 664A 4280 315B

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.