Measuring timings

Hello,

My question might be very silly but I am quite confused and not sure how to
measure the timing of parallel writes/reads.

The situation is this: I have a 10GB HDF5 file written by 100 processors in
parallel in PHDF5. The data in the file is a contiguous 1D array and I am
using Independent IO. These are the 2 approaches I have used and the results
are quite different.

1. The time when the last processor finishes writing minus the time when
the first processor starts writing.

2. Each processor measures the write time independently as:

start = MPI_WTIME()
phdfwrite(<data to be written>)
end = MPI_WTIME() - start

This is followed by an Allreduce operation to find the maximum of 'end' across
all processors, and that maximum is reported as the write time.
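
(In rough C terms this is the pattern below -- just a sketch; the dataset,
dataspace, transfer-property-list and buffer names are placeholders for
whatever the real code sets up.)

#include <mpi.h>
#include <hdf5.h>

/* Sketch: each rank times only its own write; the slowest rank's
   elapsed time is then taken as the effective write time.        */
double timed_write(hid_t dset, hid_t memspace, hid_t filespace,
                   hid_t dxpl, const double *buf, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
    double elapsed = MPI_Wtime() - t0;

    double t_max;
    MPI_Allreduce(&elapsed, &t_max, 1, MPI_DOUBLE, MPI_MAX, comm);
    return t_max;   /* aggregate bandwidth = total bytes written / t_max */
}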

The timings using the 2 methods are:
Method 1:
HDF5 Read Time is: 3.52931666374207
HDF5 Write Time is: 271.7706000804901

Method 2:
Coll HDF5 Read Time is: 124.8922524452209
Coll HDF5 Write Time is: 393.1416127681732

Which one of these methods is correct? Or is there some other way to
accurately measure the timings? Would profiling be a better option?

Regards,
Nikhil

···

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hello,

My question might be very silly but I am quite confused and not sure how to
measure the timing of parallel writes/reads.

Not a silly question at all. When it comes to correctly measuring
aggregate I/O performance, lots of benchmarks get it wrong.

The situation is this: I have a 10GB HDF5 file written by 100 processors in
parallel in PHDF5. The data in the file is a contiguous 1D array and I am
using Independent IO. These are the 2 approaches I have used and the results
are quite different.

1. The time when the last processor finishes writing minus the time when
the first processor starts writing.

I'm not a huge fan of this approach. How do you know 'first' and
'last'? By MPI rank? There's no guarantee that rank 0 will start
first or rank N will finish last.

Maybe you do an allgather to find the MIN of start_time and an
allgather to find the MAX of end_time? Well, then you've got the
equivalent of the next approach (except you incur more overhead).
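
(For the record, a correct version of approach 1 would look roughly like the
sketch below -- variable names are illustrative -- and it is just approach 2
with an extra reduction. Keep in mind, too, that MPI_Wtime() values are not
guaranteed to be comparable across ranks unless the implementation sets
MPI_WTIME_IS_GLOBAL.)

double t_start = MPI_Wtime();
/* ... each rank performs its write here ... */
double t_end = MPI_Wtime();

/* "last finish minus first start", computed with two reductions */
double first_start, last_end;
MPI_Allreduce(&t_start, &first_start, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
MPI_Allreduce(&t_end,   &last_end,    1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
double span = last_end - first_start;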

2. Each processor measures the write time independently as:

start = MPI_WTIME()
phdfwrite(<data to be written>)
end = MPI_WTIME() - start

This is followed by an Allreduce operation to find the maximum of
'end' across all processors, and that maximum is reported as the write
time.

The IOR benchmark and our own 'mpi-io-test' benchmark essentially take
this second approach. Why? In HPC, the I/O phase is either to read
in a data set or write out a checkpoint. In both cases the
application waits for I/O to finish, so there's little to be gained by
knowing that one process ripped through the I/O super-fast while
everyone else was pokey.

Now, I will say that mpi-io-test goes one better: it reports not just
a single I/O time, but also the min, max, average, and standard
deviation across the N processors. This is super-helpful, because on
the PVFS file system (where there are no locks) if all goes well the
min, max, and avg are all pretty close, and the stddev is pretty low.

If not, then something is likely wrong with the network, or maybe one
of the servers has a degraded RAID array.

(I suspect the numbers look quite a bit different for MPI-IO to
Lustre, where that file system's locking infrastructure gets in the
way of well-coordinated parallel I/O, and would not at all be
surprised by a high standard deviation).
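
A rough sketch of how those per-rank statistics can be gathered (this is not
mpi-io-test's actual code, just the general idea; t_local is each rank's own
elapsed read or write time):

#include <math.h>
#include <stdio.h>
#include <mpi.h>

/* Report min/max/mean/stddev of per-rank I/O times on rank 0. */
void report_io_stats(double t_local, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double t_min, t_max, t_sum, t_sumsq, sq = t_local * t_local;
    MPI_Reduce(&t_local, &t_min,   1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&t_local, &t_max,   1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&t_local, &t_sum,   1, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Reduce(&sq,      &t_sumsq, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    if (rank == 0) {
        double mean = t_sum / nprocs;
        double var  = t_sumsq / nprocs - mean * mean;
        printf("I/O time: min=%g max=%g mean=%g stddev=%g\n",
               t_min, t_max, mean, sqrt(var > 0.0 ? var : 0.0));
    }
}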

The timings using the 2 methods are:
Method 1:
HDF5 Read Time is: 3.52931666374207
HDF5 Write Time is: 271.7706000804901

Method 2:
Coll HDF5 Read Time is: 124.8922524452209
Coll HDF5 Write Time is: 393.1416127681732

Which one of these methods is correct? Or is there some other way to
accurately measure the timings? Would profiling be a better option?

Well, they are both "right" as long as you are clear about what you are
measuring, but I think method #2 gives more meaningful results.

Actually, comparing the two is quite interesting, too, though to
really understand what's going on with that low read time you'd have
to use something like MPE and Jumpshot to see what happened.

OK, so the short answer is to time everyone's I/O and take the
maximum; but when it comes to benchmarks, more information is better,
so if it's not a big pain to also report the min and avg, do that as
well.

Oh, and Nikhil and I know why the write time is so much higher than the
read time, but here's the quick lowdown for everyone else. It has to do with the
poor Lustre support in the MPI-IO library installed on Franklin --
something I and others are working to address in the very near future.

Excellent question!
==rob

···

On Sat, Dec 06, 2008 at 05:22:47PM -0600, Nikhil Laghave wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
A215 0178 EA2D B059 8CDF B29D F333 664A 4280 315B


Hi Robert,

Thanks for the detailed reply. It was a great help.

One thing I noticed is that the method 1 I mentioned is actually an
overestimate of the timings: the read+write time it gives is greater
than the total runtime of the program.

Method 2, on the other hand, reflects both timings correctly (i.e.,
read+write time = total runtime).

Also, I have written a few programs that simulate the I/O of our nuclear
physics code. I can send them to you, since you guys are interested in
studying various I/O kernels.

Regards,
Nikhil
