Questions regarding Parallel HDF5 vs Seq. Binary

Hello All,

I have a general question about comparing sequential binary I/O with parallel HDF5 when writing large files.

I am using the Franklin supercomputer at NERSC for my experiments. The dataset/file sizes are between 55 GB and 111 GB. In the sequential binary case, several (~200) processors send their data to a single root processor, which writes it to disk, so only one processor performs the I/O.

In the parallel HDF5 case, all ~200 processors write to disk independently, without communicating with a root processor.
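
For concreteness, my parallel HDF5 path is opened roughly like this (a simplified sketch, not my actual code; the file name is a placeholder):

```c
#include <hdf5.h>
#include <mpi.h>

/* Simplified sketch: every rank opens the same file through the MPI-IO
 * driver, so no data is routed through a root processor. */
hid_t open_parallel(const char *name, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}
```

Each rank then writes its own portion of the dataset with the default (independent) transfer settings.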

However, on the Lustre file system, file locks effectively serialize all ~200 write operations.

Now, when I compare the performance of sequential binary vs. parallel HDF5, the only difference is the communication overhead in the sequential binary case, which my measurements show to be small. Since both kinds of writes (sequential binary and parallel HDF5) end up serialized, I expected similar performance. However, in my experiments parallel HDF5 outperforms sequential binary significantly, and I do not understand why, given that even the parallel HDF5 writes are serialized. The attached plot illustrates my question.

Can someone please explain why parallel HDF5 outperforms sequential binary writes even though the parallel HDF5 writes are also serialized? Your input is greatly appreciated. Thank you.

Nikhil

writeanalysis1.pdf (31.3 KB)

Hi Nikhil,

I am in the NERSC Analytics group and have done extensive benchmarking
and testing of I/O on Franklin. We have been working in collaboration
with the HDF Group for almost a year now to improve parallel I/O
performance on Lustre file systems, with Franklin as one of our
primary test machines.

The root-only write scenario you describe will always lead to
serialization, because you have only one compute node communicating
with the I/O servers (called "OSTs" in Lustre terminology).

In your parallel scenario (which is called "independent" parallel I/O,
as opposed to "collective" parallel I/O which I will describe in a
bit), you are probably experiencing serialization because you are
using the default Lustre striping on Franklin, which uses only 2
stripes. This means that all 200 of your processors are communicating
with only 2 OSTs, out of 48 available. You can find more about Lustre
striping from this page:

http://www.nersc.gov/nusers/systems/franklin/io.php

If you increase the stripe count using

stripe_large myOutputDir/ (sets the striping on the directory and any
new files created in it)

or

stripe_medium specificFile.h5 (this touches the file before your
program runs, but needs to be done for each output file)

you will use all 48 OSTs and should see improved performance in
parallel mode. From your plot, it looks like you are getting around
500-1100 MB/s write bandwidth out of the ~12 GB/s peak available on
Franklin.
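
If it is more convenient to do this from inside the program, the same striping parameters can be requested at file-creation time through MPI-IO hints. A rough sketch follows; the hint names ("striping_factor", "striping_unit") are the ones recognized by ROMIO-based MPI-IO implementations (Cray's included, as far as I know), and they only take effect when the file is newly created:

```c
#include <hdf5.h>
#include <mpi.h>

/* Sketch: request 48 stripes of 4 MB at file-creation time via MPI-IO
 * hints, instead of pre-striping the file or directory from the shell. */
hid_t create_striped_file(const char *name, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");     /* stripe count */
    MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MB stripe size */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}
```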

A further optimization that may help is to enable "collective" mode,
which creates a one-to-one mapping between a subset of your processors
and the OSTs, and involves a communication step similar to the one you
implemented for the root-only scenario. The other processors send
their data to the subset, and the subset writes the data to disk (this
is called "two-phase I/O" or "collective buffering"). The additional
coordination achieved by collective I/O can improve performance for
many I/O patterns. You can find more details about this in the NERSC
parallel I/O tutorial:

http://www.nersc.gov/nusers/help/tutorials/io/

including some code snippets for how to set this up in HDF5. It also
summarizes some of the improvements we have been working on, which
will soon be rolled into the public release of the HDF5 library.
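
As a quick preview of what the tutorial covers, switching a write from independent to collective mode only requires changing the dataset-transfer property list. This is a minimal sketch, with the dataset, dataspaces, and buffer assumed to come from your existing write code:

```c
#include <hdf5.h>

/* Sketch: same write as before, but with a collective transfer property
 * list so the MPI-IO layer can apply two-phase I/O / collective buffering. */
static herr_t write_collective(hid_t dset, hid_t memspace, hid_t filespace,
                               const double *buf)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                             dxpl, buf);
    H5Pclose(dxpl);
    return status;
}
```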

Let me know if you have more questions, or want to continue this
discussion offline. I would be glad to talk with you further or to
help you modify your code or run more I/O tests.

Mark

Hi Mark,

Thanks for your reply.

Actually, my data layout is contiguous, so there are no optimizations to be gained from collective I/O. Basically, I am writing a 1D array to disk that is distributed among many processors, with each processor holding a contiguous block. I will try your striping suggestions to see how much the performance improves.
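
For reference, the write itself looks roughly like this (simplified; the dataset name and sizes are placeholders):

```c
#include <hdf5.h>

/* Simplified sketch: a 1D dataset of nglobal doubles, with each rank
 * owning a contiguous block of nlocal values starting at 'offset'.
 * The default transfer property list means the write is independent. */
void write_block(hid_t file, const double *local,
                 hsize_t nglobal, hsize_t nlocal, hsize_t offset)
{
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

    /* select this rank's contiguous region of the file dataspace */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                        &nlocal, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, local);

    H5Sclose(memspace);
    H5Dclose(dset);
    H5Sclose(filespace);
}
```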

One question I had was about locking in the Lustre file system. As far as I know, Lustre places write locks on each writing processor, which serializes the parallel write operations. Has this changed, or can I actually have simultaneous parallel writes to a single file? Thank you.

Nikhil

Hi Nikhil,

Lustre locks files on a per-stripe basis. The default striping on
Franklin is a stripe count of 2 and a stripe size of 4 MB, so a file is
broken into 4 MB regions that alternate between two OSTs. If two or
more processors try to write into the same 4 MB region, Lustre invokes
a lock manager to serialize access, which can be very costly.

You can also experience "self-contention" in independent mode. If you
have 200 processors writing to 48 OSTs, each OST is servicing several
of your processors. Maybe this is what you mean by "write locks on
each writer processor"?

Are you writing out the same amount of contiguous data from each
processor? If so, you may want to continue to use independent mode in
combination with the 'chunking' and 'alignment' features of HDF5
(described in the NERSC I/O tutorial). This will allow you to
guarantee that each processor writes to an offset in the shared file
that is a multiple of the stripe size, so that there is no lock
contention. You can also go a step further and set the stripe size to
the size of the contiguous data you are writing. This effectively
creates a file-per-processor write pattern: each processor writes to a
disjoint region of the shared file and to only one stripe/OST.
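
Here is a rough sketch of that alignment/chunking setup, assuming a 4 MB stripe size and a 1D dataset of doubles (the tutorial covers the details; the alignment threshold and chunk size below are just reasonable starting points, not required values):

```c
#include <hdf5.h>
#include <mpi.h>

#define STRIPE_SIZE (4 * 1024 * 1024)   /* assumed Lustre stripe size: 4 MB */

/* Sketch: align any file allocation of 1 MB or more to a stripe boundary. */
hid_t create_aligned_file(const char *name, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    H5Pset_alignment(fapl, 1024 * 1024, STRIPE_SIZE);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

/* Sketch: chunk the 1D dataset so each chunk holds one stripe's worth of
 * doubles; combined with the alignment above, each chunk stays within a
 * single stripe/OST. */
hid_t create_chunked_dataset(hid_t file, hsize_t nglobal)
{
    hsize_t chunk = STRIPE_SIZE / sizeof(double);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    hid_t space = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```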

If the processors are writing different amounts of data, you probably
want to use collective I/O. Even though you are right that the
greatest benefits of collective buffering are for non-contiguous data,
the collective buffering algorithm in the Cray MPI-IO library is now
Lustre-aware and will break your I/O pattern up into stripe-aligned
writes, and the number of writers will be set to the stripe count.
Again, this will guarantee no lock contention and will set up a pattern
that resembles file-per-processor from the OSTs' point of view.

Mark
