Hi Nikhil,
I am in the NERSC Analytics group and have done extensive benchmarking
and testing of I/O on Franklin. We have been working in collaboration
with the HDF Group for almost a year now to improve parallel I/O
performance on Lustre file systems, with Franklin as one of our
primary test machines.
The root-only write scenario you describe will always lead to
serialization, because you have only one compute node communicating
with the I/O servers (the Object Storage Targets, or "OSTs", in Lustre
terminology).
In your parallel scenario (which is called "independent" parallel I/O,
as opposed to "collective" parallel I/O which I will describe in a
bit), you are probably experiencing serialization because you are
using the default Lustre striping on Franklin, which uses only 2
stripes. This means that all 200 of your processors are communicating
with only 2 OSTs out of the 48 available. You can read more about Lustre
striping on this page:
http://www.nersc.gov/nusers/systems/franklin/io.php
If you increase the stripe count using
stripe_large myOutputDir/ (sets the striping on the directory and any
new files created in it)
or
stripe_medium specificFile.h5 (this touches the file before your
program runs, but needs to be done for each output file)
you will use all 48 OSTs and should see improved performance in
parallel mode. From your plot, it looks like you are getting around
500-1100 MB/s write bandwidth out of the ~12 GB/s peak available on
Franklin.
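In case it helps, the NERSC wrapper scripts are convenience front-ends for
the standard Lustre `lfs setstripe` command, so you can also set the layout
directly. A rough sketch (the directory name is just a placeholder, and the
wrappers may pick different stripe sizes than shown here):

```shell
# Stripe a directory across all available OSTs (-c -1 means "all OSTs");
# new files created inside the directory inherit this layout.
lfs setstripe -c -1 myOutputDir/

# Verify the resulting striping
lfs getstripe myOutputDir/
```

Remember that striping set on a directory only affects files created after
the change, not files that already exist.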
A further optimization that may help is to enable "collective" mode,
which creates a one-to-one mapping between a subset of your processors
and the OSTs, and involves a communication step similar to the one you
implemented for the root-only scenario. The other processors send
their data to the subset, and the subset writes the data to disk (this
is called "two-phase I/O" or "collective buffering"). The additional
coordination achieved by collective I/O can improve performance for
many I/O patterns. You can find more details about this in the NERSC
parallel I/O tutorial:
http://www.nersc.gov/nusers/help/tutorials/io/
including some code snippets for how to set this up in HDF5. It also
summarizes some of the improvements we have been working on, which
will soon be rolled into the public release of the HDF5 library.
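For quick reference, the core of this in HDF5 is a dataset-transfer property
list with the MPI-IO collective flag set, and the file itself must have been
opened with the MPI-IO file driver. A minimal C sketch (function and variable
names are placeholders, not from your code, and the dataset/dataspace setup
is elided; the tutorial above has complete examples):

```c
#include <hdf5.h>
#include <mpi.h>

/* Create an HDF5 file for parallel access using the MPI-IO driver. */
hid_t create_parallel_file(const char *name, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

/* Write a dataset in collective (two-phase) mode. All ranks must make
 * this H5Dwrite call together when collective transfer is requested. */
void write_collective(hid_t dset, hid_t memspace, hid_t filespace,
                      const double *buf)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
    H5Pclose(dxpl);
}
```

Switching H5FD_MPIO_COLLECTIVE to H5FD_MPIO_INDEPENDENT gives you the
independent mode you are using now, so it is easy to compare the two.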
Let me know if you have more questions, or want to continue this
discussion offline. I would be glad to talk with you further or to
help you modify your code or run more I/O tests.
Mark
···
On Thu, Jul 22, 2010 at 12:42 AM, Nikhil Laghave <nikhil.laghave@gmail.com> wrote:
Hello All,
I have a generic question regarding the comparison of seq. binary and parallel HDF5 for I/O of large files.
I am using the Franklin supercomputer at NERSC for my experiments. The dataset/file sizes are between 55GB and 111GB and are being written by a single processor in the case of seq. binary. In this case, several (~200) processors send the data to a single root processor, which does the I/O to disk. So, basically only 1 processor is doing the I/O to disk.
In the case of parallel HDF5, all ~200 processors do the I/O to disk independently, without communicating with the root processor.
However, on the Lustre file system, there are file locks that cause all ~200 write operations to be serialized in practice.
Now when I compare the performance of seq. binary vs. parallel HDF5, the only difference is that in the case of seq. binary there is communication overhead, which according to my measurements is not significant. In that case, since both writes (seq. binary & parallel HDF5) are sequential/serialized, I expected the performance to be similar. However, in my experiments, parallel HDF5 outperforms seq. binary significantly. I do not understand why this is so, since even the parallel HDF5 write operations are serialized. The attached plot illustrates my doubt.
Can someone please explain to me why parallel HDF5 outperforms seq. binary writes even though the parallel HDF5 writes are also serialized? Your input is greatly appreciated. Thank you.
Nikhil
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org