Very poor performance of pHDF5 when using single (shared) file

I've run a benchmark where, within an MPI program, each process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file. I used the following writing strategies:

1) each process writes to its own file,
2) each process writes to the same file, each to its own dataset,
3) each process writes to the same file, all to the same (shared) dataset.

I've tested 1)-3) with both fixed-size and chunked datasets (chunk size 1024), and I've tested 2)-3) with both the independent and collective options of the MPI-IO driver. I also ran the measurements on 3 different clusters (all quite modern).
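
For concreteness, the shared-file/shared-dataset case 3) is written roughly like this; it's a trimmed sketch of the benchmark (one dataset instead of three, made-up names and sizes), not the exact code:

```c
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* each process contributes a contiguous block of n doubles */
    const hsize_t n = 1024 * 1024;
    double *buf = malloc(n * sizeof(double));
    for (hsize_t i = 0; i < n; ++i)
        buf[i] = (double)rank;

    /* one shared file, opened with the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one dataset shared by all processes; rank r owns the slab [r*n, (r+1)*n) */
    hsize_t dims = (hsize_t)nprocs * n;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t offset = (hsize_t)rank * n;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &n, NULL);
    hid_t memspace = H5Screate_simple(1, &n, NULL);

    /* independent vs. collective is selected on the transfer property list */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* or H5FD_MPIO_INDEPENDENT */

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}
```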

As a result, the write (storage) times of the same-file strategies, i.e. 2) and 3), were orders of magnitude longer than those of the separate-files strategy. For illustration:

cluster #1, 512 MPI processes, each process stores 100 MB of data, fixed data sets:

1) separate files: 2.73 [s]
2) single file, independent calls, separate data sets: 88.54 [s]

cluster #2, 256 MPI processes, each process stores 100 MB of data, chunked data sets (chunk size 1024):

1) separate files: 10.40 [s]
2) single file, independent calls, shared data sets: 295 [s]
3) single file, collective calls, shared data sets: 3275 [s]

Any idea why the single-file strategy gives such poor write performance?

Daniel

Hi Daniel,

Yes, the numbers look very bad, but from your use-case description I'm pretty sure something can be done to improve the performance.

What is the file system you are working with?
Did you make sure to set the stripe size and count of your file to something larger than the default (usually the default stripe count is something small, like 1 or 2)? Given the amount of data you are writing, I would try something like the maximum stripe count first, and then tune from there. The stripe size can also be increased beyond the default.
The above usually solves 90% of users' problems with bad performance when accessing a single shared file.
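
If you would rather set the striping from inside the program than with lfs setstripe on the output directory, one option is to pass it as MPI-Info hints when the file is created. "striping_factor" and "striping_unit" are the ROMIO/Cray hint names, and whether they are honored depends on the MPI-IO implementation, so treat this as a sketch to experiment with:

```c
/* Sketch: request Lustre striping through MPI-IO hints at file creation.
 * The hint names below are the ones understood by ROMIO/Cray MPI-IO;
 * other implementations may silently ignore them, and striping can only
 * be changed when the file is first created. Assumes the usual pHDF5
 * setup (MPI initialized, mpi.h/hdf5.h included). */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "64");     /* stripe count */
MPI_Info_set(info, "striping_unit", "4194304");  /* stripe size in bytes (4 MiB) */

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

/* ... create datasets and write as in your benchmark ... */

H5Fclose(file);
H5Pclose(fapl);
MPI_Info_free(&info);
```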

If the above does not help or you have done that already, then I'll need to take a deeper look at how your processes are accessing the datasets.
It sounds like each process is writing a reasonably large piece of data, so you should not be seeing performance this poor; that said, the separate-files approach is always going to be the fastest strategy (though not the most convenient).

Thanks,
Mohamad

Hi Daniel,

You did not say what parallel file system you used in 2) and 3). The
performance of the parallel file system is important in these cases.
E.g., if your PFS is not truly scalable, access by 512 processes
could drop the I/O speed to 1/500 of the single-process access speed
(they all compete for a write token, for example).

Another issue: if your cluster nodes run Linux, 100 MB of data is way
too small to draw any conclusions. The Linux kernel caches as much I/O
as it can. E.g., if each node has 4 GB of memory and is not too busy,
it can use 3+ GB of that for I/O buffering. Any write smaller than 3 GB
is then merely copying data from user memory to kernel memory, so the
"I/O speed" you measure is memory-to-memory speed, not truly
memory-to-disk speed. One way to confirm this is to write at least
2 times the total memory of the node; compare that write speed
with the 100 MB speed and you should see a big drop.
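
Also make sure the timed interval includes the file close and a barrier; otherwise you may stop the clock while data is still sitting in HDF5 or MPI-IO buffers. A rough sketch ("file" and "rank" are assumed to come from the surrounding benchmark code, with the usual includes):

```c
/* Rough timing sketch around the write phase of the benchmark. */
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();

/* ... the H5Dwrite calls being measured ... */

H5Fclose(file);               /* flush HDF5/MPI-IO buffers before stopping the clock */
MPI_Barrier(MPI_COMM_WORLD);  /* wait for the slowest writer */
double t1 = MPI_Wtime();

double local = t1 - t0, tmax;
MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("write + close time: %.2f s\n", tmax);

/* Even this can under-report: writes that fit in the OS page cache may
 * still be buffered, which is why writing more than the node memory is
 * the safer test. */
```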

I would suggest you build and use the performance measurement
tool, perform/h5perf, from the HDF5 source. h5perf measures all three
I/O speeds: POSIX, MPI-IO, and pHDF5. It gives you
a better understanding of what your parallel file system can deliver.

Hope this helps.

-Albert Cheng
THG staff

Hi Albert and Mohamad,

I haven't received e-mails with your replies :( so I cannot reply to them specifically (or do not know how to). So I'm replying to my original post...

@Albert:

All the clusters I used run the Lustre file system, and I believe this file system should be scalable, at least to some extent. It is apparently scalable for the file-per-process strategy.

I understand the note about memory-to-kernel writes. However, again, I am comparing the single-file and multiple-file strategies, and they give quite different results. Moreover, in my measurements the multiple-file case matches the listed peak I/O bandwidth of the storage systems, while the single-file case is much, much worse. So this is clearly not limited by the memory-to-kernel copying.

Thanks for the pointer to h5perf, I will try it.

@Mohamad:

Thanks for the hint. All the file systems are Lustre-based, indeed with the default stripe count of 1. I will rerun my measurements with different stripe sizes/counts and post the results.

Daniel

Mohamad,

I really do not understand how to reply to this forum :(. I tried to reply to your post, which I received via e-mail. In this e-mail, there was the following note:

"
If you reply to this email, your message will be added to the discussion below:
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
"

So, I replied to this e-mail, and received another one:

"
Delivery to the following recipient failed permanently:
ml-node+s184993n4026449h5@n3.nabble.com

Your email to ml-node+s184993n4026449h5@n3.nabble.com has been rejected because you are not allowed to post to http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html . Please contact the owner about permissions or visit the Nabble Support forum.
"

What the hell... why does it say I should reply and then that I am not allowed to post to my own thread???

Anyway, I tried to post the following information:

I did some experiments yesterday on the Blue Waters cluster. The stripe count is limited there to 160. For runs with 256 MPI processes/cores and fixed datasets, the writing times were:

separate files: 1.36 [s]
single file, 1 stripe: 133.6 [s]
single file, best result: 17.2 [s]

(I did multiple runs with various combinations of stripe count and size; I'm presenting the best results I obtained.)

Increasing the number of stripes obviously helped a lot, but compared with the separate-files strategy, writing is still more than ten times slower. Do you think this is "normal"?

Might chunking help here?

Thanks,
Daniel

Babak,

I am very interested in the toolkit you wrote about. Is it installed publicly on Blue Waters? What is its name?

Thanks, Daniel

Hi Mohamad,

I did some experiments on the Blue Waters cluster. The stripe count is limited there to 160. For runs with 256 MPI processes/cores and fixed datasets, the writing times were:

separate files: 1.36 [s]
single file, 1 stripe: 133.6 [s]
single file, best result: 17.2 [s]

(I did multiple runs with various combinations of stripe count and size; I'm presenting the best results I obtained.)

Increasing the number of stripes obviously helped a lot, but compared with the separate-files strategy, writing is still more than ten times slower. Do you think this is "normal"?

Thanks,
Daniel

Hi,

With a stripe count of 1, all your accesses to a single file go
through one OSS. Contrary to the multiple-file case, you won't use the
whole system's bandwidth, so the poor performance is to be
expected.

From what I gather, you should write on your FS with chunks of "stripe
size", aligned on "stripe size" boundaries, from "stripe count" processes
to have the maximum performance.
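
In HDF5 terms that means something like the following; a rough sketch, not tested, and the 4 MiB value is only an example that should match the stripe size actually set on the file ("filespace" is the dataspace from the earlier sketch):

```c
/* Sketch: align HDF5 allocations and size the chunks to the Lustre stripe,
 * so that each process mostly writes whole stripes instead of sharing them. */
const hsize_t stripe_bytes = 4 * 1024 * 1024;        /* example value */

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
H5Pset_alignment(fapl, stripe_bytes, stripe_bytes);  /* align objects >= 4 MiB on 4 MiB boundaries */

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_elems = stripe_bytes / sizeof(double); /* one chunk = one stripe of doubles */
H5Pset_chunk(dcpl, 1, &chunk_elems);

hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                        H5P_DEFAULT, dcpl, H5P_DEFAULT);
```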

Cheers,

Matthieu

Hi Daniel,

> I haven't received e-mails with your replies :( so I cannot reply to them
> specifically (or do not know how to). So I'm replying to my original post...

I'm not sure what you mean; I saw my email go through to the Forum list, but you could not reply to it?

> I understand the note about memory-to-kernel writes. However, again, I am
> comparing the single-file and multiple-file strategies, and they give quite
> different results. Moreover, in my measurements the multiple-file case
> matches the listed peak I/O bandwidth of the storage systems, while the
> single-file case is much, much worse. So this is clearly not limited by the
> memory-to-kernel copying.

As long as each process writes the same amount of data in the 3 cases you mentioned (which, as I understand it, is the case here), I do not suspect that the in-kernel memory copy is causing the huge gap in performance.

> Thanks for the pointer to h5perf, I will try it.
>
> @Mohamad:
>
> Thanks for the hint. All the file systems are Lustre-based, indeed with the
> default stripe count of 1. I will rerun my measurements with different
> stripe sizes/counts and post the results.

Yes, with a stripe count of 1 you will definitely be slowed down by locking/contention issues, so it's not a fair comparison against the multiple-file case, where each file might/will land on a different OSS.

Increasing the stripe count/size should get you much better performance (I bet).

I don't know what machine you are using, but usually every setup has a recommended stripe size/count for the file sizes of your application. Those recommendations are usually posted somewhere in the user guide for that machine, if one exists.

Thanks,

Mohamad

On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> separate files: 1.36 [s]
> single file, 1 stripe: 133.6 [s]
> single file, best result: 17.2 [s]
>
> (I did multiple runs with various combinations of stripe count and
> size; I'm presenting the best results I obtained.)
>
> Increasing the number of stripes obviously helped a lot, but
> compared with the separate-files strategy, writing is still more
> than ten times slower. Do you think this is "normal"?

It might be "normal" for Lustre, but it's not good. I wish I had
more experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
The ADIOS folks report tuned-HDF5 to a single shared file runs about
60% slower than ADIOS to multiple files, not 10x slower, so it seems
there is room for improvement.

I've asked them about the kinds of things "tuned HDF5" entails, and
they didn't know (!).

There are quite a few settings documented in the intro_mpi(3) man
page. MPICH_MPIIO_CB_ALIGN will probably be the most important thing
you can try. I'm sorry to report that in my limited experience, the
documentation and reality are sometimes out of sync, especially with
respect to which settings are default or not.
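
If you would rather poke at this from inside the application than through environment variables, the collective-buffering knobs can also be passed as MPI-IO hints on the file-access property list. These are the generic ROMIO hint names; I won't promise how Cray's MPI-IO maps them onto its MPICH_MPIIO_* settings, so treat it as a sketch:

```c
/* Sketch: pass collective-buffering hints to MPI-IO through the HDF5
 * file-access property list. The hint names are generic ROMIO ones;
 * how (or whether) a given MPI-IO implementation honors them is
 * implementation-specific. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_cb_write", "enable");    /* force collective buffering on writes */
MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MiB aggregation buffer */

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
/* ... collective H5Dwrite calls (H5FD_MPIO_COLLECTIVE on the transfer plist) ... */
H5Fclose(file);
H5Pclose(fapl);
MPI_Info_free(&info);
```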

==rob

Rob, Daniel, et al.,

I have started looking into this HDF5 performance issue on Cray systems.
I am looking "under the covers" to determine where the bottlenecks are
and what might be done about them. Here are some preliminary comments.

* Shared file performance will almost always be slower than file per
process performance because of the need for extent locking by the file
system.

* Shared file I/O using MPI I/O with collective buffering can usually
achieve better than 50% of file per process if the file accesses are
contiguous after aggregation.

* What I am seeing so far analyzing simple IOR runs, comparing the MPIIO
interface and the HDF5 interface, is that the MPI_File_set_size() call
and the metadata writes done by HDF5 are both taking a lot of extra time.

* MPI_File_set_size eventually calls ftruncate(), which has been reported
to take a long time on Lustre file systems.

* The metadata is written by multiple processes in small records to the
same regions of the file. Some metadata always goes to the beginning of
the file but some is written to other parts of the file. Both cause a
lot of lock contention, which slows performance.

I still need to verify what I think I am seeing. I don't know yet what
can be done about either of these. But 5x or 10x slowdown is not
acceptable.

David

Rob,

thanks a lot for the hints. I will look at the suggested option and try some experiments with it :).

Daniel

On Tue, Sep 17, 2013 at 02:41:13PM -0500, David Knaak wrote:
> Rob, Daniel, et al.,
>
> I have started looking into this HDF5 performance issue on Cray systems.
> I am looking "under the covers" to determine where the bottlenecks are
> and what might be done about them. Here are some preliminary comments.

Awesome! With David Knaak on the case this is going to get sorted out
tout de suite.

> * Shared file performance will almost always be slower than file per
> process performance because of the need for extent locking by the file
> system.

The extent locks are a problem, yes, but let's not discount the
overhead of creating N files. Now, on GPFS the story is abysmal,
and maybe Lustre creates files in /close to/ no time flat, but
creating a file does cost something.

> * Shared file I/O using MPI I/O with collective buffering can usually
> achieve better than 50% of file per process if the file accesses are
> contiguous after aggregation.

Agreed! Early (like 2008-era) Cray MPI-IO had a collective I/O
algorithm that was not well suited to Lustre. That's no longer the
case, and has not been since MPT-3.2, but those initial poor
experiences entered folklore and now "collective I/O is slow" is what
everyone thinks, even 6 years and two machines later.

> * What I am seeing so far analyzing simple IOR runs, comparing the MPIIO
> interface and the HDF5 interface, is that the MPI_File_set_size() call
> and the metadata writes done by HDF5 are both taking a lot of extra time.

Quincey will have to speak to this one, but I thought they greatly
reduced the number of MPI_File_set_size() calls in a recent release?

> * MPI_File_set_size eventually calls ftruncate(), which has been reported
> to take a long time on Lustre file systems.

The biggest problem with MPI_File_set_size and ftruncate is that
MPI_File_set_size is collective. I don't know what changes Cray's
made to ROMIO, but for a long time ROMIO's had a "call ftruncate on
one processor" optimization. David can confirm whether ADIOI_GEN_Resize or
its equivalent contains that optimization.

> * The metadata is written by multiple processes in small records to the
> same regions of the file. Some metadata always goes to the beginning of
> the file but some is written to other parts of the file. Both cause a
> lot of lock contention, which slows performance.

I've bugged the HDF5 guys about this since 2008. It's work in
progress under ExaHDF5 (I think), so there's hope that we will see a
scalable metadata approach soon.

> I still need to verify what I think I am seeing. I don't know yet what
> can be done about either of these. But 5x or 10x slowdown is not
> acceptable.

One thing that is missing is a good understanding of what the MPT
tuning knobs can and cannot do. Another thing that is missing is a
way to detangle the I/O stack: if we had some way to say "this app
spent X% in HDF5-related things, Y% in MPI-IO things, and Z% in
Lustre things", that would go a long way towards directing effort.

Have you seen the work we've done with Darshan lately? Darshan had
some bad experiences on Lustre a few years back, but Phil Carns and
Yushu Yao have really whipped it into shape for Hopper (see Phil and
Yushu's recent CUG paper). It'd be nice to have Darshan on more Cray
systems. It's been a huge asset on Argonne's Blue Gene machines.

==rob

On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:
> This morning, I did some poking around and found that the cmake-based
> configure of hdf has a nasty bug that causes H5_HAVE_GPFS to
> be set to false and no GPFS optimizations are compiled in (libgpfs
> is not detected). Having tweaked that, you can imagine my happiness
> when I recompiled everything and now I'm getting even worse
> bandwidth.

Thanks for the report on those hints. HDF5 contains, outside of
gpfs-specific benchmarks, one of the few implementations of all the
gpfs_fcntl() tuning parameters. Given your experience, probably best
to turn off those hints.

Also, cmake works on bluegene? wow. Don't forget that bluegene
requires cross compilation.

> In fact, if I enable collective IO, the app coredumps on me, so the
> situation is worse than I had feared. I'm using too much memory in
> my test, I suspect, and collectives are pushing me over the limit. The
> only test I can run with collective enabled is the one that uses
> only one rank and writes 16MB!

How many processes per node are you using on your BGQ? If you are
loading up with 64 procs per node, that will give each one about
200-230 MiB of scratch space.

I wonder if you have built some or all of your hdf5 library for the
front end nodes, and some or none for the compute nodes?

How many processes are you running here?

A month back I ran some one-rack experiments: https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compare.png

Here's my IOR config file. Note two tuning parameters here:
- "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low
  for Blue Gene /Q
- the 'bglockless' prefix is "robl's secret turbo button". it was fun
  to pull that rabbit out of the hat... for the first few years.
  (it's not the default because in one specific case performance is
  shockingly poor).

IOR START
        numTasks=65536
        repetitions=3
        reorderTasksConstant=1024
        fsync=1
        transferSize=6M
        blockSize=6M
        collective=1
        showHints=1
        hintsFileName=IOR-hints-bg_nodes_pset.64
        testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.mpi
        api=MPIIO
        RUN
        api=HDF5
        testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.h5
        RUN
        api=NCMPI
        testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.nc
        RUN
IOR STOP

> Rob: you mentioned some fcntl functions were deprecated etc. Do I
> need to remove these to stop the coredumps? (I'm very much hoping
> something has gone wrong with my tests because the performance is
> shockingly bad ...) (NB. my version is 1.8.12-snap17)

Unless you are running BGQ system software driver V1R2M1, the
gpfs_fcntl hints do not get forwarded to storage, and return an error.
It's possible HDF5 responds to that error with a core dump?

==rob

Just another voice to join the choir...

Our BGQ using GPFS was switched on last week, and my HDF5 performance has been around 10% of IOR, compared to the Cray running Lustre where we get 60% of IOR or thereabouts.

This morning, I did some poking around and found that the cmake-based configure of hdf has a nasty bug that causes H5_HAVE_GPFS to be set to false and no GPFS optimizations are compiled in (libgpfs is not detected). Having tweaked that, you can imagine my happiness when I recompiled everything and now I'm getting even worse bandwidth.

In fact, if I enable collective IO, the app coredumps on me, so the situation is worse than I had feared. I'm using too much memory in my test, I suspect, and collectives are pushing me over the limit. The only test I can run with collective enabled is the one that uses only one rank and writes 16MB!

Looks like I'm going to have to spend quite a bit more time looking at this.

If anyone else is making tweaks to the hdf5 source, please let me know as I don't want to duplicate what anyone else is doing, but I'll be happy to help track down issues.

Rob: you mentioned some fcntl functions were deprecated etc. Do I need to remove these to stop the coredumps? (I'm very much hoping something has gone wrong with my tests because the performance is shockingly bad ...) (NB. my version is 1.8.12-snap17)

JB

Rob

Thanks for the info regarding the settings, IOR config, etc. I will go through that in detail over the next few days.

I plan on taking a crash course in debugging on BG/Q ASAP; my skills in this regard are little better than printf, and I'm going to need to do some profiling and stepping through code to see what's going on inside HDF5.

Just FYI, I run a simple test which writes data out, and I set it going using this loop, which generates slurm submission scripts for me and passes a ton of options to my test. The scripts run jobs for all node counts and procs-per-node counts from 1 to 64. Since the machine is not yet in production, I can get a lot of this done now.

for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096
do
  for NPERNODE in 1 2 4 8 16 32 64
  do
    write_script (...options)
  done
done

cmake - yes, and I'm also compiling with clang; I'm not trying to make anything easy for myself here :)

JB

Hi Rob,

On Tue, Sep 17, 2013 at 02:41:13PM -0500, David Knaak wrote:
> * Shared file performance will almost always be slower than file per
> process performance because of the need for extent locking by the file
> system.

On Wed, Sep 18, 2013 at 09:04:49AM -0500, Rob Latham wrote:

The extent locks are a problem, yes, but let's not discount the
overhead of creating N files. Now, on GPFS the story is abysmal,
and maybe Lustre creates files in /close to/ no time flat, but
creating a file does cost something.

Lustre is pretty fast with file creates. We are measuring on the order
of 20,000 creates/second on some configurations. But that still means
50 seconds for a million files (I'm thinking exascale). Worse is just
the management of that many files. So I am definitely an advocate of
single shared files. But users, of course, want the best of both worlds:
a single file at FPP (file-per-process) speed. I believe they will, in
reality, accept 50% of FPP performance, but not 10% or 1%.

> * What I am seeing so far analyzing simple IOR runs, comparing the MPIIO
> interface and the HDF5 interface, is that the MPI_File_set_size() call
> and the metadata writes done by HDF5 are both taking a lot of extra time.

Quincey will have to speak to this one, but I thought they greatly
reduced the number of MPI_File_set_size() calls in a recent release?

Yes, they did. At least in my IOR test, there was just one call made by
each rank. But that still took a lot of time. See next comment also.

> * MPI_File_set_size eventually calls ftruncate(), which has been reported
> to take a long time on Lustre file systems.

The biggest problem with MPI_File_set_size and ftruncate is that
MPI_File_set_size is collective. I don't know what changes Cray has
made to ROMIO, but for a long time ROMIO has had a "call ftruncate on
one processor" optimization. David can confirm if ADIOI_GEN_Resize or
its equivalent contains that optimization.

Yes, the Cray implementation does have that optimization. Given that,
it is still very surprising that a single call by a single rank takes
so much time.
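
For readers who have not seen that optimization, here is a minimal illustrative sketch of what "call ftruncate on one processor" means for a collective resize; it is not ROMIO's actual ADIOI_GEN_Resize code. The call stays collective, but only rank 0 touches the file system and the outcome is broadcast so every rank returns consistently.

#include <mpi.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

/* Collective resize: rank 0 performs the (potentially slow) ftruncate,
 * then broadcasts the result so all ranks agree on success or failure. */
int collective_resize(MPI_Comm comm, int fd, off_t new_size)
{
    int rank, err = 0;

    MPI_Comm_rank(comm, &rank);

    if (rank == 0 && ftruncate(fd, new_size) != 0)
        err = errno;

    MPI_Bcast(&err, 1, MPI_INT, 0, comm);

    return err;   /* 0 on success, an errno value otherwise */
}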

> * The metadata is written by multiple processes in small records to the
> same regions of the file. Some metadata always goes to the beginning of
> the file but some is written to other parts of the file. Both cause a
> lot of lock contention, which slows performance.

I've bugged the HDF5 guys about this since 2008. It's work in
progress under ExaHDF5 (I think), so there's hope that we will see a
scalable metadata approach soon.

I have begun a conversation with the HDF Group about this. Perhaps some
help from me on the MPI-IO side will make it easier for them to do it
sooner.
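
As a footnote to the aggregation point: HDF5 later grew file-access properties (in the 1.10 series, well after this thread) that make metadata operations collective, which is one form of the aggregation being asked for here. A minimal sketch, assuming HDF5 1.10+ built with parallel support; none of this was available to the posters at the time.

#include <hdf5.h>
#include <mpi.h>

/* Open a shared file through the MPI-IO driver with collective metadata
 * writes and reads enabled, so metadata changes are flushed collectively
 * rather than as many small independent writes to the same file regions. */
hid_t open_with_collective_metadata(const char *name, MPI_Comm comm, MPI_Info info)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    H5Pset_fapl_mpio(fapl, comm, info);
    H5Pset_coll_metadata_write(fapl, 1);   /* collective metadata writes */
    H5Pset_all_coll_metadata_ops(fapl, 1); /* collective metadata reads  */

    hid_t file = H5Fopen(name, H5F_ACC_RDWR, fapl);
    H5Pclose(fapl);
    return file;
}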

> I still need to verify what I think I am seeing. I don't know yet what
> can be done about either of these. But 5x or 10x slowdown is not
> acceptable.

One thing that is missing is a good understanding of what the MPT
tuning knobs can and cannot do. Another thing that is missing is a
way to detangle the I/O stack: if we had some way to say "this app
spent X% in hdf5-related things, Y% in MPI-IO things and Z% in
lustre-things", that would go a long way towards directing effort.

I used a combination of some internal tools to get a very clear picture
of where the time is spent. From this analysis, I have concluded that
there is nothing that MPI-IO can do to improve this. It will require,
I believe, a change to HDF5 for the metadata issue so that the metadata
can be aggregated, and a change to Lustre so that the ftruncate isn't so
slow. I will also be working the Lustre issue with Lustre developers.

Have you seen the work we've done with Darshan lately? Darshan had
some bad experiences on Lustre a few years back, but Phil Carns and
Yushu Yao have really whipped it into shape for Hopper (see Phil and
Yushu's recent CUG paper). It'd be nice to have Darshan on more Cray
systems. It's been a huge asset on Argonne's Blue Gene machines.

I became aware of Darshan a while ago but until this week, I have not
used it. I have now built it and will begin using it to see what else
I can learn about the HDF5 performance from it.

Thanks for your comments.
David


Rob,

Over the last couple of days, I've been able to rerun tests (here using h5perf) with the bglockless flag

export BGLOCKLESSMPIO_F_TYPE=0x47504653

and the results are greatly improved. Attached is one page of plots where we get up to 30 GB/s, which compares to just over 40 GB/s with IOR, so in the right range compared to expectations.

The difference that one flag can make is quite impressive. People need to know this!

Thanks

JB

[attached image: h5perf bandwidth plots]
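
For anyone who wants the same effect from application code rather than h5perf: the HDF5 MPI-IO driver passes the file name through to MPI_File_open, so ROMIO's "bglockless:" prefix (the spelling Rob mentioned, and the alternative to the environment variable used above) should also be usable there. A minimal sketch under that assumption; the path is a placeholder and the prefix behavior belongs to ROMIO, not HDF5.

#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Standard parallel-HDF5 file access through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* The "bglockless:" prefix is handed unchanged to MPI_File_open,
     * where ROMIO strips it and selects its lock-free GPFS path.
     * The path below is a placeholder. */
    hid_t file = H5Fcreate("bglockless:/gpfs/fs0/somedir/output.h5",
                           H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create datasets and write, collectively or independently ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}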

···

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Biddiscombe, John A.
Sent: 20 September 2013 21:47
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

Rob

Thanks for the info regarding settings and IOR config etc. I will go through that in detail over the next few days.

I plan on taking a crash course in debugging on BG/Q ASAP; my skills in this regard are little better than printf, and I'm going to need to do some profiling and stepping through code to see what's going on inside hdf5.

Just FYI, I run a simple test which writes data out and I set it going using this loop, which generates slurm submission scripts for me and passes a ton of options to my test. So the scripts run jobs on all node counts and procs-per-node counts from 1-64. Since the machine is not yet in production, I can get a lot of this done now.

for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096
do
  for NPERNODE in 1 2 4 8 16 32 64
  do
    write_script (...options)
  done
done

cmake - yes, I'm also compiling with clang, I'm not trying to make anything easy for myself here :)

JB


I have been following this thread with interest since we have the same issue in the synchrotron community, with new detectors generating 100's-1000's of 2D frames/sec and total rates approaching 10 GB/sec using multiple parallel 10 GbE streams from different detector nodes. What we have found is:

- Lustre is better at managing the pHDF5 contention between nodes than GPFS is.
- GPFS is better at streaming data from one node, if there is no contention.
- Having the nodes write to separate files is better than using pHDF5 to enable all nodes to write to one.

"Better" means a factor of 2-3 times, but we are still actively learning and we have more experience with Lustre than GPFS, so there may be some GPFS tweaks we are missing. The storage systems are comparable, both based on DDN SFA architecture and have ample throughput in simple "ior" tests. I think GPFS would also be comparable to Lustre at managing contention if all the data originated from one node, but we haven't been looking at this.

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.
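
For a sense of the programming model: the RFC described here later became the Virtual Dataset (VDS) feature released in HDF5 1.10, so the sketch below is an approximation built on that released API (H5Pset_virtual), not the RFC itself. File names, dataset names and dimensions are placeholders: two detector nodes each write a 1000x2048 block to their own file, and the master file exposes them as a single 2000x2048 dataset.

#include <hdf5.h>

int build_master(void)
{
    hsize_t vdims[2] = {2000, 2048};   /* extent of the virtual dataset */
    hsize_t sdims[2] = {1000, 2048};   /* extent of each source dataset */
    hsize_t start[2] = {0, 0};

    hid_t vspace = H5Screate_simple(2, vdims, NULL);
    hid_t sspace = H5Screate_simple(2, sdims, NULL);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);

    /* Map node0.h5:/data onto rows 0..999 of the virtual dataset. */
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, NULL, sdims, NULL);
    H5Pset_virtual(dcpl, vspace, "node0.h5", "/data", sspace);

    /* Map node1.h5:/data onto rows 1000..1999. */
    start[0] = 1000;
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, NULL, sdims, NULL);
    H5Pset_virtual(dcpl, vspace, "node1.h5", "/data", sspace);

    /* Readers open master.h5:/data and see one contiguous dataset. */
    hid_t file = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_FLOAT, vspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(dcpl);
    H5Sclose(sspace);
    H5Sclose(vspace);
    return 0;
}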

An additional benefit is that the underlying files can be compressed, so data rates may be reduced drastically, depending on your situation.

The status is that we have a draft RFC outlining the requirements, use cases and programming model, and The HDF Group is preparing an estimate. The work is not funded (I will be making a case to my directors for some of it), but if it strikes a chord I would be only too willing to share the RFC, particularly if there is any possibility of support coming available.

Cheers,

Nick Rees
Principal Software Engineer Phone: +44 (0)1235-778430
Diamond Light Source Fax: +44 (0)1235-446713

