Very poor performance of pHDF5 when using single (shared) file

Hi Daniel,

I'm not sure what the issue with the forum email list is, but nobody else seems to have this problem. Just make sure you always send your messages and replies to hdf-forum@lists.hdfgroup.org, not to another address.
I'll ask the sysadmins to look into this further.

Now to your results: the multiple-file strategy is always, or at least in most cases, going to be the fastest. There is no locking contention and no inter-process communication overhead.
The performance gap with the single-file strategy still seems rather large in your case, but I'm saying this without any knowledge of how your benchmark/application accesses the file. I do not believe chunking will help here.

One thing worth trying is varying the number of MPI aggregators. What MPI library are you using? The MPI-IO layer is most probably ROMIO, so it should accept info hints (the top-level implementation might ignore some of them, but it is worth checking anyway).
Use an MPI Info object, passed to H5Pset_fapl_mpio(), to set the number of MPI aggregators (cb_nodes) and the collective buffer size (cb_buffer_size). A full list of ROMIO hints can be found here:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide.pdf
I would set cb_nodes to the stripe count and try setting cb_buffer_size to the stripe size. Those are not necessarily the ideal values, but they are a good starting point.
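For example, something along these lines; the 160 / 1 MiB values and the file name below are just placeholders mirroring the stripe-count/stripe-size starting point above, not tuned recommendations:

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* ROMIO collective-buffering hints: number of aggregators (cb_nodes)
     * and collective buffer size in bytes (cb_buffer_size). */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "160");
    MPI_Info_set(info, "cb_buffer_size", "1048576");

    /* The hints travel with the file access property list. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create datasets and perform the parallel writes here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}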

I know all this tuning is a burden for an HDF5 application user, but that is what needs to be done today to get good performance. There has been some work on auto-tuning this parameter space with a separate tool, but it is not yet user-friendly enough for someone to simply grab, deploy, and run.

Thanks,
Mohamad

···

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel Langr
Sent: Tuesday, September 03, 2013 10:38 AM
To: hdf-forum@lists.hdfgroup.org
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

Mohamad,

I really do not understand how to reply to this forum :(. I tried to reply to your post, which I received via e-mail. In this e-mail, there was the following note:

"
If you reply to this email, your message will be added to the discussion
below:
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
"

So, I replied to this e-mail, and received another one:

Subject: Delivery Status Notification (Failure)

"
Delivery to the following recipient failed permanently:
ml-node+s184993n4026449h5@n3.nabble.com

Your email to ml-node+s184993n4026449h5@n3.nabble.com has been rejected
because you are not allowed to post to
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
. Please contact the owner about permissions or visit the Nabble Support
forum.
"

What the hell... why does it say I should reply and then that I am not
allowed to post to my own thread???

Anyway, I tried to post the following information:

I did some experiments yesterday on the BlueWaters cluster, where the
stripe count is limited to 160. For runs with 256 MPI processes/cores
and fixed datasets, the writing times were:

separate files: 1.36 [s]
single file, 1 stripe: 133.6 [s]
single file, best result: 17.2 [s]

(I did multiple runs with various combinations of stripe count and
stripe size; the numbers above are the best results I obtained.)

Increasing the number of stripes obviously helped a lot, but compared
with the separate-files strategy, writing is still more than ten times
slower. Do you think that is "normal"?

Might chunking help here?

Thanks,
Daniel

On 30. 8. 2013 16:05, Daniel Langr wrote:

I've run a benchmark in which, within an MPI program, each process wrote
3 plain 1D arrays to 3 datasets of an HDF5 file. I used the following
writing strategies:

1) each process writes to its own file,
2) each process writes to its own dataset in a single shared file,
3) all processes write to the same dataset in a single shared file.

I tested 1)-3) with both fixed and chunked datasets (chunk size 1024),
and I tested 2)-3) with both the independent and collective options of
the MPI-IO driver. I also ran the measurements on 3 different clusters
(all quite modern).
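
For reference, the shared-file / shared-dataset case 3) with collective transfers was set up essentially like the sketch below (simplified; the element count, file name, and dataset name are placeholders, not the actual benchmark code):

#include <mpi.h>
#include <hdf5.h>

#define N_LOCAL 1024            /* elements written by each rank (placeholder) */

/* Each rank writes a contiguous slice of one shared 1D dataset in a single
 * shared file; with strategy 2) each rank would instead create its own
 * dataset and write to that. */
static void write_shared(MPI_Comm comm, const double *buf)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Open one shared file through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One dataset sized for all ranks; this rank owns a contiguous slice. */
    hsize_t dims[1]  = { (hsize_t)nprocs * N_LOCAL };
    hsize_t count[1] = { N_LOCAL };
    hsize_t start[1] = { (hsize_t)rank * N_LOCAL };

    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* Collective transfer; H5FD_MPIO_INDEPENDENT gives the independent
     * variant instead. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N_LOCAL];
    for (int i = 0; i < N_LOCAL; ++i)
        buf[i] = (double)rank;  /* dummy data */

    write_shared(MPI_COMM_WORLD, buf);
    MPI_Finalize();
    return 0;
}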

As a result, the running (storage) times of the same-file strategies, i.e.
2) and 3), were orders of magnitude longer than those of the
separate-files strategy. For illustration:

cluster #1, 512 MPI processes, each process stores 100 MB of data, fixed
datasets:

1) separate files: 2.73 [s]
2) single file, independent calls, separate datasets: 88.54 [s]

cluster #2, 256 MPI processes, each process stores 100 MB of data,
chunked datasets (chunk size 1024):

1) separate files: 10.40 [s]
2) single file, independent calls, shared datasets: 295 [s]
3) single file, collective calls, shared datasets: 3275 [s]

Any idea why the single-file strategy gives such poor writing performance?

Daniel

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

Daniel,
I think you missed a very important paragraph at the top of the page:
"See www.hdfgroup.org for more information on HDF. The Nabble interface here is used for read-only access to the list archives. If you'd like to send messages to hdf-forum, you must be subscribed to the actual mailing list and send your messages through that email interface."
http://hdf-forum.184993.n3.nabble.com/

The Nabble.com site only displays the messages. You can't use that site to reply to the mailing list. You must send an email to the mailing list address itself, hdf-forum@lists.hdfgroup.org, from your own mail client (Gmail, Outlook, Apple Mail, etc.).

From your previous emails and your use of the word "forum", it seems you think nabble.com is an online forum for HDF (where users can post and respond). That is not the case: the hdf-forum is a mailing list, not an online forum.

I hope this clears things up.
Regards,
-Corey

--
Corey Bettenhausen
Science Systems and Applications, Inc
NASA Goddard Space Flight Center
301 614 5383
corey.bettenhausen@ssaihq.com

Hi Daniel,

As Mohamad alluded to, we have developed a framework for auto-tuning HDF5 applications, which will be presented at this year's Supercomputing conference:
http://sc13.supercomputing.org/schedule/event_detail.php?evid=pap511

I have recently installed this framework on Bluewaters. If you are interested in further increasing the I/O performance of your application, I think I will be able to help you; feel free to contact me directly to follow up.

Thanks,
Babak
