Hi Daniel,
I'm not sure what the issue with the forum email list is, but nobody else seems to have this problem. Just make sure you always send your messages and replies to hdf-forum@lists.hdfgroup.org and not to another address.
I'll ask the sysadmins to look into this issue more.
Now to your results: the multiple-file strategy is always (or at least in most cases) going to be the fastest. There is no locking contention and no inter-process communication overhead.
The performance difference with the single-file strategy still seems a bit high in your case, but I'm saying this without any knowledge of how your benchmark/application accesses the file. I do not believe chunking will help here.
One thing worth trying is varying the number of MPI aggregators. What MPI library are you using? The MPI-IO layer is most probably ROMIO, so it should accept info hints (I'm not sure whether the top-level implementation might ignore those hints, but you can check anyway).
So use an MPI info object, which you pass to H5Pset_fapl_mpio(), to set the number of MPI aggregators (cb_nodes) and the collective buffer size (cb_buffer_size). A full list of ROMIO hints can be found here:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide.pdf
I would set cb_nodes to the stripe count and try cb_buffer_size as the stripe size. Those are not necessarily the ideal settings, but it is best to start there.
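For what it's worth, here is a rough sketch (untested; the values are just placeholders for your actual stripe count and stripe size) of how those hints can be passed through the file-access property list:

    #include <hdf5.h>
    #include <mpi.h>

    /* Build a file-access property list carrying ROMIO collective-buffering
       hints; the numeric values below are placeholders. */
    hid_t make_tuned_fapl(MPI_Comm comm)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "160");            /* e.g. your stripe count */
        MPI_Info_set(info, "cb_buffer_size", "1048576");  /* e.g. your stripe size in bytes */

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, info);  /* the hints travel with the FAPL */

        MPI_Info_free(&info);                /* HDF5 keeps its own copy */
        return fapl;
    }

The returned property list then goes into H5Fcreate()/H5Fopen() as usual.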
I know that all this tuning is a burden for an application user of HDF5, but that is what needs to be done today to get good performance. There has been some work aimed at auto-tuning this parameter space with a separate tool, but it is not yet user-friendly enough for someone to simply grab, deploy, and run.
Thanks,
Mohamad
-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel Langr
Sent: Tuesday, September 03, 2013 10:38 AM
To: hdf-forum@lists.hdfgroup.org
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file
Mohamad,
I really do not understand how to reply to this forum :(. I tried to reply to your post, which I received via e-mail. That e-mail contained the following note:
"
If you reply to this email, your message will be added to the discussion
below:
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
"
So, I replied to this e-mail, and received another one:
Subject: Delivery Status Notification (Failure)
"
Delivery to the following recipient failed permanently:
ml-node+s184993n4026449h5@n3.nabble.com
Your email to ml-node+s184993n4026449h5@n3.nabble.com has been rejected
because you are not allowed to post to
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
. Please contact the owner about permissions or visit the Nabble Support
forum.
"
What the hell... why does it say I should reply and then that I am not
allowed to post to my own thread???
Anyway, I tried to post the following information:
I did some experiments yesterday using the BlueWaters cluster. The
stripe count there is limited to 160. For runs with 256 MPI
processes/cores and fixed datasets, the writing times were:
separate files: 1.36 [s]
single file, 1 stripe: 133.6 [s]
single file, best result: 17.2 [s]
(I did multiple runs with various combinations of stripe count and size;
I'm presenting the best results I obtained.)
Increasing the number of stripes obviously helped a lot, but compared
with the separate-files strategy, the writing time is still more than
ten times slower. Do you think this is "normal"?
Might chunking help here?
Thanks,
Daniel
On 30 Aug 2013 at 16:05, Daniel Langr wrote:
I've run some benchmarks where, within an MPI program, each process wrote
3 plain 1D arrays to 3 datasets of an HDF5 file. I've used the following
writing strategies:
1) each process writes to its own file,
2) each process writes to the same file to its own dataset,
3) each process writes to the same file to a same dataset.
I've tested 1)-3) for both fixed/chunked datasets (chunk size 1024), and
I've tested 2)-3) for both independent/collective options of the MPI
driver. I've also used 3 different clusters for measurements (all quite
modern).
As a result, the running (storage) times of the same-file strategy, i.e.
2) and 3), were orders of magnitude longer than the running times of
the separate-files strategy. For illustration:
cluster #1, 512 MPI processes, each process stores 100 MB of data, fixed
data sets:
1) separate files: 2.73 [s]
2) single file, independent calls, separate data sets: 88.54 [s]
cluster #2, 256 MPI processes, each process stores 100 MB of data,
chunked data sets (chunk size 1024):
1) separate files: 10.40 [s]
2) single file, independent calls, shared data sets: 295 [s]
3) single file, collective calls, shared data sets: 3275 [s]
Any idea why the single-file strategy gives such poor writing performance?
Daniel
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org