Very poor performance of pHDF5 when using single (shared) file

robl · September 27, 2013, 3:14pm

The difference that one flag can make is quite impressive. People need to know this!

Oh John Oh John.... I cannot tell you how angry that flag makes me!

'bglockless:' was supposed to be a short-term hack. It was written
for the PVFS file system (which did not support fcntl()-style locks,
or any locks at all for that matter). Then we found out it helped
GPFS on BlueGene too.

I'm going to have to just sit down for a couple-five days and send IBM
a patch removing all the locks from the default driver, and telling
anyone who wants to run MPI-IO to an NFS file system on a blue gene to
take a hike.

Thanks for the graphs. I was surprised to see that fewer than 8 cores
per node resulted in slightly *worse* performance for collective I/O.

==rob

···

On Fri, Sep 27, 2013 at 02:58:32PM +0000, Biddiscombe, John A. wrote:

[cid:image001.jpg@01CEBBA2.CC085FC0]

> -----Original Message-----

> From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf

> Of Biddiscombe, John A.

> Sent: 20 September 2013 21:47

> To: HDF Users Discussion List

> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single

> (shared) file

>

> Rob

>

> Thanks for the info regarding settings and IOR config etc I wil go through that

> in detail over the next few days.

>

> I plan on taking a crash course in debugging on BG/Q ASAP, my skills in this

> regard are little better than printf and I'm going to need to do some profiling

> and stepping through code to see what's going on inside hdf5.

>

> Just FYI. I run a simple test which writes data out and I set it going using this

> loop, which generates slurm submission scripts for me and passes a ton of

> options to my test. So the scripts run jobs on all node counts and

> procspercore count from 1-64. Since the machine is not yet in production, I

> can get a lot of this done now.

>

> for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 do

> for NPERNODE in 1 2 4 8 16 32 64

> do

> write_script (...options)

> done

> done

>

> cmake - yes, I'm also compiling with clang, I'm not trying to make anything

> easy for myself here

>

> JB

>

> > -----Original Message-----

> > From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On

> > Behalf Of Rob Latham

> > Sent: 20 September 2013 17:03

> > To: HDF Users Discussion List

> > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> > single

> > (shared) file

> >

> > On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:

> > > This morning, I did some poking around and found that the cmake

> > > based configure of hdf has a nasty bug that causes H5_HAVE_GPFS to

> > > be set to false and no GPFS optimizations are compiled in (libgpfs

> > > is not detected). Having tweaked that, you can imagine my happiness

> > > when I recompiled everything and now I'm getting even worse

> Bandwidth.

> >

> > Thanks for the report on those hints. HDF5 contains, outside of

> > gpfs-specific benchmarks, one of the few implementations of all the

> > gpfs_fcntl() tuning parameters. Given your experience, probably best

> > to turn off those hints.

> >

> > Also, cmake works on bluegene? wow. Don't forget that bluegene

> > requires cross compliation.

> >

> > > In fact if I enable collective IO, the app coredumps on me, so the

> > > situations is worse than I had feared. I'm using too much memory in

> > > my test I suspect and collectives are pushing me over the limit. The

> > > only test I can run with collective enabled is the one that uses

> > > only one rank and writes 16MB!

> >

> > How many processes per node are you using on your BGQ? if you are

> > loading up with 64 procs per node, that will give each one about

> > 200-230 MiB of scratch space.

> >

> > I wonder if you have built some or all of your hdf5 library for the

> > front end nodes, and some or none for the compute nodes?

> >

> > How many processes are you running here?

> >

> > A month back I ran some one-rack experiments:

> >

> https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compar

> > e.png

> >

> > Here's my IOR config file. Note two tuning parameters here:

> > - "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low

> > for Blue Gene /Q

> > - the 'bglockless' prefix is "robl's secret turbo button". it was fun

> > to pull that rabbit out of the hat... for the first few years.

> > (it's not the default because in one specific case performance is

> > shockingly poor).

> >

> > IOR START

> > numTasks=65536

> > repetitions=3

> > reorderTasksConstant=1024

> > fsync=1

> > transferSize=6M

> > blockSize=6M

> > collective=1

> > showHints=1

> > hintsFileName=IOR-hints-bg_nodes_pset.64

> >

> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> > api.mpi

> > api=MPIIO

> > RUN

> > api=HDF5

> >

> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> > api.h5

> > RUN

> > api=NCMPI

> >

> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> > api.nc

> > RUN

> > IOR STOP

> >

> >

> > > Rob : you mentioned some fcntl functions were deprecated etc. do I

> > > need to remove these to stop the coredumps? (I'm very much hoping

> > > something has gone wrong with my tests because the performance is

> > > shockingly bad ... ) (NB. my Version is 1.8.12-snap17)

> >

> > Unless you are running BGQ system software driver V1R2M1, the

> > gpfs_fcntl hints do not get forwarded to storage, and return an error.

> > It's possible HDF5 responds to that error with a core dump?

> >

> > ==rob

> >

> >

> > > JB

> > >

> > > > -----Original Message----- From: Hdf-forum

> > > > [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel

> > > > Langr Sent: 20 September 2013 13:46 To: HDF Users Discussion List

> > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> > > > single (shared) file

> > > >

> > > > Rob,

> > > >

> > > > thanks a lot for hints. I will look at the suggested option and

> > > > try some experiments with it :).

> > > >

> > > > Daniel

> > > >

> > > >

> > > >

> > > > Dne 17. 9. 2013 15:34, Rob Latham napsal(a):

> > > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:

> > > > >> separate files: 1.36 [s] single file, 1 stripe: 133.6 [s]

> > > > >> single file, best result: 17.2 [s]

> > > > >>

> > > > >> (I did multiple runs with various combinations of strip count

> > > > >> and size, presenting the best results I have obtained.)

> > > > >>

> > > > >> Increasing the number of stripes obviously helped a lot, but

> > > > >> comparing with the separate-files strategy, the writing time is

> > > > >> still more than ten times slower . Do you think it is "normal"?

> > > > >

> > > > > It might be "normal" for Lustre, but it's not good. I wish I

> > > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I do

> not.

> > > > > The ADIOS folks report tuned-HDF5 to a single shared file runs

> > > > > about 60% slower than ADIOS to multiple files, not 10x slower,

> > > > > so it seems there is room for improvement.

> > > > >

> > > > > I've asked them about the kinds of things "tuned HDF5" entails,

> > > > > and they didn't know (!).

> > > > >

> > > > > There are quite a few settings documented in the intro_mpi(3)

> > > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most

> > > > > important thing you can try. I'm sorry to report that in my

> > > > > limited experience, the documentation and reality are sometimes

> > > > > out of sync, especially with respect to which settings are

> > > > > default or not.

> > > > >

> > > > > ==rob

> > > > >

> > > > >> Thanks, Daniel

> > > > >>

> > > > >> Dne 30. 8. 2013 16:05, Daniel Langr napsal(a):

> > > > >>> I've run some benchmark, where within an MPI program, each

> > > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.

> > > > >>> I've used the following writing strategies:

> > > > >>>

> > > > >>> 1) each process writes to its own file, 2) each process writes

> > > > >>> to the same file to its own dataset, 3) each process writes to

> > > > >>> the same file to a same dataset.

> > > > >>>

> > > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size

> > > > >>> 1024), and I've tested 2)-3) for both independent/collective

> > > > >>> options of the MPI driver. I've also used 3 different clusters

> > > > >>> for measurements (all quite modern).

> > > > >>>

> > > > >>> As a result, the running (storage) times of the same-file

> > > > >>> strategy, i.e. 2) and 3), were of orders of magnitudes longer

> > > > >>> than the running times of the separate-files strategy. For

> > > > >>> illustration:

> > > > >>>

> > > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of

> > > > >>> data, fixed data sets:

> > > > >>>

> > > > >>> 1) separate files: 2.73 [s] 2) single file, independent calls,

> > > > >>> separate data sets: 88.54[s]

> > > > >>>

> > > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of

> > > > >>> data, chunked data sets (chunk size 1024):

> > > > >>>

> > > > >>> 1) separate files: 10.40 [s] 2) single file, independent

> > > > >>> calls, shared data sets: 295 [s] 3) single file, collective

> > > > >>> calls, shared data sets: 3275 [s]

> > > > >>>

> > > > >>> Any idea why the single-file strategy gives so poor writing

> > > > >>> performance?

> > > > >>>

> > > > >>> Daniel

> > > > >>

> > > > >> _______________________________________________ Hdf-

> > forum is for

> > > > >> HDF software users discussion.

> > > > >> Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > > >> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists

> > > > >> .h

> > > > >> dfgr

> > > > >> oup.org

> > > > >

> > > >

> > > > _______________________________________________ Hdf-

> forum

> > is for HDF

> > > > software users discussion. Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hd

> > > > fg

> > > > roup.org

> > >

> > > _______________________________________________ Hdf-forum

> is

> > for HDF

> > > software users discussion. Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfg

> > > ro

> > > up.org

> >

> > --

> > Rob Latham

> > Mathematics and Computer Science Division Argonne National Lab, IL USA

> >

> > _______________________________________________

> > Hdf-forum is for HDF software users discussion.

> > Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgro

> > up.org

>

> _______________________________________________

> Hdf-forum is for HDF software users discussion.

> Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

John_Biddiscombe · September 27, 2013, 7:21pm

Rob

Thanks for the graphs. I was surprised to see that fewer than 8 cores per node resulted in slightly *worse* performance for collective I/O.
<

me too. My theory is that as BGQ has IO forwarding, for the small node counts (=small data transfers too) the forwarding directly is more effective than collecting first and then forwarding.

I am also a little puzzled the shape of the graphs later as independent drops and collective overtakes it. I presume the effect of latency on many writes to IONs is causing performance to drop whilst the collective mode avoids some of this.
This page of graphs is one of many for different file system configs and they all show the same pattern to greater or lesser degree. I shall do more tests I expect until I am happy that I fully understand what's going on.

JB

Nathanael_Huebbe · October 1, 2013, 10:58am

Hello John,
these graphs really look interesting. But I have one question remaining:
How much data did every process write in one go? Was this datasize fixed
for a single process? Or was the total amount of data written in each
round fixed? I. e. did you use weak or strong scaling?

Cheers,
Nathanael H�bbe

···

On 09/27/2013 04:58 PM, Biddiscombe, John A. wrote:

Rob,

Over the last couple of days, I've been able to rerun tests (here
using h5perf) with the bglockless flag

export BGLOCKLESSMPIO_F_TYPE=0x47504653

and the results are greatly improved. Attached is one page of plots
where we get up to 30GB/s which compares to just over 40 with IOR, so
in the right range compared to expectations.

The difference that one flag can make is quite impressive. People need
to know this!

Thanks

JB

> -----Original Message-----

> From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf

> Of Biddiscombe, John A.

> Sent: 20 September 2013 21:47

> To: HDF Users Discussion List

> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
single

> (shared) file

>

> Rob

>

> Thanks for the info regarding settings and IOR config etc I wil go
through that

> in detail over the next few days.

>

> I plan on taking a crash course in debugging on BG/Q ASAP, my skills
in this

> regard are little better than printf and I'm going to need to do
some profiling

> and stepping through code to see what's going on inside hdf5.

>

> Just FYI. I run a simple test which writes data out and I set it
going using this

> loop, which generates slurm submission scripts for me and passes a
ton of

> options to my test. So the scripts run jobs on all node counts and

> procspercore count from 1-64. Since the machine is not yet in
production, I

> can get a lot of this done now.

>

> for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 do

> for NPERNODE in 1 2 4 8 16 32 64

> do

> write_script (...options)

> done

> done

>

> cmake - yes, I'm also compiling with clang, I'm not trying to make
anything

> easy for myself here

>

> JB

>

> > -----Original Message-----

> > From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On

> > Behalf Of Rob Latham

> > Sent: 20 September 2013 17:03

> > To: HDF Users Discussion List

> > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> > single

> > (shared) file

> >

> > On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:

> > > This morning, I did some poking around and found that the cmake

> > > based configure of hdf has a nasty bug that causes H5_HAVE_GPFS to

> > > be set to false and no GPFS optimizations are compiled in (libgpfs

> > > is not detected). Having tweaked that, you can imagine my happiness

> > > when I recompiled everything and now I'm getting even worse

> Bandwidth.

> >

> > Thanks for the report on those hints. HDF5 contains, outside of

> > gpfs-specific benchmarks, one of the few implementations of all the

> > gpfs_fcntl() tuning parameters. Given your experience, probably best

> > to turn off those hints.

> >

> > Also, cmake works on bluegene? wow. Don't forget that bluegene

> > requires cross compliation.

> >

> > > In fact if I enable collective IO, the app coredumps on me, so the

> > > situations is worse than I had feared. I'm using too much memory in

> > > my test I suspect and collectives are pushing me over the limit. The

> > > only test I can run with collective enabled is the one that uses

> > > only one rank and writes 16MB!

> >

> > How many processes per node are you using on your BGQ? if you are

> > loading up with 64 procs per node, that will give each one about

> > 200-230 MiB of scratch space.

> >

> > I wonder if you have built some or all of your hdf5 library for the

> > front end nodes, and some or none for the compute nodes?

> >

> > How many processes are you running here?

> >

> > A month back I ran some one-rack experiments:

> >

> https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compar

> > e.png

> >

> > Here's my IOR config file. Note two tuning parameters here:

> > - "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low

> > for Blue Gene /Q

> > - the 'bglockless' prefix is "robl's secret turbo button". it was fun

> > to pull that rabbit out of the hat... for the first few years.

> > (it's not the default because in one specific case performance is

> > shockingly poor).

> >

> > IOR START

> > numTasks=65536

> > repetitions=3

> > reorderTasksConstant=1024

> > fsync=1

> > transferSize=6M

> > blockSize=6M

> > collective=1

> > showHints=1

> > hintsFileName=IOR-hints-bg_nodes_pset.64

> >

> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> > api.mpi

> > api=MPIIO

> > RUN

> > api=HDF5

> >

> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> > api.h5

> > RUN

> > api=NCMPI

> >

> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> > api.nc

> > RUN

> > IOR STOP

> >

> >

> > > Rob : you mentioned some fcntl functions were deprecated etc. do I

> > > need to remove these to stop the coredumps? (I'm very much hoping

> > > something has gone wrong with my tests because the performance is

> > > shockingly bad ... ) (NB. my Version is 1.8.12-snap17)

> >

> > Unless you are running BGQ system software driver V1R2M1, the

> > gpfs_fcntl hints do not get forwarded to storage, and return an error.

> > It's possible HDF5 responds to that error with a core dump?

> >

> > ==rob

> >

> >

> > > JB

> > >

> > > > -----Original Message----- From: Hdf-forum

> > > > [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel

> > > > Langr Sent: 20 September 2013 13:46 To: HDF Users Discussion List

> > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> > > > single (shared) file

> > > >

> > > > Rob,

> > > >

> > > > thanks a lot for hints. I will look at the suggested option and

> > > > try some experiments with it :).

> > > >

> > > > Daniel

> > > >

> > > >

> > > >

> > > > Dne 17. 9. 2013 15:34, Rob Latham napsal(a):

> > > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:

> > > > >> separate files: 1.36 [s] single file, 1 stripe: 133.6 [s]

> > > > >> single file, best result: 17.2 [s]

> > > > >>

> > > > >> (I did multiple runs with various combinations of strip count

> > > > >> and size, presenting the best results I have obtained.)

> > > > >>

> > > > >> Increasing the number of stripes obviously helped a lot, but

> > > > >> comparing with the separate-files strategy, the writing time is

> > > > >> still more than ten times slower . Do you think it is "normal"?

> > > > >

> > > > > It might be "normal" for Lustre, but it's not good. I wish I

> > > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but
I do

> not.

> > > > > The ADIOS folks report tuned-HDF5 to a single shared file runs

> > > > > about 60% slower than ADIOS to multiple files, not 10x slower,

> > > > > so it seems there is room for improvement.

> > > > >

> > > > > I've asked them about the kinds of things "tuned HDF5" entails,

> > > > > and they didn't know (!).

> > > > >

> > > > > There are quite a few settings documented in the intro_mpi(3)

> > > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most

> > > > > important thing you can try. I'm sorry to report that in my

> > > > > limited experience, the documentation and reality are sometimes

> > > > > out of sync, especially with respect to which settings are

> > > > > default or not.

> > > > >

> > > > > ==rob

> > > > >

> > > > >> Thanks, Daniel

> > > > >>

> > > > >> Dne 30. 8. 2013 16:05, Daniel Langr napsal(a):

> > > > >>> I've run some benchmark, where within an MPI program, each

> > > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.

> > > > >>> I've used the following writing strategies:

> > > > >>>

> > > > >>> 1) each process writes to its own file, 2) each process writes

> > > > >>> to the same file to its own dataset, 3) each process writes to

> > > > >>> the same file to a same dataset.

> > > > >>>

> > > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size

> > > > >>> 1024), and I've tested 2)-3) for both independent/collective

> > > > >>> options of the MPI driver. I've also used 3 different clusters

> > > > >>> for measurements (all quite modern).

> > > > >>>

> > > > >>> As a result, the running (storage) times of the same-file

> > > > >>> strategy, i.e. 2) and 3), were of orders of magnitudes longer

> > > > >>> than the running times of the separate-files strategy. For

> > > > >>> illustration:

> > > > >>>

> > > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of

> > > > >>> data, fixed data sets:

> > > > >>>

> > > > >>> 1) separate files: 2.73 [s] 2) single file, independent calls,

> > > > >>> separate data sets: 88.54[s]

> > > > >>>

> > > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of

> > > > >>> data, chunked data sets (chunk size 1024):

> > > > >>>

> > > > >>> 1) separate files: 10.40 [s] 2) single file, independent

> > > > >>> calls, shared data sets: 295 [s] 3) single file, collective

> > > > >>> calls, shared data sets: 3275 [s]

> > > > >>>

> > > > >>> Any idea why the single-file strategy gives so poor writing

> > > > >>> performance?

> > > > >>>

> > > > >>> Daniel

> > > > >>

> > > > >> _______________________________________________ Hdf-

> > forum is for

> > > > >> HDF software users discussion.

> > > > >> Hdf-forum@lists.hdfgroup.org
<mailto:Hdf-forum@lists.hdfgroup.org>

> > > > >> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists

> > > > >> .h

> > > > >> dfgr

> > > > >> oup.org

> > > > >

> > > >

> > > > _______________________________________________ Hdf-

> forum

> > is for HDF

> > > > software users discussion. Hdf-forum@lists.hdfgroup.org
<mailto:Hdf-forum@lists.hdfgroup.org>

> > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hd

> > > > fg

> > > > roup.org

> > >

> > > _______________________________________________ Hdf-forum

> is

> > for HDF

> > > software users discussion. Hdf-forum@lists.hdfgroup.org
<mailto:Hdf-forum@lists.hdfgroup.org>

> > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfg

> > > ro

> > > up.org

> >

> > --

> > Rob Latham

> > Mathematics and Computer Science Division Argonne National Lab, IL USA

> >

> > _______________________________________________

> > Hdf-forum is for HDF software users discussion.

> > Hdf-forum@lists.hdfgroup.org <mailto:Hdf-forum@lists.hdfgroup.org>

> > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgro

> > up.org

>

> _______________________________________________

> Hdf-forum is for HDF software users discussion.

> Hdf-forum@lists.hdfgroup.org <mailto:Hdf-forum@lists.hdfgroup.org>

>
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

robl · September 19, 2013, 2:15pm

I have been following this thread with interest since we have the same issue in the synchrotron community, with new detectors generating 100's-1000's of 2D frames/sec and total rates approaching 10 GB/sec using multiple parallel 10 GbE streams from different detector nodes. What we have found is:

- Lustre is better at managing the pHDF5 contention between nodes than GPFS is.
- GPFS is better at streaming data from one node, if there is no contention.
- Having the nodes write to separate files is better than using pHDF5 to enable all nodes to write to one.

I would wager a tasty beverage or a box of donuts that the reason you
seen poor performance with GPFS to a shared file is because your
writes are not aligned to file system block boundaries. On large HPC
systems, the MPI-IO layer will often take care of that file system
block boundary alignment for you -- *if* you turn on collective I/O.

If you are using independent POSIX i/o then there won't be much HDF5
or MPI-IO can do to help you out.

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.

An additional benefit is the data can be compressed, so data rates may be able to be reduced drastically by compression, depending on your situation.

You're proposing something akin to ADIOS, except the interface
continues to be the community-standard HDF5. how interesting!

This approach will make it impossible to benefit from several
collective MPI-I/O optimizations, but it does open the door to another
family of optimizations (one would likely trawl the many ADIOS
publications for ideas).

==rob

···

On Thu, Sep 19, 2013 at 08:43:48AM +0000, nick.rees@diamond.ac.uk wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Nathanael_Huebbe · September 20, 2013, 11:33am

As a matter of fact, this is pretty much what we did already for our own
research: We, too, patched the HDF5 library to provide writing of
multiple files and reading them back in a way entirely transparent to
the application. You can find our patch, along with a much more detailed
description, on our website:
http://www.wr.informatik.uni-hamburg.de/research/projects/icomex/multifilehdf5

On our system, we could actually see an improvement in wall-clock time
for the entire process of writing-reconstructing-reading as opposed to
writing to a shared file and reading it single stream. This may be
different on other systems, but at least we expect a huge benefit in
CPU-time since the multifile approach allows the parallel part of the
workflow to be fast.

Of course, we are very interested to hear about other people's
experiences with transparent multifiles.

Cheers,
Nathanael Hübbe

···

On 09/19/2013 10:43 AM, nick.rees@diamond.ac.uk wrote:

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.

robl · September 27, 2013, 8:45pm

Collective I/O will do a few things that can help:

- aggregate i/o accesses down to a smaller number of "i/o
  aggregators". This happens at the MPI-IO layer before the I/O nodes
  are involved, but IBM has optimized this process for Blue Gene such
  that I/O nodes are elected as i/o aggregators based on their
  relationship to I/o forwarding nodes. These aggregators then will
  make fewer I/O requests, and typically larger ones at that.

- align accesses. let's say each process does 1000000 byte writes,
  but your GPFS file system has a 4 MiB (i.e. not 4000000 but rather
  4194304 bytes) block size. GPFS does really well when accesses are
  aligned to a multiple of the block size. The MPI-IO layer will
  shuffle accesses around a bit, with the end result that aggregators
  will for the most part do their i/o to one or more non-shared GPFS
  file system blocks.

Independent access just goes to the file system. There's no way to
optimize them.

···

On Fri, Sep 27, 2013 at 07:21:58PM +0000, Biddiscombe, John A. wrote:

I am also a little puzzled the shape of the graphs later as independent drops and collective overtakes it. I presume the effect of latency on many writes to IONs is causing performance to drop whilst the collective mode avoids some of this.
This page of graphs is one of many for different file system configs and they all show the same pattern to greater or lesser degree. I shall do more tests I expect until I am happy that I fully understand what's going on.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

John_Biddiscombe · October 15, 2013, 8:55am

Nathanael

sorry for a very late reply. I had couple of weeks away.

For this set of tests, each process generated 16MB of data so the file would be 16MB on 1 core, 1 node, and rise to 16*32*4096MB = 2TB on 32 cores of 4096 nodes etc etc. I can't remember the largest on this config.

you can think of this as a kind of weak scaling. but it's not so easy to strong scale when memory per node is limited. I have another test which allocates 2GB per node and then writes it out from 1 core, 2 cores etc etc and does this for 1 node up to 4096. I was going to run this to see if it made any real difference, but it ought to be quite similar except maybe slightly better for the low core/node counts. My scratch filesystem has been wiuped so I've lost all my tests and will have to recompile everything.

JB

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of huebbe
Sent: 01 October 2013 12:58
To: hdf-forum@lists.hdfgroup.org
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

Hello John,
these graphs really look interesting. But I have one question remaining: How much data did every process write in one go? Was this datasize fixed for a single process? Or was the total amount of data written in each round fixed? I. e. did you use weak or strong scaling?

Cheers,
Nathanael Hübbe

On 09/27/2013 04:58 PM, Biddiscombe, John A. wrote:

Rob,

Over the last couple of days, I've been able to rerun tests (here using h5perf) with the bglockless flag

export BGLOCKLESSMPIO_F_TYPE=0x47504653

and the results are greatly improved. Attached is one page of plots where we get up to 30GB/s which compares to just over 40 with IOR, so in the right range compared to expectations.

The difference that one flag can make is quite impressive. People need to know this!

Thanks

JB

[cid:image001.jpg@01CEC995.12299C30]

-----Original Message-----

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf

Of Biddiscombe, John A.

Sent: 20 September 2013 21:47

To: HDF Users Discussion List

Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single

(shared) file

Rob

Thanks for the info regarding settings and IOR config etc I wil go through that

in detail over the next few days.

I plan on taking a crash course in debugging on BG/Q ASAP, my skills in this

regard are little better than printf and I'm going to need to do some profiling

and stepping through code to see what's going on inside hdf5.

Just FYI. I run a simple test which writes data out and I set it going using this

loop, which generates slurm submission scripts for me and passes a ton of

options to my test. So the scripts run jobs on all node counts and

procspercore count from 1-64. Since the machine is not yet in production, I

can get a lot of this done now.

for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 do

for NPERNODE in 1 2 4 8 16 32 64

do

write_script (...options)

done

cmake - yes, I'm also compiling with clang, I'm not trying to make anything

easy for myself here

JB

> -----Original Message-----

> From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On

> Behalf Of Rob Latham

> Sent: 20 September 2013 17:03

> To: HDF Users Discussion List

> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> single

> (shared) file

>

> On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:

> > This morning, I did some poking around and found that the cmake

> > based configure of hdf has a nasty bug that causes H5_HAVE_GPFS to

> > be set to false and no GPFS optimizations are compiled in (libgpfs

> > is not detected). Having tweaked that, you can imagine my happiness

> > when I recompiled everything and now I'm getting even worse

Bandwidth.

>

> Thanks for the report on those hints. HDF5 contains, outside of

> gpfs-specific benchmarks, one of the few implementations of all the

> gpfs_fcntl() tuning parameters. Given your experience, probably best

> to turn off those hints.

>

> Also, cmake works on bluegene? wow. Don't forget that bluegene

> requires cross compliation.

>

> > In fact if I enable collective IO, the app coredumps on me, so the

> > situations is worse than I had feared. I'm using too much memory in

> > my test I suspect and collectives are pushing me over the limit. The

> > only test I can run with collective enabled is the one that uses

> > only one rank and writes 16MB!

>

> How many processes per node are you using on your BGQ? if you are

> loading up with 64 procs per node, that will give each one about

> 200-230 MiB of scratch space.

>

> I wonder if you have built some or all of your hdf5 library for the

> front end nodes, and some or none for the compute nodes?

>

> How many processes are you running here?

>

> A month back I ran some one-rack experiments:

>

Dropbox - mira_hinted_api_compare.png - Simplify your life

> e.png

>

> Here's my IOR config file. Note two tuning parameters here:

> - "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low

> for Blue Gene /Q

> - the 'bglockless' prefix is "robl's secret turbo button". it was fun

> to pull that rabbit out of the hat... for the first few years.

> (it's not the default because in one specific case performance is

> shockingly poor).

>

> IOR START

> numTasks=65536

> repetitions=3

> reorderTasksConstant=1024

> fsync=1

> transferSize=6M

> blockSize=6M

> collective=1

> showHints=1

> hintsFileName=IOR-hints-bg_nodes_pset.64

>

> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> api.mpi

> api=MPIIO

> RUN

> api=HDF5

>

> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> api.h5

> RUN

> api=NCMPI

>

> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> api.nc

> RUN

> IOR STOP

>

> > Rob : you mentioned some fcntl functions were deprecated etc. do I

> > need to remove these to stop the coredumps? (I'm very much hoping

> > something has gone wrong with my tests because the performance is

> > shockingly bad ... ) (NB. my Version is 1.8.12-snap17)

>

> Unless you are running BGQ system software driver V1R2M1, the

> gpfs_fcntl hints do not get forwarded to storage, and return an error.

> It's possible HDF5 responds to that error with a core dump?

>

> ==rob

>

> > JB

> >

> > > -----Original Message----- From: Hdf-forum

> > > [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel

> > > Langr Sent: 20 September 2013 13:46 To: HDF Users Discussion List

> > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> > > single (shared) file

> > >

> > > Rob,

> > >

> > > thanks a lot for hints. I will look at the suggested option and

> > > try some experiments with it :).

> > >

> > > Daniel

> > >

> > > Dne 17. 9. 2013 15:34, Rob Latham napsal(a):

> > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:

> > > >> separate files: 1.36 [s] single file, 1 stripe: 133.6 [s]

> > > >> single file, best result: 17.2 [s]

> > > >>

> > > >> (I did multiple runs with various combinations of strip count

> > > >> and size, presenting the best results I have obtained.)

> > > >>

> > > >> Increasing the number of stripes obviously helped a lot, but

> > > >> comparing with the separate-files strategy, the writing time is

> > > >> still more than ten times slower . Do you think it is "normal"?

> > > >

> > > > It might be "normal" for Lustre, but it's not good. I wish I

> > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I do

not.

> > > > The ADIOS folks report tuned-HDF5 to a single shared file runs

> > > > about 60% slower than ADIOS to multiple files, not 10x slower,

> > > > so it seems there is room for improvement.

> > > >

> > > > I've asked them about the kinds of things "tuned HDF5" entails,

> > > > and they didn't know (!).

> > > >

> > > > There are quite a few settings documented in the intro_mpi(3)

> > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most

> > > > important thing you can try. I'm sorry to report that in my

> > > > limited experience, the documentation and reality are sometimes

> > > > out of sync, especially with respect to which settings are

> > > > default or not.

> > > >

> > > > ==rob

> > > >

> > > >> Thanks, Daniel

> > > >>

> > > >> Dne 30. 8. 2013 16:05, Daniel Langr napsal(a):

> > > >>> I've run some benchmark, where within an MPI program, each

> > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.

> > > >>> I've used the following writing strategies:

> > > >>>

> > > >>> 1) each process writes to its own file, 2) each process writes

> > > >>> to the same file to its own dataset, 3) each process writes to

> > > >>> the same file to a same dataset.

> > > >>>

> > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size

> > > >>> 1024), and I've tested 2)-3) for both independent/collective

> > > >>> options of the MPI driver. I've also used 3 different clusters

> > > >>> for measurements (all quite modern).

> > > >>>

> > > >>> As a result, the running (storage) times of the same-file

> > > >>> strategy, i.e. 2) and 3), were of orders of magnitudes longer

> > > >>> than the running times of the separate-files strategy. For

> > > >>> illustration:

> > > >>>

> > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of

> > > >>> data, fixed data sets:

> > > >>>

> > > >>> 1) separate files: 2.73 [s] 2) single file, independent calls,

> > > >>> separate data sets: 88.54[s]

> > > >>>

> > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of

> > > >>> data, chunked data sets (chunk size 1024):

> > > >>>

> > > >>> 1) separate files: 10.40 [s] 2) single file, independent

> > > >>> calls, shared data sets: 295 [s] 3) single file, collective

> > > >>> calls, shared data sets: 3275 [s]

> > > >>>

> > > >>> Any idea why the single-file strategy gives so poor writing

> > > >>> performance?

> > > >>>

> > > >>> Daniel

> > > >>

> > > >> _______________________________________________ Hdf-

> forum is for

> > > >> HDF software users discussion.

> > > >> Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > >> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists

> > > >> .h

> > > >> dfgr

> > > >> oup.org

> > > >

> > >

> > > _______________________________________________ Hdf-

forum

> is for HDF

> > > software users discussion. Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hd

> > > fg

> > > roup.org

> >

> > _______________________________________________ Hdf-forum

is

> for HDF

> > software users discussion. Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfg

> > ro

> > up.org

>

> --

> Rob Latham

> Mathematics and Computer Science Division Argonne National Lab, IL USA

>

> _______________________________________________

> Hdf-forum is for HDF software users discussion.

> Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgro

> up.org

_______________________________________________

Hdf-forum is for HDF software users discussion.

Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

_______________________________________________

Hdf-forum is for HDF software users discussion.

Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

Nick_Rees · September 20, 2013, 11:20am

Hi Rob,

I think the blocks should be aligned. The detector blocks are exactly 1MB in size and we chunk 4 of them together in the layout and set the alignment to be 4 MB to correspond to the GPFS block size. If we don't do this performance is dire. Do we have to do anything else?

Having said this, our Lustre experience is larger than our GPFS experience, so any ideas are welcome. We have tried asking support but we haven't had many ideas from DDN either. We do know that with simple IOR tests GPFS is great, but not with our real data at the moment.

Thanks for the ADIOS pointers. Where is a good link to start from to find out about it?

Cheers,

Nick Rees
Principal Software Engineer Phone: +44 (0)1235-778430
Diamond Light Source Fax: +44 (0)1235-446713

···

-------- Original message --------
From: Rob Latham <robl@mcs.anl.gov>
Date:
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

On Thu, Sep 19, 2013 at 08:43:48AM +0000, nick.rees@diamond.ac.uk wrote:

I have been following this thread with interest since we have the same issue in the synchrotron community, with new detectors generating 100's-1000's of 2D frames/sec and total rates approaching 10 GB/sec using multiple parallel 10 GbE streams from different detector nodes. What we have found is:

- Lustre is better at managing the pHDF5 contention between nodes than GPFS is.
- GPFS is better at streaming data from one node, if there is no contention.
- Having the nodes write to separate files is better than using pHDF5 to enable all nodes to write to one.

I would wager a tasty beverage or a box of donuts that the reason you
seen poor performance with GPFS to a shared file is because your
writes are not aligned to file system block boundaries. On large HPC
systems, the MPI-IO layer will often take care of that file system
block boundary alignment for you -- *if* you turn on collective I/O.

If you are using independent POSIX i/o then there won't be much HDF5
or MPI-IO can do to help you out.

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.

An additional benefit is the data can be compressed, so data rates may be able to be reduced drastically by compression, depending on your situation.

You're proposing something akin to ADIOS, except the interface
continues to be the community-standard HDF5. how interesting!

This approach will make it impossible to benefit from several
collective MPI-I/O optimizations, but it does open the door to another
family of optimizations (one would likely trawl the many ADIOS
publications for ideas).

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd.
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom

Mohamad_Chaarawi · September 20, 2013, 2:01pm

Hi Nathanael,

I'll try and spend some time looking at the patch. Thanks for sharing!

This sounds like you are optimizing your checkpointing phase.
Is there any advantage from doing this rather than using PLFS?

Mohamad

···

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of huebbe
Sent: Friday, September 20, 2013 6:34 AM
To: hdf-forum@lists.hdfgroup.org
Cc: Julian Kunkel
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

On 09/19/2013 10:43 AM, nick.rees@diamond.ac.uk wrote:

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.

As a matter of fact, this is pretty much what we did already for our own
research: We, too, patched the HDF5 library to provide writing of multiple files and reading them back in a way entirely transparent to the application. You can find our patch, along with a much more detailed description, on our website:
http://www.wr.informatik.uni-hamburg.de/research/projects/icomex/multifilehdf5

On our system, we could actually see an improvement in wall-clock time for the entire process of writing-reconstructing-reading as opposed to writing to a shared file and reading it single stream. This may be different on other systems, but at least we expect a huge benefit in CPU-time since the multifile approach allows the parallel part of the workflow to be fast.

Of course, we are very interested to hear about other people's experiences with transparent multifiles.

Cheers,
Nathanael Hübbe

John_Biddiscombe · October 21, 2013, 7:52pm

Just replying to myself with an update for general info...

One of the GSS GPFS servers was not performing as expected and after some maintenance on the system, I have rerun the h5perf benchmarks and I now get a peak of 41GB/s - which is more like 90% of IOR results (though I am not sure what the new peak for IOR is to be honest as kit might be higher, but we have 5 GSS with 8.x GB/s each, so anything over 40 is close to maxing out the system).

JB

···

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Biddiscombe, John A.
Sent: 15 October 2013 10:56
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

Nathanael

sorry for a very late reply. I had couple of weeks away.

For this set of tests, each process generated 16MB of data so the file would be 16MB on 1 core, 1 node, and rise to 16*32*4096MB = 2TB on 32 cores of 4096 nodes etc etc. I can't remember the largest on this config.

you can think of this as a kind of weak scaling. but it's not so easy to strong scale when memory per node is limited. I have another test which allocates 2GB per node and then writes it out from 1 core, 2 cores etc etc and does this for 1 node up to 4096. I was going to run this to see if it made any real difference, but it ought to be quite similar except maybe slightly better for the low core/node counts. My scratch filesystem has been wiuped so I've lost all my tests and will have to recompile everything.

JB

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of huebbe
Sent: 01 October 2013 12:58
To: hdf-forum@lists.hdfgroup.org<mailto:hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

Hello John,
these graphs really look interesting. But I have one question remaining: How much data did every process write in one go? Was this datasize fixed for a single process? Or was the total amount of data written in each round fixed? I. e. did you use weak or strong scaling?

Cheers,
Nathanael Hübbe

On 09/27/2013 04:58 PM, Biddiscombe, John A. wrote:

Rob,

Over the last couple of days, I've been able to rerun tests (here using h5perf) with the bglockless flag

export BGLOCKLESSMPIO_F_TYPE=0x47504653

and the results are greatly improved. Attached is one page of plots where we get up to 30GB/s which compares to just over 40 with IOR, so in the right range compared to expectations.

The difference that one flag can make is quite impressive. People need to know this!

Thanks

JB

[cid:image001.jpg@01CECEA7.D7E989F0]

-----Original Message-----

From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf

Of Biddiscombe, John A.

Sent: 20 September 2013 21:47

To: HDF Users Discussion List

Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single

(shared) file

Rob

Thanks for the info regarding settings and IOR config etc I wil go through that

in detail over the next few days.

I plan on taking a crash course in debugging on BG/Q ASAP, my skills in this

regard are little better than printf and I'm going to need to do some profiling

and stepping through code to see what's going on inside hdf5.

Just FYI. I run a simple test which writes data out and I set it going using this

loop, which generates slurm submission scripts for me and passes a ton of

options to my test. So the scripts run jobs on all node counts and

procspercore count from 1-64. Since the machine is not yet in production, I

can get a lot of this done now.

for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 do

for NPERNODE in 1 2 4 8 16 32 64

do

write_script (...options)

done

cmake - yes, I'm also compiling with clang, I'm not trying to make anything

easy for myself here

JB

> -----Original Message-----

> From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On

> Behalf Of Rob Latham

> Sent: 20 September 2013 17:03

> To: HDF Users Discussion List

> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> single

> (shared) file

>

> On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:

> > This morning, I did some poking around and found that the cmake

> > based configure of hdf has a nasty bug that causes H5_HAVE_GPFS to

> > be set to false and no GPFS optimizations are compiled in (libgpfs

> > is not detected). Having tweaked that, you can imagine my happiness

> > when I recompiled everything and now I'm getting even worse

Bandwidth.

>

> Thanks for the report on those hints. HDF5 contains, outside of

> gpfs-specific benchmarks, one of the few implementations of all the

> gpfs_fcntl() tuning parameters. Given your experience, probably best

> to turn off those hints.

>

> Also, cmake works on bluegene? wow. Don't forget that bluegene

> requires cross compliation.

>

> > In fact if I enable collective IO, the app coredumps on me, so the

> > situations is worse than I had feared. I'm using too much memory in

> > my test I suspect and collectives are pushing me over the limit. The

> > only test I can run with collective enabled is the one that uses

> > only one rank and writes 16MB!

>

> How many processes per node are you using on your BGQ? if you are

> loading up with 64 procs per node, that will give each one about

> 200-230 MiB of scratch space.

>

> I wonder if you have built some or all of your hdf5 library for the

> front end nodes, and some or none for the compute nodes?

>

> How many processes are you running here?

>

> A month back I ran some one-rack experiments:

>

Dropbox - mira_hinted_api_compare.png - Simplify your life

> e.png

>

> Here's my IOR config file. Note two tuning parameters here:

> - "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low

> for Blue Gene /Q

> - the 'bglockless' prefix is "robl's secret turbo button". it was fun

> to pull that rabbit out of the hat... for the first few years.

> (it's not the default because in one specific case performance is

> shockingly poor).

>

> IOR START

> numTasks=65536

> repetitions=3

> reorderTasksConstant=1024

> fsync=1

> transferSize=6M

> blockSize=6M

> collective=1

> showHints=1

> hintsFileName=IOR-hints-bg_nodes_pset.64

>

> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> api.mpi

> api=MPIIO

> RUN

> api=HDF5

>

> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> api.h5

> RUN

> api=NCMPI

>

> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-

> api.nc

> RUN

> IOR STOP

>

> > Rob : you mentioned some fcntl functions were deprecated etc. do I

> > need to remove these to stop the coredumps? (I'm very much hoping

> > something has gone wrong with my tests because the performance is

> > shockingly bad ... ) (NB. my Version is 1.8.12-snap17)

>

> Unless you are running BGQ system software driver V1R2M1, the

> gpfs_fcntl hints do not get forwarded to storage, and return an error.

> It's possible HDF5 responds to that error with a core dump?

>

> ==rob

>

> > JB

> >

> > > -----Original Message----- From: Hdf-forum

> > > [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel

> > > Langr Sent: 20 September 2013 13:46 To: HDF Users Discussion List

> > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using

> > > single (shared) file

> > >

> > > Rob,

> > >

> > > thanks a lot for hints. I will look at the suggested option and

> > > try some experiments with it :).

> > >

> > > Daniel

> > >

> > > Dne 17. 9. 2013 15:34, Rob Latham napsal(a):

> > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:

> > > >> separate files: 1.36 [s] single file, 1 stripe: 133.6 [s]

> > > >> single file, best result: 17.2 [s]

> > > >>

> > > >> (I did multiple runs with various combinations of strip count

> > > >> and size, presenting the best results I have obtained.)

> > > >>

> > > >> Increasing the number of stripes obviously helped a lot, but

> > > >> comparing with the separate-files strategy, the writing time is

> > > >> still more than ten times slower . Do you think it is "normal"?

> > > >

> > > > It might be "normal" for Lustre, but it's not good. I wish I

> > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I do

not.

> > > > The ADIOS folks report tuned-HDF5 to a single shared file runs

> > > > about 60% slower than ADIOS to multiple files, not 10x slower,

> > > > so it seems there is room for improvement.

> > > >

> > > > I've asked them about the kinds of things "tuned HDF5" entails,

> > > > and they didn't know (!).

> > > >

> > > > There are quite a few settings documented in the intro_mpi(3)

> > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most

> > > > important thing you can try. I'm sorry to report that in my

> > > > limited experience, the documentation and reality are sometimes

> > > > out of sync, especially with respect to which settings are

> > > > default or not.

> > > >

> > > > ==rob

> > > >

> > > >> Thanks, Daniel

> > > >>

> > > >> Dne 30. 8. 2013 16:05, Daniel Langr napsal(a):

> > > >>> I've run some benchmark, where within an MPI program, each

> > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.

> > > >>> I've used the following writing strategies:

> > > >>>

> > > >>> 1) each process writes to its own file, 2) each process writes

> > > >>> to the same file to its own dataset, 3) each process writes to

> > > >>> the same file to a same dataset.

> > > >>>

> > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size

> > > >>> 1024), and I've tested 2)-3) for both independent/collective

> > > >>> options of the MPI driver. I've also used 3 different clusters

> > > >>> for measurements (all quite modern).

> > > >>>

> > > >>> As a result, the running (storage) times of the same-file

> > > >>> strategy, i.e. 2) and 3), were of orders of magnitudes longer

> > > >>> than the running times of the separate-files strategy. For

> > > >>> illustration:

> > > >>>

> > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of

> > > >>> data, fixed data sets:

> > > >>>

> > > >>> 1) separate files: 2.73 [s] 2) single file, independent calls,

> > > >>> separate data sets: 88.54[s]

> > > >>>

> > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of

> > > >>> data, chunked data sets (chunk size 1024):

> > > >>>

> > > >>> 1) separate files: 10.40 [s] 2) single file, independent

> > > >>> calls, shared data sets: 295 [s] 3) single file, collective

> > > >>> calls, shared data sets: 3275 [s]

> > > >>>

> > > >>> Any idea why the single-file strategy gives so poor writing

> > > >>> performance?

> > > >>>

> > > >>> Daniel

> > > >>

> > > >> _______________________________________________ Hdf-

> forum is for

> > > >> HDF software users discussion.

> > > >> Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > >> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists

> > > >> .h

> > > >> dfgr

> > > >> oup.org

> > > >

> > >

> > > _______________________________________________ Hdf-

forum

> is for HDF

> > > software users discussion. Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hd

> > > fg

> > > roup.org

> >

> > _______________________________________________ Hdf-forum

is

> for HDF

> > software users discussion. Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfg

> > ro

> > up.org

>

> --

> Rob Latham

> Mathematics and Computer Science Division Argonne National Lab, IL USA

>

> _______________________________________________

> Hdf-forum is for HDF software users discussion.

> Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgro

> up.org

_______________________________________________

Hdf-forum is for HDF software users discussion.

Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

_______________________________________________

Hdf-forum is for HDF software users discussion.

Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>

http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

matthieu.brucher · September 20, 2013, 2:38pm

Hi,

I also saw something similar using one file versus several. In the
end, I changed the writing stage completely. Instead of splitting the
file in n segments (n being the number of node I was using), I now
split it in segments of 1MB (my FS stripe count) and gather on "stripe
count" processes the needed parts. Then whenever a segment is ready to
be written, the corresponding process writes it at the proper position
with a non collective call.
This way, I am sure that the "chunk" being written has the proper size
(i.e. = stripe size).
My only worry now is that on a parallel FS, it could return more or
less immediately. At the moment, it seems to flush whatever was
written before returning the call. I guess that on a parallel
filesystem, this could be avoided, especially if you are writing non
overlapping chunk (byte-level locking, so no worry to have?).

Cheers,

Matthieu Brucher

···

2013/9/20 Mohamad Chaarawi <chaarawi@hdfgroup.org>:

Hi Nathanael,

I'll try and spend some time looking at the patch. Thanks for sharing!

This sounds like you are optimizing your checkpointing phase.
Is there any advantage from doing this rather than using PLFS?

Mohamad

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of huebbe
Sent: Friday, September 20, 2013 6:34 AM
To: hdf-forum@lists.hdfgroup.org
Cc: Julian Kunkel
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

On 09/19/2013 10:43 AM, nick.rees@diamond.ac.uk wrote:

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.

As a matter of fact, this is pretty much what we did already for our own
research: We, too, patched the HDF5 library to provide writing of multiple files and reading them back in a way entirely transparent to the application. You can find our patch, along with a much more detailed description, on our website:
http://www.wr.informatik.uni-hamburg.de/research/projects/icomex/multifilehdf5

On our system, we could actually see an improvement in wall-clock time for the entire process of writing-reconstructing-reading as opposed to writing to a shared file and reading it single stream. This may be different on other systems, but at least we expect a huge benefit in CPU-time since the multifile approach allows the parallel part of the workflow to be fast.

Of course, we are very interested to hear about other people's experiences with transparent multifiles.

Cheers,
Nathanael Hübbe

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Music band: http://liliejay.com/

juliankunkel · September 20, 2013, 2:46pm

Hi Mohamad,
thanks for your quick response.

Aim of our modifications were to demonstrate optimal parallel write
performance, e.g. as achieved by IOR / POSIX, is possible. We could
use it not only for checkpointing but for all periodic output of the
climate models.
Therefore, we chose to realize a lightweight and easy solution.

Currently, on our parallel file system, performance degradation of
parallel I/O and some high-level libraries leads to the absurd
situation in which scientists use sequential I/O for performance
reasons.
We have not compared performance to PLFS due to time restrictions in
the project and we really believe a general solution, i.e. fixing the
underlying file systems is important.
I think indeed the PLFS approach is a good general solution.

There are differences and minor drawbacks of using PLFS compared to
HDF5/multifile.
1) PLFS requires to deal with an extra mountpoint which must be
managed by someone. This modification here just requires to select the
driver (it could be also a drop-in replacement of the existing driver
for a library, thus no modifications would be needed).
2) Due to FUSE, PLFS adds overhead compared to direct operations,
which is not the case for the multifile approach. The post-mortem
conversion takes time, but the conversion can be done by only one
thread reducing required CPU resources drastically.
3) Due to the involvement of VFS and POSIX, the access pattern given
by PLFS to the parallel file system underneath may be suboptimal. Our
file system is very sensible to the access pattern, therefore we
decided to have full control of the pattern without rewriting I/O
handling in more complex software.
4) Once the file has been accessed for reading once it (should be)
bit-identical with a regular HDF5 file thus it can be used without any
library / file system modifications...
5) Finally a very important one: FUSE does not work under AIX

If you are interested we could continue investigating the results on
another platform.

Thanks & Best regards,
julian

···

2013/9/20 Mohamad Chaarawi <chaarawi@hdfgroup.org>:

Hi Nathanael,

I'll try and spend some time looking at the patch. Thanks for sharing!

This sounds like you are optimizing your checkpointing phase.
Is there any advantage from doing this rather than using PLFS?

Mohamad

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of huebbe
Sent: Friday, September 20, 2013 6:34 AM
To: hdf-forum@lists.hdfgroup.org
Cc: Julian Kunkel
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file

On 09/19/2013 10:43 AM, nick.rees@diamond.ac.uk wrote:

What we are doing is working with The HDF Group to define a work package dubbed "Virtual Datasets" where you can have a virtual dataset in a master file which is composed of datasets in underlying files. It is a bit like extending the soft-link mechanism to allow unions. The method of mapping the underlying datasets onto the virtual dataset is very flexible and so we hope it can be used in a number of circumstances. The two main requirements are:

- The use of the virtual dataset is transparent to any program reading the data later.
- The writing nodes can write their files independently, so don't need pHDF5.

As a matter of fact, this is pretty much what we did already for our own
research: We, too, patched the HDF5 library to provide writing of multiple files and reading them back in a way entirely transparent to the application. You can find our patch, along with a much more detailed description, on our website:
http://www.wr.informatik.uni-hamburg.de/research/projects/icomex/multifilehdf5

On our system, we could actually see an improvement in wall-clock time for the entire process of writing-reconstructing-reading as opposed to writing to a shared file and reading it single stream. This may be different on other systems, but at least we expect a huge benefit in CPU-time since the multifile approach allows the parallel part of the workflow to be fast.

Of course, we are very interested to hear about other people's experiences with transparent multifiles.

Cheers,
Nathanael Hübbe

--
http://wr.informatik.uni-hamburg.de/people/julian_kunkel

robl · September 20, 2013, 3:20pm

Or bluegene (well, you could run ZeptoOS I suppose...)

==rob

···

On Fri, Sep 20, 2013 at 04:46:47PM +0200, Julian Kunkel wrote:

5) Finally a very important one: FUSE does not work under AIX

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

Very poor performance of pHDF5 when using single (shared) file