Over the last couple of days, I've been able to rerun tests (here using h5perf) with the bglockless flag
and the results are greatly improved. Attached is one page of plots where we get up to 30 GB/s, which compares to just over 40 GB/s with IOR, so it is in the right range relative to expectations.
The difference that one flag can make is quite impressive. People need to know this!
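For anyone who wants to try this: as far as I can tell there is no h5perf switch for it; the selection happens in ROMIO when the file name handed to the MPI-IO layer carries the "bglockless:" prefix (the same thing Rob's IOR config does with testFile below). A minimal sketch of doing that through HDF5's MPI-IO driver; the path is only a placeholder.

/* Sketch (not the actual benchmark): create a parallel HDF5 file through the
 * MPI-IO driver with the ROMIO "bglockless:" prefix on the file name.
 * The path below is a placeholder. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* ROMIO strips the "bglockless:" prefix and selects its lock-free
     * GPFS driver instead of the default one. */
    hid_t file = H5Fcreate("bglockless:/gpfs/scratch/example.h5",
                           H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create datasets and write here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}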
···
-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf
Of Biddiscombe, John A.
Sent: 20 September 2013 21:47
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single
(shared) file
Rob
Thanks for the info regarding settings and the IOR config etc. I will go through that
in detail over the next few days.
I plan on taking a crash course in debugging on BG/Q ASAP; my skills in this
regard are little better than printf, and I'm going to need to do some profiling
and step through code to see what's going on inside HDF5.
Just FYI, I run a simple test which writes data out, and I set it going using the
loop below, which generates SLURM submission scripts for me and passes a ton of
options to my test. The scripts run jobs on all node counts and
ranks-per-node counts from 1 to 64. Since the machine is not yet in production, I
can get a lot of this done now.
for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096; do
  for NPERNODE in 1 2 4 8 16 32 64; do
    write_script ...   # generate and submit a SLURM job for this node count / ranks-per-node pair (options elided)
  done
done
cmake - yes, and I'm also compiling with clang; I'm not trying to make anything
easy for myself here.
JB
> -----Original Message-----
> From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On
> Behalf Of Rob Latham
> Sent: 20 September 2013 17:03
> To: HDF Users Discussion List
> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> single
> (shared) file
>
> On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:
> > This morning, I did some poking around and found that the CMake-based
> > configure of HDF5 has a nasty bug that causes H5_HAVE_GPFS to be set
> > to false, so no GPFS optimizations are compiled in (libgpfs is not
> > detected). Having tweaked that, you can imagine my happiness when I
> > recompiled everything and now I'm getting even worse bandwidth.
>
> Thanks for the report on those hints. HDF5 contains, outside of
> gpfs-specific benchmarks, one of the few implementations of all the
> gpfs_fcntl() tuning parameters. Given your experience, probably best
> to turn off those hints.
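For context, the gpfs_fcntl() tuning calls in question look roughly like the sketch below. Field names follow IBM's gpfs_fcntl.h; the file descriptor and byte range are placeholders, and the exact hints HDF5 issues may differ.

/* Rough sketch of a GPFS access-range hint of the kind HDF5's GPFS code issues.
 * Requires gpfs_fcntl.h and -lgpfs; fd/start/length are placeholders. */
#include <gpfs_fcntl.h>
#include <stdio.h>

static void give_access_range_hint(int fd, long long start, long long length)
{
    struct {
        gpfsFcntlHeader_t hdr;
        gpfsAccessRange_t range;
    } hint;

    hint.hdr.totalLength   = sizeof(hint);
    hint.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
    hint.hdr.fcntlReserved = 0;

    hint.range.structLen  = sizeof(hint.range);
    hint.range.structType = GPFS_ACCESS_RANGE;
    hint.range.start      = start;
    hint.range.length     = length;
    hint.range.isWrite    = 1;   /* we intend to write this byte range */

    /* On a BG/Q software stack that does not forward the hint, the call
     * simply fails; the caller should be prepared to tolerate that. */
    if (gpfs_fcntl(fd, &hint) != 0)
        perror("gpfs_fcntl(GPFS_ACCESS_RANGE)");
}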
>
> Also, cmake works on bluegene? Wow. Don't forget that bluegene
> requires cross compilation.
>
> > In fact, if I enable collective I/O, the app core dumps on me, so the
> > situation is worse than I had feared. I suspect I'm using too much
> > memory in my test, and collectives are pushing me over the limit. The
> > only test I can run with collective enabled is the one that uses
> > only one rank and writes 16 MB!
>
> How many processes per node are you using on your BGQ? If you are
> loading up with 64 procs per node, that will give each one about
> 200-230 MiB of scratch space.
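[For reference, assuming the standard 16 GB of RAM per BG/Q node: 16 GiB / 64 ranks = 256 MiB per rank, and once the CNK, the application image, and MPI buffers are accounted for, roughly 200-230 MiB of that remains for data.]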
>
> I wonder if you have built some or all of your hdf5 library for the
> front end nodes, and some or none for the compute nodes?
>
> How many processes are you running here?
>
> A month back I ran some one-rack experiments:
>
> [Dropbox link: mira_hinted_api_compare.png]
>
> Here's my IOR config file. Note two tuning parameters here:
> - "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low
> for Blue Gene /Q
> - the 'bglockless' prefix is "robl's secret turbo button". it was fun
> to pull that rabbit out of the hat... for the first few years.
> (it's not the default because in one specific case performance is
> shockingly poor).
>
> IOR START
> numTasks=65536
> repetitions=3
> reorderTasksConstant=1024
> fsync=1
> transferSize=6M
> blockSize=6M
> collective=1
> showHints=1
> hintsFileName=IOR-hints-bg_nodes_pset.64
>
> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.mpi
> api=MPIIO
> RUN
> api=HDF5
>
> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.h5
> RUN
> api=NCMPI
>
> testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.nc
> RUN
> IOR STOP
>
>
> > Rob: you mentioned some fcntl functions were deprecated etc. Do I
> > need to remove these to stop the core dumps? (I'm very much hoping
> > something has gone wrong with my tests, because the performance is
> > shockingly bad...) (NB: my version is 1.8.12-snap17)
>
> Unless you are running BGQ system software driver V1R2M1, the
> gpfs_fcntl hints do not get forwarded to storage, and return an error.
> It's possible HDF5 responds to that error with a core dump?
>
> ==rob
>
>
> > JB
> >
> > > -----Original Message-----
> > > From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel Langr
> > > Sent: 20 September 2013 13:46
> > > To: HDF Users Discussion List
> > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file
> > >
> > > Rob,
> > >
> > > Thanks a lot for the hints. I will look at the suggested option and
> > > try some experiments with it :).
> > >
> > > Daniel
> > >
> > >
> > >
> > > On 17 September 2013 at 15:34, Rob Latham wrote:
> > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > >> separate files: 1.36 [s]
> > > >> single file, 1 stripe: 133.6 [s]
> > > >> single file, best result: 17.2 [s]
> > > >>
> > > >> (I did multiple runs with various combinations of stripe count
> > > >> and size, presenting the best results I have obtained.)
> > > >>
> > > >> Increasing the number of stripes obviously helped a lot, but
> > > >> compared with the separate-files strategy, the writing time is
> > > >> still more than ten times slower. Do you think it is "normal"?
> > > >
> > > > It might be "normal" for Lustre, but it's not good. I wish I
> > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
> > > > The ADIOS folks report tuned-HDF5 to a single shared file runs
> > > > about 60% slower than ADIOS to multiple files, not 10x slower,
> > > > so it seems there is room for improvement.
> > > >
> > > > I've asked them about the kinds of things "tuned HDF5" entails,
> > > > and they didn't know (!).
> > > >
> > > > There are quite a few settings documented in the intro_mpi(3)
> > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most
> > > > important thing you can try. I'm sorry to report that in my
> > > > limited experience, the documentation and reality are sometimes
> > > > out of sync, especially with respect to which settings are
> > > > default or not.
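For anyone experimenting along these lines, one way to pass such settings is as MPI-IO hints through HDF5's file access property list. A minimal sketch follows; the hint names are standard ROMIO/Cray ones, but the values are placeholders to experiment with, not recommendations, and MPICH_MPIIO_CB_ALIGN itself is normally set as an environment variable at job launch rather than in code.

/* Sketch: pass MPI-IO hints (Lustre striping, collective buffering) to HDF5
 * through the file access property list. Values are placeholders. */
#include <mpi.h>
#include <hdf5.h>

hid_t make_tuned_fapl(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Info_set(info, "striping_factor", "48");      /* Lustre stripe count (honoured at file creation) */
    MPI_Info_set(info, "striping_unit",   "4194304"); /* Lustre stripe size in bytes */
    MPI_Info_set(info, "romio_cb_write",  "enable");  /* force collective buffering on writes */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    MPI_Info_free(&info);   /* HDF5/MPI keep their own copy of the hints */
    return fapl;
}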
> > > >
> > > > ==rob
> > > >
> > > >> Thanks, Daniel
> > > >>
> > > >> On 30 August 2013 at 16:05, Daniel Langr wrote:
> > > >>> I've run a benchmark where, within an MPI program, each
> > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.
> > > >>> I've used the following writing strategies:
> > > >>>
> > > >>> 1) each process writes to its own file,
> > > >>> 2) each process writes to the same file, to its own dataset,
> > > >>> 3) each process writes to the same file, to the same dataset.
> > > >>>
> > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size
> > > >>> 1024), and I've tested 2)-3) for both independent/collective
> > > >>> options of the MPI driver. I've also used 3 different clusters
> > > >>> for measurements (all quite modern).
> > > >>>
> > > >>> As a result, the running (storage) times of the same-file
> > > >>> strategy, i.e. 2) and 3), were orders of magnitude longer
> > > >>> than the running times of the separate-files strategy. For
> > > >>> illustration:
> > > >>>
> > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of
> > > >>> data, fixed data sets:
> > > >>>
> > > >>> 1) separate files: 2.73 [s]
> > > >>> 2) single file, independent calls, separate data sets: 88.54 [s]
> > > >>>
> > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of
> > > >>> data, chunked data sets (chunk size 1024):
> > > >>>
> > > >>> 1) separate files: 10.40 [s]
> > > >>> 2) single file, independent calls, shared data sets: 295 [s]
> > > >>> 3) single file, collective calls, shared data sets: 3275 [s]
> > > >>>
> > > >>> Any idea why the single-file strategy gives such poor writing
> > > >>> performance?
> > > >>>
> > > >>> Daniel
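To make concrete what strategy 3) with collective calls amounts to in HDF5 terms, here is a minimal sketch: every rank writes its own contiguous block of one shared 1D dataset through a collective transfer property list. The dataset name, element count, and datatype are placeholders, not the benchmark's actual settings.

/* Sketch of strategy 3): all ranks write disjoint blocks of one shared 1D
 * dataset using collective MPI-IO transfers. Sizes and names are placeholders. */
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

#define N_PER_RANK (1 << 20)   /* elements written by each rank (placeholder) */

void write_shared_dataset(const char *fname, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1] = { (hsize_t)nprocs * N_PER_RANK };
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own contiguous hyperslab of the shared dataset. */
    hsize_t start[1] = { (hsize_t)rank * N_PER_RANK };
    hsize_t count[1] = { N_PER_RANK };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    double *buf = malloc(N_PER_RANK * sizeof(double));
    for (hsize_t i = 0; i < N_PER_RANK; i++) buf[i] = (double)rank;

    /* Collective transfer; with H5FD_MPIO_INDEPENDENT here instead, this
     * becomes the "independent calls" variant of the same strategy. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
}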
> > > >>
> > > >
> > >
> >
>
> --
> Rob Latham
> Mathematics and Computer Science Division, Argonne National Lab, IL USA
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org