The difference that one flag can make is quite impressive. People need to know this!
Oh John Oh John.... I cannot tell you how angry that flag makes me!
'bglockless:' was supposed to be a short-term hack. It was written
for the PVFS file system (which did not support fcntl()-style locks,
or any locks at all for that matter). Then we found out it helped
GPFS on BlueGene too.
I'm going to have to just sit down for a couple-five days and send IBM
a patch removing all the locks from the default driver, and telling
anyone who wants to run MPI-IO to an NFS file system on a blue gene to
take a hike.
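
For anyone who wants to try it from HDF5 code: the prefix is just
prepended to the file name that HDF5 hands down to MPI-IO, so no API
changes are needed. A minimal sketch (the path is illustrative only):

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* route HDF5 I/O through the MPI-IO (ROMIO) driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* the "bglockless:" prefix selects ROMIO's lock-free path;
       the file name below is only an example */
    hid_t file = H5Fcreate("bglockless:/gpfs/somewhere/example.h5",
                           H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
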
Thanks for the graphs. I was surprised to see that fewer than 8 cores
per node resulted in slightly *worse* performance for collective I/O.
==rob
On Fri, Sep 27, 2013 at 02:58:32PM +0000, Biddiscombe, John A. wrote:
[inline image attachment: image001.jpg]
> -----Original Message-----
> From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf
> Of Biddiscombe, John A.
> Sent: 20 September 2013 21:47
> To: HDF Users Discussion List
> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single
> (shared) file
>
> Rob
>
> Thanks for the info regarding settings, IOR config, etc. I will go through
> that in detail over the next few days.
>
> I plan on taking a crash course in debugging on BG/Q ASAP; my skills in this
> regard are little better than printf, and I'm going to need to do some
> profiling and stepping through code to see what's going on inside HDF5.
>
> Just FYI: I run a simple test which writes data out, and I set it going using
> the loop below, which generates SLURM submission scripts for me and passes a
> ton of options to my test. The scripts run jobs on all node counts and
> ranks-per-node counts from 1-64. Since the machine is not yet in production,
> I can get a lot of this done now.
>
> for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096; do
>   for NPERNODE in 1 2 4 8 16 32 64; do
>     write_script (...options)
>   done
> done
>
> cmake - yes, I'm also compiling with clang; I'm not trying to make anything
> easy for myself here.
>
> JB
>
> > -----Original Message-----
> > From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On
> > Behalf Of Rob Latham
> > Sent: 20 September 2013 17:03
> > To: HDF Users Discussion List
> > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> > single
> > (shared) file
> >
> > On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:
> > > This morning, I did some poking around and found that the cmake-based
> > > configure of HDF5 has a nasty bug that causes H5_HAVE_GPFS to be set
> > > to false, so no GPFS optimizations are compiled in (libgpfs is not
> > > detected). Having tweaked that, you can imagine my happiness when I
> > > recompiled everything and now I'm getting even worse bandwidth.
> >
> > Thanks for the report on those hints. HDF5 contains, outside of
> > gpfs-specific benchmarks, one of the few implementations of all the
> > gpfs_fcntl() tuning parameters. Given your experience, probably best
> > to turn off those hints.
> >
> > Also, cmake works on bluegene? Wow. Don't forget that bluegene
> > requires cross compilation.
> >
> > > In fact, if I enable collective IO, the app coredumps on me, so the
> > > situation is worse than I had feared. I suspect I'm using too much
> > > memory in my test, and collectives are pushing me over the limit. The
> > > only test I can run with collective enabled is the one that uses
> > > only one rank and writes 16MB!
> >
> > How many processes per node are you using on your BGQ? If you are
> > loading up with 64 procs per node, that will give each one about
> > 200-230 MiB of scratch space (a BG/Q node has 16 GiB of memory, so 64
> > ranks get roughly 256 MiB each before the OS and MPI runtime take
> > their share).
> >
> > I wonder if you have built some or all of your hdf5 library for the
> > front end nodes, and some or none for the compute nodes?
> >
> > How many processes are you running here?
> >
> > A month back I ran some one-rack experiments:
> >
> > https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compare.png
> >
> > Here's my IOR config file. Note two tuning parameters here:
> > - "bg_nodes_pset", which showed up on Blue Gene /L, is way way too low
> > for Blue Gene /Q
> > - the 'bglockless' prefix is "robl's secret turbo button". it was fun
> > to pull that rabbit out of the hat... for the first few years.
> > (it's not the default because in one specific case performance is
> > shockingly poor).
> >
> > IOR START
> > numTasks=65536
> > repetitions=3
> > reorderTasksConstant=1024
> > fsync=1
> > transferSize=6M
> > blockSize=6M
> > collective=1
> > showHints=1
> > hintsFileName=IOR-hints-bg_nodes_pset.64
> >
> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.mpi
> > api=MPIIO
> > RUN
> > api=HDF5
> >
> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.h5
> > RUN
> > api=NCMPI
> >
> > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.nc
> > RUN
> > IOR STOP
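> >
> > For reference: roughly the same hints can be fed to an HDF5 program
> > through the MPI-IO file access property list instead of IOR's
> > hintsFileName. A minimal sketch (the hint values below are
> > illustrative, not a recommendation):
> >
> > #include <mpi.h>
> > #include <hdf5.h>
> >
> > /* Build a file access property list that carries MPI-IO hints. */
> > static hid_t make_hinted_fapl(void)
> > {
> >     MPI_Info info;
> >     MPI_Info_create(&info);
> >     MPI_Info_set(info, "bg_nodes_pset", "64");      /* Blue Gene aggregator hint */
> >     MPI_Info_set(info, "romio_cb_write", "enable"); /* collective buffering for writes */
> >
> >     hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
> >     H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);   /* the fapl keeps its own copy */
> >     MPI_Info_free(&info);
> >     return fapl;
> > }
> >
> > Pass the returned fapl to H5Fcreate()/H5Fopen() and the hints travel
> > down to ROMIO with the file.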
> >
> >
> > > Rob: you mentioned some fcntl functions were deprecated, etc. Do I
> > > need to remove these to stop the coredumps? (I'm very much hoping
> > > something has gone wrong with my tests, because the performance is
> > > shockingly bad ...) (NB: my version is 1.8.12-snap17)
> >
> > Unless you are running BGQ system software driver V1R2M1, the
> > gpfs_fcntl hints do not get forwarded to storage, and return an error.
> > It's possible HDF5 responds to that error with a core dump?
> >
> > ==rob
> >
> >
> > > JB
> > >
> > > > -----Original Message-----
> > > > From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Daniel Langr
> > > > Sent: 20 September 2013 13:46
> > > > To: HDF Users Discussion List
> > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file
> > > >
> > > > Rob,
> > > >
> > > > Thanks a lot for the hints. I will look at the suggested option and
> > > > try some experiments with it :).
> > > >
> > > > Daniel
> > > >
> > > >
> > > >
> > > > On 17. 9. 2013 15:34, Rob Latham wrote:
> > > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > > >> separate files: 1.36 [s]
> > > > >> single file, 1 stripe: 133.6 [s]
> > > > >> single file, best result: 17.2 [s]
> > > > >>
> > > > >> (I did multiple runs with various combinations of strip count
> > > > >> and size, presenting the best results I have obtained.)
> > > > >>
> > > > >> Increasing the number of stripes obviously helped a lot, but
> > > > >> compared with the separate-files strategy, the writing time is
> > > > >> still more than ten times longer. Do you think it is "normal"?
> > > > >
> > > > > It might be "normal" for Lustre, but it's not good. I wish I
> > > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I
> > > > > do not.
> > > > > The ADIOS folks report tuned-HDF5 to a single shared file runs
> > > > > about 60% slower than ADIOS to multiple files, not 10x slower,
> > > > > so it seems there is room for improvement.
> > > > >
> > > > > I've asked them about the kinds of things "tuned HDF5" entails,
> > > > > and they didn't know (!).
> > > > >
> > > > > There are quite a few settings documented in the intro_mpi(3)
> > > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most
> > > > > important thing you can try. I'm sorry to report that in my
> > > > > limited experience, the documentation and reality are sometimes
> > > > > out of sync, especially with respect to which settings are
> > > > > default or not.
> > > > >
> > > > > ==rob
> > > > >
> > > > >> Thanks, Daniel
> > > > >>
> > > > >> On 30. 8. 2013 16:05, Daniel Langr wrote:
> > > > >>> I've run a benchmark where, within an MPI program, each
> > > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.
> > > > >>> I've used the following writing strategies:
> > > > >>>
> > > > >>> 1) each process writes to its own file,
> > > > >>> 2) each process writes to the same file, to its own dataset,
> > > > >>> 3) each process writes to the same file, to the same dataset.
> > > > >>>
> > > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size
> > > > >>> 1024), and I've tested 2)-3) for both independent/collective
> > > > >>> options of the MPI driver. I've also used 3 different clusters
> > > > >>> for measurements (all quite modern).
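> > > > >>>
> > > > >>> For readers who want the shape of strategy 3 with the
> > > > >>> collective option: it boils down to roughly the sketch below.
> > > > >>> Dataset name, sizes, and the local_data buffer are illustrative
> > > > >>> placeholders, not the actual benchmark code.
> > > > >>>
> > > > >>> static void write_shared_dataset(int rank, int nprocs,
> > > > >>>                                  const double *local_data,
> > > > >>>                                  hsize_t local_n)
> > > > >>> {
> > > > >>>     hsize_t total_n = local_n * (hsize_t)nprocs;
> > > > >>>     hsize_t offset  = local_n * (hsize_t)rank;
> > > > >>>
> > > > >>>     /* every rank opens the same file through the MPI-IO driver */
> > > > >>>     hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
> > > > >>>     H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
> > > > >>>     hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
> > > > >>>
> > > > >>>     /* one shared dataset; each rank selects its own hyperslab */
> > > > >>>     hid_t filespace = H5Screate_simple(1, &total_n, NULL);
> > > > >>>     hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
> > > > >>>                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
> > > > >>>     hid_t memspace = H5Screate_simple(1, &local_n, NULL);
> > > > >>>     H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
> > > > >>>                         &local_n, NULL);
> > > > >>>
> > > > >>>     /* the "collective calls" variant; use H5FD_MPIO_INDEPENDENT
> > > > >>>        for the independent one */
> > > > >>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
> > > > >>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
> > > > >>>     H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local_data);
> > > > >>>
> > > > >>>     H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
> > > > >>>     H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
> > > > >>> }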
> > > > >>>
> > > > >>> As a result, the running (storage) times of the same-file
> > > > >>> strategy, i.e. 2) and 3), were orders of magnitude longer
> > > > >>> than the running times of the separate-files strategy. For
> > > > >>> illustration:
> > > > >>>
> > > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of
> > > > >>> data, fixed data sets:
> > > > >>>
> > > > >>> 1) separate files: 2.73 [s]
> > > > >>> 2) single file, independent calls, separate data sets: 88.54 [s]
> > > > >>>
> > > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of
> > > > >>> data, chunked data sets (chunk size 1024):
> > > > >>>
> > > > >>> 1) separate files: 10.40 [s]
> > > > >>> 2) single file, independent calls, shared data sets: 295 [s]
> > > > >>> 3) single file, collective calls, shared data sets: 3275 [s]
> > > > >>>
> > > > >>> Any idea why the single-file strategy gives such poor writing
> > > > >>> performance?
> > > > >>>
> > > > >>> Daniel
> > > > >>
> > > > >
> > > >
> > >
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division Argonne National Lab, IL USA
> >
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA