HDF5 and GPFS optimizations

Rob,

Did you make any significant discoveries or progress regarding the GPFS tweaks on BG systems? Our machine will be open for use within the next week or so and I'd like to begin some profiling. I'd be interested in knowing if you have discovered any useful facts that I ought to know about.

I'm concerned about how much the --enable-gpfs option is able to 'know' about the system (can we easily find out what the option does?). According to my superficial understanding of the BG architecture, it seems that since the compute nodes have IO calls forwarded off to the IO nodes by kernel-level routines, collective operations performed by HDF5 might actually reduce the effectiveness of the IO by forcing the data to be shuffled around twice instead of once. Am I thinking along the right lines?

Ta

JB

We're exploring ways to get better MPI-IO performance out of our Blue
Gene systems running GPFS. HDF5 happens to have a nice collection of
GPFS-specific optimizations if you --enable-gpfs.

Before I spend much time experimenting with those options, I was
curious if anyone's tried them with recent (gpfs-3.4 or gpfs-3.5)
versions of GPFS. I suspect they still work (the GPFS-specific
ioctls, I mean: I'm sure HDF5's implementation of them is fine), but
would like to hear others' experiences.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Rob,

Did you make any significant discoveries or progress regarding the GPFS tweaks on BG systems? Our machine will be open for use within the next week or so and I'd like to begin some profiling. I'd be interested in knowing if you have discovered any useful facts that I ought to know about.

An upcoming driver update (I don't know which one) will allow the Blue
Gene compute nodes to send the gpfs_fcntl commands all the way through
to the GPFS file system (presently the gpfs_fcntl commands return "not
supported"). Then, we can do some experiments to see if they still
provide any benefit at Blue Gene scales (the optimizations are 15
years old at this point, designed when "massively parallel system"
meant 32 nodes).

More generally, I've found that some of the default MPI-IO settings
are probably not ideal for /Q, and have tested/suggested a change to
the "number of I/O aggregators" defaults.

Meanwhile, ALCF (the folks who operate the machine) have been working
with IBM to improve the state of collective I/O. Seems like we're
making some progress there as well.

I'm concerned about how much the --enable-gpfs option is able to
'know' about the system (can we easily find out what the option
does?). According to my superficial understanding of the BG
architecture, it seems that since the compute nodes have IO calls
forwarded off to the IO nodes by kernel-level routines, collective
operations performed by HDF5 might actually reduce the effectiveness
of the IO by forcing the data to be shuffled around twice instead of
once. Am I thinking along the right lines?

The --enable-gpfs option will attempt to do a few things:

gpfs_access_range
gpfs_free_range

This is the "multiple access range" hint, which tells GPFS "hey, don't
grab a lock on the whole file. instead, just these sections". I
*think* this is going to be one of the better improvements remaining.

gpfs_clear_file_cache
gpfs_invalidate_file_cache

Good for benchmarking. Ejects all entries from the gpfs page pool.

gpfs_cancel_hints

just resets things

gpfs_start_data_shipping
gpfs_start_data_ship_map
gpfs_stop_data_shipping

Unfortunately, GPFS-3.5 does not support data shipping any longer.

I still think these hints need to be implemented in the MPI-IO
library, if they still help at all, but if one is being pragmatic one
might more easily deploy the hints through HDF5.
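
For what it's worth, here is roughly what the access-range hint looks like if you call gpfs_fcntl() by hand; this is a sketch from my reading of gpfs_fcntl.h, so check the struct fields against the headers on your GPFS release:

    #include <gpfs_fcntl.h>

    /* Declare to GPFS that we will only touch [start, start+length) of the
       file, so it need not take byte-range locks on the whole thing.
       Link with -lgpfs. */
    static int declare_access_range(int fd, long long start, long long length,
                                    int is_write)
    {
        struct {
            gpfsFcntlHeader_t hdr;
            gpfsAccessRange_t acc;
        } hint;

        hint.hdr.totalLength   = sizeof(hint);
        hint.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
        hint.hdr.fcntlReserved = 0;

        hint.acc.structLen  = sizeof(hint.acc);
        hint.acc.structType = GPFS_ACCESS_RANGE;
        hint.acc.start      = start;
        hint.acc.length     = length;
        hint.acc.isWrite    = is_write;

        /* 0 on success; -1 with errno set otherwise. */
        return gpfs_fcntl(fd, &hint);
    }

The cache-clearing hints follow the same header-plus-struct pattern, just with gpfsClearFileCache_t and GPFS_CLEAR_FILE_CACHE instead.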

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

To be honest, I do not have much knowledge of what HDF5 does for GPFS-specific optimizations. Someone else can jump in and fill in this information.

But I do know that HDF5 does not reshuffle data around; it uses MPI-IO for that. So I'm guessing if you do not want data to be reshuffled with ROMIO's two-phase I/O, just use independent I/O. Or use a different collective I/O algorithm if available. I'm guessing the best choice would be to use a ROMIO GPFS driver? I'm not sure if one exists (Rob can answer that question), but you can always write one yourself :)
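
To illustrate, the choice is just a dataset transfer property, so you can flip it per write call. A sketch (assumes a parallel HDF5 build and a dataset/dataspaces created elsewhere):

    #include <hdf5.h>

    /* Write a dataset either collectively or independently, depending on the
       `collective` flag. Independent I/O skips ROMIO's two-phase exchange
       entirely. */
    static herr_t write_slab(hid_t dset, hid_t memspace, hid_t filespace,
                             const double *buf, int collective)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                          : H5FD_MPIO_INDEPENDENT);
        herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                                 dxpl, buf);
        H5Pclose(dxpl);
        return status;
    }

And if you want to keep the collective H5Dwrite call but switch off ROMIO's two-phase pass, you can usually set the "romio_cb_write" and "romio_cb_read" hints to "disable" on the MPI_Info you pass to H5Pset_fapl_mpio.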

Thanks,
Mohamad


Rob

Thanks very much for this info. I've been reading the manuals and getting up to speed with the system. I've set some benchmarks running for parallel IO using multiple datasets, compound data types, etc.

when you say ...

More generally, I've found that some of the default MPI-IO settings are
probably not ideal for /Q, and have tested/suggested a change to the
"number of I/O aggregators" defaults.

Do you mean aggregators inside ROMIO, or GPFS itself? I was under the impression that on BG/Q machines (which is what I'm targeting), the IO was shipped to the IO nodes, which performed aggregation anyway. This is what I was referring to when I said "shuffling data twice": there's no point in HDF5/MPI-IO performing collective IO if this task is already being done by the OS. Am I to understand that the IO nodes don't natively do a very good job of it and need some assistance?

thanks

JB


Rob

Thanks very much for this info. I've been reading the manuals and getting up to speed with the system. I've set some benchmarks running for parallel IO using multiple datasets, compound data types etc etc.

when you say ...

> More generally, I've found that some of the default MPI-IO settings are
> probably not ideal for /Q, and have tested/suggested a change to the
> "number of I/O aggregators" defaults.

Do you mean aggregators inside ROMIO, or GPFS itself?

I'm speaking about the MPI-IO (ROMIO) library. For Blue Gene, the code
hasn't changed too much since /L. Our /Q has 64x more parallelism per
node than /L, so one can imagine the assumptions made in 2004 might
need to be updated :>

Some of that is simple tuning of defaults. We're also talking with
IBM guys about some more substantial ROMIO changes.
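
One easy sanity check on your machine: open a file with plain MPI-IO and dump the hints that actually took effect; "cb_nodes" tells you how many aggregators the defaults picked. A quick sketch:

    #include <mpi.h>
    #include <stdio.h>

    /* Print the MPI-IO hints that were actually applied to a file. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "probe.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        int rank, nkeys, i;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Info info;
        MPI_File_get_info(fh, &info);
        MPI_Info_get_nkeys(info, &nkeys);

        for (i = 0; rank == 0 && i < nkeys; i++) {
            char key[MPI_MAX_INFO_KEY], value[256];
            int flag;
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get(info, key, 255, value, &flag);
            if (flag)
                printf("%s = %s\n", key, value);
        }

        MPI_Info_free(&info);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }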

I was under the impression that on BG/Q machines (which is what I'm targeting), the IO was shipped to the IO nodes, which performed aggregation anyway. This is what I was referring to when I said "shuffling data twice": there's no point in HDF5/MPI-IO performing collective IO if this task is already being done by the OS. Am I to understand that the IO nodes don't natively do a very good job of it and need some assistance?

The I/O nodes on Blue Gene have never been sophisticated. They relay
system calls. The end. No re-ordering, no coalescing, no caching
(OK, GPFS has a page pool on the I/O node, but that's GPFS doing the
caching, not the I/O node daemon, so I make a distinction).

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Rob,

Thanks. I understand the issues better now. I'll let you fix ROMIO then, and I'll get on with a VOL plugin for shipping data off to our BGAS nodes, bypassing the current drivers, so I don't have to worry about some of those issues ...

JB
