Slow writing parallel HDF5 performance (for only one variable)

Hi,

I'm having a hard time trying to figure out what could be causing the
slow I/O behaviour that I see: the same code (in Fortran) run in three
different clusters behaves pretty similarly in terms of I/O times, except
for one of the variables in the code, where I get two orders of
magnitude slower writes on one of the machines (last timing data in the
e-mail). So I hope that somebody with more in-depth knowledge of
Parallel HDF5 can give me a hand with it.

This is the situation. Our code writes two types of variables to file:
the first type is 3D variables that have been decomposed with a 3D
decomposition across different processors, and I use hyperslabs to
select where each part should go. Using arrays of size 200x200x200 that
have been decomposed across 64 processors, I get similar times for the
reading and writing routines (each file 794MB) in the three clusters
that I have access to:

Cluster 1:

------
READING 0.1231E+01
WRITING 0.1600E+01

Cluster 2:
------
READING 0.1973E+01
WRITING 0.2544E+01

Cluster 3:
-----
READING 0.1274E+01
WRITING 0.5895E+01

As you can see there is some variation, but I would be happy with this
sort of behaviour.

The other type of data that I write to disk consists of the outside
layers of the 3D cube. So, for example, in the 200x200x200 cube above, I
have six outside layers, two in each dimension. The depth of these layers
can vary, but in this example I'm using 24 cells, so the X layers would
in this case be 24x200x200. But for each of these layers I need to save
24 variables, so in reality I end up with 4D arrays. In this particular
example, the outside layers in the X dimension are 4D arrays of size
24x200x200x24, in Y 200x24x200x24, and in Z 200x200x24x24.
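
To make the layout concrete, here is a minimal, self-contained sketch of how one
such X-layer 4D array could be written collectively, with one hyperslab selection
per rank (this is not our actual code; the file name, dataset name and the 2x2
decomposition of the 200x200 plane over 4 ranks are only illustrative):

program write_xlayer_sketch
  use mpi
  use hdf5
  implicit none

  integer :: mpierr, hdferr, comm, myrank
  integer(HID_T)   :: file_id, fapl, dset_id, filespace, memspace, dxpl
  integer(HSIZE_T) :: dims(4), counts(4), offset(4)
  real, allocatable :: buf(:,:,:,:)

  call MPI_Init(mpierr)
  comm = MPI_COMM_WORLD
  call MPI_Comm_rank(comm, myrank, mpierr)
  call h5open_f(hdferr)

  ! open the file with the MPI-IO driver
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
  call h5pset_fapl_mpio_f(fapl, comm, MPI_INFO_NULL, hdferr)
  call h5fcreate_f('xlayer.h5', H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl)

  ! global X-layer dataset: 24 x 200 x 200 x 24
  dims = (/ 24_HSIZE_T, 200_HSIZE_T, 200_HSIZE_T, 24_HSIZE_T /)
  call h5screate_simple_f(4, dims, filespace, hdferr)
  call h5dcreate_f(file_id, 'pmlx', H5T_NATIVE_REAL, filespace, dset_id, hdferr)

  ! illustrative decomposition: 4 ranks, each owning a 24 x 100 x 100 x 24 block
  counts = (/ 24_HSIZE_T, 100_HSIZE_T, 100_HSIZE_T, 24_HSIZE_T /)
  offset = (/ 0_HSIZE_T, int(mod(myrank,2)*100, HSIZE_T), &
              int((myrank/2)*100, HSIZE_T), 0_HSIZE_T /)
  allocate(buf(24,100,100,24)); buf = real(myrank)

  ! select this rank's hyperslab in the file space and write collectively
  call h5screate_simple_f(4, counts, memspace, hdferr)
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, counts, hdferr)
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, hdferr)
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, hdferr)
  call h5dwrite_f(dset_id, H5T_NATIVE_REAL, buf, counts, hdferr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl)

  call h5pclose_f(dxpl, hdferr);      call h5sclose_f(memspace, hdferr)
  call h5sclose_f(filespace, hdferr); call h5dclose_f(dset_id, hdferr)
  call h5pclose_f(fapl, hdferr);      call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program write_xlayer_sketch

(Run it with exactly 4 MPI ranks; in the real code only the processors that own a
piece of a given layer select a hyperslab, and the offsets come from the Cartesian
decomposition.)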

So now the fun begins. If I tell my code to only save the X outside
layers, I end up with files of 1.2GB and the times in the 3 clusters
where I've been running these tests are:

Cluster 1:
-----
READING 0.1270E+01
WRITING 0.2088E+01

Cluster 2:
-----
READING 0.2214E+01
WRITING 0.3826E+01

Cluster 3:
-----
READING 0.1279E+01
WRITING 0.7138E+01

If I only save the outside layers in Y, I get also 1.2GB files, and the times:

Cluster 1:
-----
READING 0.1207E+01
WRITING 0.1832E+01

Cluster 2:
-----
READING 0.1606E+01
WRITING 0.3895E+01

Cluster 3:
-----
READING 0.1264E+01
WRITING 0.6670E+01

But if I ask to only save the outside layers in Z, I also get 1.2GB
files, but the times are:

Cluster 1:
-----
READING 0.7905E+00
WRITING 0.2190E+01

Cluster 2:
-----
READING 0.1856E+01
WRITING 0.8722E+02

Cluster 3:
-----
READING 0.1252E+01
WRITING 0.2372E+03

What can be so different about the Z dimension to get I/O behaviours so
different in the three clusters? (Needless to say the code is exactly
the same, the input data is exactly the same...)

Any pointers are more than welcome,
--
Ángel de Vicente
http://www.iac.es/galeria/angelv/

Hi,

Angel de Vicente <angelv@iac.es> writes:

I'm having a hard time trying to figure out what could be causing the
slow I/O behaviour that I see: the same code (in Fortran) run in three
different clusters behaves pretty similarly in terms of I/O times, except
for one of the variables in the code, where I get two orders of
magnitude slower writes on one of the machines (last timing data in the
e-mail). So I hope that somebody with more in-depth knowledge of
Parallel HDF5 can give me a hand with it.

Regarding this issue, I extracted only the relevant parts of the code,
and I can reproduce the behaviour I was explaining in the previous
e-mail with a very simple code, which you can see at:

http://pastebin.com/HjnS82Gp

(to compile it I just do h5pfc -o phdf5write timing.f90 phdf5write.f90,
where timing.f90 is a timing routine by Arjen Markus, available at
http://pastebin.com/480RmNET)

The code generates a 4D array with the relevant sizes and writes the
relevant part of the data to a file in three modes: PMLX, PMLY, and
PMLZ (these replicate the ranks that I would get for the X, Y and Z
planes respectively when creating a Cartesian topology with the relevant
MPI routines).

The clusters where I run this code all have 16 cores per node. When I
run this test code on only 8 cores (nblocks set to 2), the three
clusters behave similarly and there is no penalty for writing the
PMLZ. When I run it on 64 cores, only PMLZ is heavily penalized: very
badly in one cluster and badly in the other. It looks like there is
some contention issue with the parallel file system when the cores
span a number of nodes, but I certainly don't understand why it only
affects the PMLZ variable and not PMLY, and why one of the clusters
doesn't seem to be affected.

Is there something in the code that is calling for trouble?
Any ideas/pointers/suggestions?

Thanks a lot,

--
Ángel de Vicente
http://www.iac.es/galeria/angelv/

Can you be more specific about the hardware and the software you are using for each case (especially for the “very bad” case)?
What architecture?
Parallel file system type?
What compiler/mpi type and version?
What version of HDF?

These are the timings for your program on GPFS using hdf5 trunk, xlf compiler, mpich 3.1.1. I don’t see a large difference in writing times between datasets.

8 cores:

Timing report:

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.2100E+00      0.2000E+00     0.2100E+00  0.2100E+00
WRITINGPMLY            1            0.1600E+00      0.1600E+00     0.1600E+00  0.1600E+00
WRITINGPMLZ            1            0.1600E+00      0.1600E+00     0.1600E+00  0.1600E+00

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.4500E+00      0.4500E+00     0.4500E+00  0.4500E+00
WRITINGPMLY            1            0.4000E+00      0.4000E+00     0.4000E+00  0.4000E+00
WRITINGPMLZ            1            0.4400E+00      0.4500E+00     0.4400E+00  0.4400E+00

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.1470E+01      0.1460E+01     0.1470E+01  0.1470E+01
WRITINGPMLY            1            0.1580E+01      0.1580E+01     0.1580E+01  0.1580E+01
WRITINGPMLZ            1            0.1730E+01      0.1730E+01     0.1730E+01  0.1730E+01

1024 cores:

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.5118E+02      0.5118E+02     0.5118E+02  0.5118E+02
WRITINGPMLY            1            0.5228E+02      0.5228E+02     0.5228E+02  0.5228E+02
WRITINGPMLZ            1            0.5296E+02      0.5296E+02     0.5296E+02  0.5296E+02

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.5185E+02      0.5185E+02     0.5185E+02  0.5185E+02
WRITINGPMLY            1            0.5543E+02      0.5543E+02     0.5543E+02  0.5543E+02
WRITINGPMLZ            1            0.5675E+02      0.5675E+02     0.5675E+02  0.5675E+02

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.5035E+02      0.5035E+02     0.5035E+02  0.5035E+02
WRITINGPMLY            1            0.5739E+02      0.5739E+02     0.5739E+02  0.5739E+02
WRITINGPMLZ            1            0.5174E+02      0.5175E+02     0.5174E+02  0.5174E+02


Hi Scot,

thanks for trying this out.

Scot Breitenfeld <brtnfld@hdfgroup.org> writes:

Can you be more specific about the hardware and the software you are
using for each case (especially for the “very bad” case)?
What architecture?
Parallel file system type?
What compiler/mpi type and version?
What version of HDF?

If we focus on the "very bad" case:

Hardware:
+ Each node has 2x E5-2670 SandyBridge-EP chips, for a total of 16 cores
  per node
+ Network is Infiniband
+ Parallel file system: GPFS

As per the software versions:
+ Intel compilers, version: 13.0.1 20121010
+ Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130507
+ HDF version: HDF5 1.8.10

The "good" case:

Hardware:
+ Each node has 2x E5-2680 SandyBridge chips, for a total of 16 cores
  per node
+ Network is Infiniband
+ Parallel file system: Lustre

Software:
+ Intel compilers, version 14.0.3 20140422
+ BullXMPI, which AFAIK is a fork of Open MPI, version 1.2.7.2
+ HDF version: HDF5 1.8.9

These are the timings for your program on GPFS using hdf5 trunk, xlf
compiler, mpich 3.1.1. I don’t see a large difference in writing times
between datasets.

These timings look really good, but how did you run the 1024-core one?
I mean, the code in Pastebin assumes that it will be run with 64 cores
(nblocks = 4), so I guess for the 8-core run you set that to nblocks =
2. And for 1024 cores?

Again, thanks a lot for your help. Any pointer appreciated,
Ángel de Vicente


If we focus on the "very bad" case:

Hardware:
+ Each node has 2x E5-2670 SandyBridge-EP chips, for a total of 16 cores
   per node
+ Network is Infiniband
+ Parallel file system: GPFS

As per the software versions:
+ Intel compilers, version: 13.0.1 20121010
+ Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130507
+ HDF version: HDF5 1.8.10

Intel's MPI library does not have any explicit optimizations for GPFS, but the one optimization you need for GPFS is to align writes to the file system block size.

You can do this with an MPI-IO hint: set "striping_unit" to your GPFS block size (you can determine the GPFS block size via 'stat -f': see the 'Block size:' field).

Setting an MPI-IO hint via HDF5 requires setting up your file access property list appropriately: you will need a non-null INFO parameter to H5Pset_fapl_mpio:

  http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFaplMpio

In C, it's like this:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_unit", "8388608"); /* or whatever your GPFS block size actually is */
H5Pset_fapl_mpio(fapl, comm, info);
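
In Fortran (which your code uses) the equivalent is roughly the following sketch; it
assumes 'use hdf5' and 'use mpi' are in scope, and the 8 MiB value is only a placeholder
for your actual GPFS block size:

integer :: info, mpierror, hdferr
integer(HID_T) :: fapl

call MPI_Info_create(info, mpierror)
call MPI_Info_set(info, "striping_unit", "8388608", mpierror)  ! your GPFS block size, in bytes
call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, info, hdferr)
! ... then pass fapl as access_prp to h5fcreate_f / h5fopen_f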

If you're with me so far, I think you'll see much better parallel write performance once the MPI-IO library is trying harder to align writes.

Are you familiar with the Darshan statistics tool? You can use it to confirm whether or not you are hitting unaligned writes.

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

My mistake, results for 300:

nblocks = 2, run on 8 cores (1 thread per core), 1 node.

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.8300E+00      0.8300E+00     0.8300E+00  0.8300E+00
WRITINGPMLY            1            0.8100E+00      0.8100E+00     0.8100E+00  0.8100E+00
WRITINGPMLZ            1            0.7500E+00      0.7400E+00     0.7500E+00  0.7500E+00

nblocks = 4, run on 64 cores (1 thread per core), 4 nodes.

Timer          Number iterations  Mean real time  Mean CPU time     Minimum     Maximum
                                       (s)             (s)            (s)         (s)
WRITINGPMLX            1            0.7800E+00      0.7700E+00     0.7800E+00  0.7800E+00
WRITINGPMLY            1            0.1020E+01      0.1020E+01     0.1020E+01  0.1020E+01
WRITINGPMLZ            1            0.8500E+00      0.8500E+00     0.8500E+00  0.8500E+00

I’m not sure if that is too terrible. I did not use any MPI-IO hints.

An example in Fortran for MPI hints is:

INTEGER :: info, mpierror, hdferr
INTEGER(HID_T) :: plist_id

CALL MPI_Info_create(info, mpierror)
CALL MPI_Info_set(info, "IBM_largeblock_io", "true", mpierror)

CALL h5pset_fapl_mpio_f(plist_id, MPI_COMM_WORLD, info, hdferr)

You can change the quoted hint strings to Rob's suggestions, and, as Rob said, running it with Darshan should help with tuning:

http://www.mcs.anl.gov/research/projects/darshan/

Scot


Hi,

Rob Latham <robl@mcs.anl.gov> writes:

Intel's MPI library does not have any explicit optimizations for GPFS, but the
one optimization you need for GPFS is to align writes to the file system block
size.

you can do this with an MPI-IO hint: set "striping_unit" to your gpfs block
size (you can determine the gpfs block size via 'stat -f': see the 'Block size:'
field.

Setting an MPI-IO hint via HDF5 requires setting up your file access property
list appropriately: you will need a non-null INFO parameter to H5Pset_fapl_mpio

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFaplMpio

in C, it's like this:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_unit", "8388608") ;
/* or whatever your GPFS block size actually is*/
H5Pset_fapl_mpio(fapl, comm, info);

If you're with me so far, I think you'll see much better parallel write
performance once the MPI-IO library is trying harder to align writes.

I'm going to try your suggestion and will report back, but do you think
this could explain the different performance for PMLX, PMLY, and PMLZ?
In the sample code in pastebin, the data written to file in each of
these cases only differs by who is writing it (in the case of PMLX,
processors that in a Cartesian decomposition would be in the smallest X
plane, PMLY in the smallest Y plane, and PMLZ those in the smallest Z
plane), but the amount of data, the distribution of that data in memory
(for each processor), and the place where it is stored in the actual
file are the same...

Are you familiar with the Darshan statistics tool? you can use it to confirm
you are hitting (or not) unaligned writes.

Not really. Only heard about it, but I will try it. This issue is
proving pretty hard to figure out, and it is a real bottleneck for our
code, so I will try anything...

Thanks a lot,

--
Ángel de Vicente
http://www.iac.es/galeria/angelv/

If you're with me so far, I think you'll see much better parallel write
performance once the MPI-IO library is trying harder to align writes.

I'm going to try your suggestion and will report back, but do you think
this could explain the different performance for PMLX, PMLY, and PMLZ?
In the sample code in pastebin, the data written to file in each of
these cases only differs by who is writing it (in the case of PMLX,
processors that in a Cartesian decomposition would be in the smallest X
plane, PMLY in the smallest Y plane, and PMLZ those in the smallest Z
plane), but the amount of data, the distribution of that data in memory
(for each processor), and the place where it is stored in the actual
file is the same...

There are two things that might be happening when the layout and the file system interact.

For some decompositions, MPI-IO might not even use collective I/O. (Except on Blue Gene.) ROMIO will check for "interleave" -- if each process already accesses a contiguous region, there's little benefit to two-phase. Or at least that's what we thought 15 years ago; it's a little more complicated today...

Some decompositions might introduce a "hole": in order to carry out a partial update, ROMIO will "data sieve" the request, instead of updating piece by piece. ROMIO will read into a buffer, update the regions, then write out a large contiguous request. Often this is a good optimization, but sometimes the holes are so small that the overhead of the read outweighs the benefits.

Are you familiar with the Darshan statistics tool? you can use it to confirm
you are hitting (or not) unaligned writes.

Not really. Only heard about it, but I will try it. This issue is
proving pretty hard to figure out, and it is a real bottleneck for our
code, so I will try anything...

Yeah, I'm getting off into the woods here, so a tool like Darshan can help you answer the low-level details I'm bugging you about.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi,

resurrecting an old thread...

Rob Latham <robl@mcs.anl.gov> writes:

Are you familiar with the Darshan statistics tool? you can use it to confirm
you are hitting (or not) unaligned writes.

Not really. Only heard about it, but I will try it. This issue is
proving pretty hard to figure out, and it is a real bottleneck for our
code, so I will try anything...

Yeah, i'm getting off into the woods here, so a tool like Darshan can help you
answer the low-level details I'm bugging you about.

at last I got the chance to try Darshan.

Just as a reminder, I have this problem when writing data that comes
from a 4D array to an HDF5 file collectively. The code that shows the
problem (not everywhere, but very badly in the particular cluster I'm
using right now) is attached (phdf5write.f90). As it is, the code is
meant to run on 64 processors. The global data to be written to the file
is a 4D array of 100x100x24x12. The code runs with 64 processors and
each has a 4D array of dimensions 25x25x24x12. When writing to the file,
only 16 processors dump their data to the file. Those 16 processors dump
their whole data while the other processors dump none. The only things
that change between the three possible "modes" of the code are which
processors do the writing, and the offsets in the file. So far, I have
managed to run it without any issues on the CURIE cluster, where all
modes behave similarly and (for this particular case) the writing of the
files takes about a second. But in two local clusters I run into big
problems for mode 3 (PMLZ).

Until now I only knew that this third mode (PMLZ) took much longer
(about two orders of magnitude more). Now with Darshan I see that
something weird is going on... Modes 1 and 2 are very similar in the
time they take and in the Darshan reports, but mode 3 is completely
weird to me. For starters, it says that the code spends a lot of time
READING files, doing metadata operations, ... while the code only
writes data. With the hope that someone more experienced than me with
I/O issues can shed some light on this issue, I attach the Darshan
reports for modes 1 and 3. Any help/pointers much appreciated.

phdf5write.f90 (8 KB)

pr1e1c02_phdf5write_id8896_10-24-39119-10735803981073517258_1.pdf (59.7 KB)

pr1e1c02_phdf5write_id10933_10-24-39138-12980417180950970533_1.pdf (60.6 KB)

Hi,

Angel de Vicente <angelv@iac.es> writes:

Until now I only knew that this third mode (PMLZ) took much longer
(about two orders of magnitude more). Now with Darshan I see that
something weird is going on... Modes 1 and 2 are very similar in the
time they take and in the Darshan reports, but mode 3 is completely
weird to me. For starters, it says that the code spends a lot of time
READING files, doing Metadata operations, ... while the code only
writes data. With the hope that someone more experienced than me with
I/O issues can shed some light into this issue, I attach the Darshan
reports for Mode 1 and 3. Any help/pointers much appreciated.

Extra info: I ran the code with Darshan on the CURIE cluster, where the
code behaves nicely. Exactly the same code, run in exactly the same way,
produces a very nice Darshan I/O report (attached) for mode 3. Any
hypothesis on why the other cluster is behaving so badly with this
particular case?

Thanks a lot for any help,

devicea_phdf5write_id2236305_10-24-55157-17510796315312663656_1.pdf (59.2 KB)


If you see a lot of reads in your write-only workloads, it suggests that a "data sieving" optimization is kicking in. If there are only partial updates to a block of data, then something will read the whole block, update the changed bits, and write the new block out. I'm vague about where the optimization happens because RAID devices, file systems, and the ROMIO MPI-IO implementation could all be doing this.

With collective I/O, you can transform your workload into something more contiguous and less likely to trigger data sieving, but it can still happen.

You can pass MPI-IO hints to HDF5 to turn off collective I/O *and* turn off data sieving -- this is the way the Lustre folks got good performance in the 2008-ish time frame. It could either help you a lot or hurt you a lot. I cannot tell you more over email.
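
For example (just a sketch; these are ROMIO hint names, the surrounding declarations and
error handling are omitted, and whether disabling these actually helps depends entirely
on your access pattern and file system):

call MPI_Info_create(info, mpierror)
call MPI_Info_set(info, "romio_cb_write", "disable", mpierror)  ! collective buffering off for writes
call MPI_Info_set(info, "romio_ds_read",  "disable", mpierror)  ! data sieving off for reads
call MPI_Info_set(info, "romio_ds_write", "disable", mpierror)  ! data sieving off for writes
call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, info, hdferr)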

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi,

Rob Latham <robl@mcs.anl.gov> writes:

If you see a lot of read in your write-only workloads, it suggests that a "data
sieving" optimization is kicking in. If there are only partial updates to a
block of data, then something will read the whole block, update the changed
bits, and write the new block out. I'm vague about where the optimization
happens because raid devices, file systems, and the ROMIO MPI-IO implementation
could all be doing this.

With Collective I/O, you can transform your workload into something more
contiguous and less likely to trigger data sieving, but it can still happen.

You can pass MPI-IO hints to hdf5 to turn off collective I/O *and* turn off data
sieving -- this is the way Lustre folks got good performance in the 2008-ish
time frame. it could either help you a lot or hurt you a lot. I cannot tell
you more over email.

Thanks. I'm not sure if these could be tuned a bit better, but with the
following hints the problem is all gone in the two problematic clusters
(for a given file size, one of the writing modes of the program was
taking about 200x more time; with these hints all is back to normal,
and the problematic mode takes just the same time as the other ones).

call MPI_Info_create(info, error)
call MPI_Info_set(info,"IBM_largeblock_io","true", error)
call MPI_Info_set(info,"stripping_unit","4194304", error)
CALL MPI_INFO_SET(info,"H5F_ACS_CORE_WRITE_TRACKING_PAGE_SIZE_DEF","524288",error)
CALL MPI_INFO_SET(info,"ind_rd_buffer_size","41943040", error)
CALL MPI_INFO_SET(info,"ind_wr_buffer_size","5242880", error)
CALL MPI_INFO_SET(info,"romio_ds_read","disable", error)
CALL MPI_INFO_SET(info,"romio_ds_write","disable", error)
CALL MPI_INFO_SET(info,"romio_cb_write","enable", error)
CALL MPI_INFO_SET(info,"cb_buffer_size","4194304", error)

For the moment, problem solved. Thanks a lot,

--
Ángel de Vicente
http://www.iac.es/galeria/angelv/

thanks. I'm not sure if these could be tuned a bit better, but with the
following hints the problem is all gone in the two problematic clusters
(for a given file size, one of the writing modes of the program was
taking about ~200x more time. With these hints all is back to normal,
and the problematic mode takes just the same time as the other ones).

You can pass anything you want for the "key": implementations will ignore hints they do not understand. For the sake of anyone googling in the future, I will explain what, if anything, the hints you passed in do:

call MPI_Info_create(info, error)
call MPI_Info_set(info,"IBM_largeblock_io","true", error)

This hint is useful for IBM PE platforms and tells GPFS you are about to do large I/O. Over time, this hint will become less useful: IBM is moving away from their own MPI-IO implementation and incorporating ROMIO.

call MPI_Info_set(info,"stripping_unit","4194304", error)

This one is probably the biggest help. In collective I/O, ROMIO splits up the file into "file domains" (and assigns those domains to a subset of processors called I/O aggregators). When the "striping_unit" hint is set, ROMIO will align those file domains to that striping_unit.

Sometimes, like on Blue Gene, ROMIO will detect the file system block size for you, and this hint is not needed. No harm in providing it, though.

CALL MPI_INFO_SET(info,"H5F_ACS_CORE_WRITE_TRACKING_PAGE_SIZE_DEF","524288",error)

I don't think this hint does anything.

CALL MPI_INFO_SET(info,"ind_rd_buffer_size","41943040", error)
CALL MPI_INFO_SET(info,"ind_wr_buffer_size","5242880", error)
CALL MPI_INFO_SET(info,"romio_ds_read","disable", error)
CALL MPI_INFO_SET(info,"romio_ds_write","disable", error)

No harm here, but if you are going to disable data sieving (romio_ds_read and romio_ds_write) then there's no reason to tweak the independent read and write buffer sizes.

CALL MPI_INFO_SET(info,"romio_cb_write","enable", error)

On many platforms (but not Blue Gene), ROMIO will look at the access pattern. If the pattern is not interleaved, ROMIO will not use collective buffering. At today's scale, collective buffering is almost always a win, especially on GPFS when combined with striping_unit.

CALL MPI_INFO_SET(info,"cb_buffer_size","4194304", error)

This buffer size might actually be a bit small, depending on how much data you are writing/reading. If you have memory to spare, increasing this value is often a good way to improve performance.
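
Putting those notes together, a trimmed-down hint block (just a sketch, not a validated
configuration; the block size and buffer size are illustrative and should match your file
system block size and memory budget) would look something like:

call MPI_Info_create(info, error)
! align ROMIO's file domains to the GPFS block size (note the key is "striping_unit")
call MPI_Info_set(info, "striping_unit",  "4194304",  error)
! turn off data sieving for reads and writes
call MPI_Info_set(info, "romio_ds_read",  "disable",  error)
call MPI_Info_set(info, "romio_ds_write", "disable",  error)
! force collective buffering on writes, with a larger aggregation buffer
call MPI_Info_set(info, "romio_cb_write", "enable",   error)
call MPI_Info_set(info, "cb_buffer_size", "16777216", error)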

For the moment, problem solved. Thanks a lot,

Tuning these stacks is honestly way harder than it should be. Thanks for your persistence.

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Rob,

I found your explanation very helpful, at least to me.
Are there documents listing all recognized hints by IBM and/or ROMIO?
Also, can the xxx_size hints recognize something like “40MB” instead of “41943040”?
(Of course, there is this ambiguity whether MB means 2^20 or 10^6.)

-Albert Cheng


Rob,

I found your explanation very helpful, at least to me.
Are there documents listing all recognized hints by IBM and/or ROMIO?

Well... sort of

IBM hints: (wow, these are hard to google! -- here's an older set of documentation)
http://www-01.ibm.com/support/knowledgecenter/SSFK3V_1.3.0/com.ibm.cluster.pe.v1r3.pe500.doc/am107_ifopen.htm?lang=en

Cray hints: the intro_mpi man page on whatever system you are on is the authoritative one, but you can find web copies of older versions like this one:

https://fs.hlrs.de/projects/craydoc/docs/man/xe_mptm/51/cat3/intro_mpi.3.html

ROMIO hints:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html

OpenMPI hints: same as ROMIO hints, unless you are using OMPIO in which case I don't think any hints are supported (OMPIO uses MCA parameters)

> Also, can the xxx_size hints recognize something like “40MB” instead of “41943040”?
> (Of course, there is this ambiguity whether MB means 2^20 or 10^6.)

That certainly seems like a nice usability enhancement. I think some of the software engineering I did a couple of years ago should make this easier to implement... but it's probably not a huge priority, sorry.

https://trac.mpich.org/projects/mpich/ticket/2197

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA