HDF5: how to get good performance?

In short: are there things to know about, make sure of, or be aware of to get good performance with P-HDF5?

My understanding is that "before" HDF5 / P-HDF5, each process of an MPI code was compelled to write/read sequential (and separate) files: this stressed the file system, and performance was poor (to keep it short). My understanding is that P-HDF5 (and MPI-IO) were designed to cope with this problem (MPI-IO is more "low-level"; P-HDF5 is built on top of MPI-IO and offers flexibility and ways to structure the file easily).

To test this, I wrote an MPI code with N MPI processes. For each MPI process, the data to write are just a 1D array of integers (initialised with the rank of the MPI process). In the MPI / HDF5 file, the goal is to write these data in ordered bunches, one after the other. First, each MPI process writes its own sequential file (so one gets N separate files at the end); the function used to write is "write" (binary). Then, with MPI-IO, each process writes the same bunch of data into one file containing all the data, bunch after bunch, from all processes (one gets only one file at the end); the function used to write is MPI_File_write. Then I use P-HDF5 to do the same thing as with MPI-IO (so one gets only one file at the end); the function used to write is H5Dwrite. I expected to get better performance with MPI-IO and P-HDF5 than with the sequential approach. The spirit of this test code is very simple / basic: each MPI process writes its own block of data into the same file, or into separate files in the sequential approach (a minimal sketch of the P-HDF5 variant follows the notes below).
Note: in each case (sequential, MPI-IO, P-HDF5), when I say "write data in a file", I mean writing big blocks / bunches of data at once (I do not write data one by one; I write the biggest block of data possible, smaller than 2 GB).
Note: I tried with N = 1, 2, 4, 8, 16.
Note: I generated files (MPI-IO, P-HDF5) whose size scaled from 1 GB to 16 GB (which looks like a "very big" file to me).
Note: I followed the P-HDF5 documentation (use H5P_FILE_ACCESS and H5P_DATASET_XFER property lists + use hyperslab selection "by chunks").
Note: the file system is GPFS (it was installed by the cluster vendor, so it is supposed to be ready to get performance out of P-HDF5 - I am an "application" guy trying to use HDF5, not a sysadmin who would be familiar with the complex details of the file system).
Note: I compiled the HDF5 package like this: "./configure --enable-parallel".
Note: I use CentOS + GNU compilers (for both the HDF5 package and my test code) + hdf5-1.8.13.
Note: I use mpic++ (not the h5pxx compiler wrappers - actually I didn't get why HDF5 provides compilers) to compile my test code; is this a problem?
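
For reference, here is a simplified sketch of the P-HDF5 write path of my test (illustrative, not my exact code; error checks omitted):

```c
/* Sketch: each rank writes `count` integers at offset rank*count
 * into one shared 1D dataset. Error checking omitted for brevity. */
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    hsize_t count = 1024; /* block size per rank (small for the sketch) */
    int *buf = (int *)malloc(count * sizeof(int));
    for (hsize_t i = 0; i < count; i++) buf[i] = rank;

    /* File access property list: open the file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("tstIO.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One shared 1D dataset holding all ranks' blocks, one after the other. */
    hsize_t total = count * (hsize_t)nprocs;
    hid_t filespace = H5Screate_simple(1, &total, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own contiguous block (hyperslab). */
    hsize_t start = count * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    /* Dataset transfer property list (H5P_DATASET_XFER), then the write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}
```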

The problem is: in all cases (sequential, MPI-IO, P-HDF5), I always get the same performance: using P-HDF5 or MPI-IO seems useless!? And I don't get why: this does not seem logical to me. I expected some improvement when using P-HDF5 / MPI-IO, something like this: http://www.speedup.ch/workshops/w37_2008/HDF5-Tutorial-PDF/PSI-HDF5-PARALLEL.pdf (slide 30). For instance I get:
============================ mpirun -n 8 ./tstIO.exe --dataSize 536870912 ============================
INFO : data block = 536870912 integers per MPI proc X 8 MPI procs = 16384 Mb = 16 Gb
Seq. : write time = 5.2239 sec, read time = 8.5331 sec
MPI-IO : write time = 6.3301 sec, read time = 5.9788 sec
P-HDF5 : write time = 5.9695 sec, read time = 6.1289 sec
============================ mpirun -n 8 ./tstIO.exe --dataSize 1073741824 ============================
INFO : data block = 1073741824 integers per MPI proc X 8 MPI procs = 32768 Mb = 32 Gb
Seq. : write time = 10.426 sec, read time = 14.353 sec
MPI-IO : write time = 11.305 sec, read time = 11.908 sec
P-HDF5 : write time = 10.886 sec, read time = 16.943 sec

I understand I cannot get a clear answer to my question as it is not sharp enough; I am posting this to get some clues to make sense of the behavior I observe. Is the "spirit" of the code (comparing sequential, MPI-IO, and P-HDF5 the way I do) unable to show P-HDF5 performance? If so, why ("too simple" a data set? should data be gathered on the master side before being written collectively)? How should I change this? Should I look for a problem in the file system? Or somewhere else? Missing option(s) when configuring the HDF5 package? Are there other things I should or can check to be sure I am in a situation where I can get performance out of P-HDF5? Are there HDF5 tools / tutorials / benchmarks (that I could replay) designed to check performance? [Just before sending this mail I heard about h5perf: I ran h5perf over 4 MPI processes and attached the log.]

Any relevant clue / information would be appreciated. If what I observe is logical, I would just like to understand why, and how / when it is possible to get performance out of P-HDF5.

Thanks for help,

FH

PS: I can provide more information and the code, if needed (?)

h5perf.log (15.1 KB)

In short: are there things to know about, make sure of, or be aware of to get good performance with P-HDF5?

- Turn on collective I/O; it's not enabled by default (a minimal sketch follows this list).

- HDF5 metadata might be a factor if you have very many small datasets, but for most applications it's not important.

- Consult your MPI library for any file-system-specific tuning you might be able to do. For example, Intel MPI needs you to set an environment variable before it will use any of the GPFS or Panasas optimizations it ships.

- Be mindful of type conversions: if your data in memory are 4-byte floats but they are 8-byte doubles on disk, HDF5 will "break collective" and do that I/O independently.
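
Here is a minimal sketch of what "turning on collective I/O" looks like in the C API (the dset / memspace / filespace / buf identifiers are assumed to exist, as in the test code above). Without the H5Pset_dxpl_mpio call, H5Dwrite defaults to independent transfers even when the file was opened with the MPI-IO driver:

```c
/* Sketch: request collective transfers on the dataset transfer
 * property list. Identifiers (dset, memspace, filespace, buf) are
 * assumed to exist as in the test code above. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);
H5Pclose(dxpl);
```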

To test this, I wrote an MPI code. ... I expected to get better performance with MPI-IO and P-HDF5 than with the sequential approach. The spirit of this test code is very simple / basic: each MPI process writes its own block of data into the same file, or into separate files in the sequential approach.

Note: in each case (sequential, MPI-IO, P-HDF5), when I say "write data in a file", I mean writing big blocks / bunches of data at once (I do not write data one by one; I write the biggest block of data possible, smaller than 2 GB).
Note: I tried with N = 1, 2, 4, 8, 16.

In 2014, 16 processes is not very parallel. Serial I/O has many benefits at modest levels of parallelism: caching, mostly.

Note: I generated files (MPI-IO, P-HDF5) whose size scaled from 1 GB to 16 GB (which looks like a "very big" file to me).

That's adequate, yes.

Note: I followed the P-HDF5 documentation (use H5P_FILE_ACCESS and H5P_DATASET_XFER property lists + use hyperslab selection "by chunks").
Note: the file system is GPFS (it was installed by the cluster vendor, so it is supposed to be ready to get performance out of P-HDF5 - I am an "application" guy trying to use HDF5, not a sysadmin who would be familiar with the complex details of the file system).

Now we are getting somewhere.

Note: I compiled the HDF5 package like this: "./configure --enable-parallel".
Note: I use CentOS + GNU compilers (for both the HDF5 package and my test code) + hdf5-1.8.13.
Note: I use mpic++ (not the h5pxx compiler wrappers - actually I didn't get why HDF5 provides compilers) to compile my test code; is this a problem?

The wrappers just make it easier to pick up any libraries needed. I don't use them either, which means sometimes I need to figure out what new library (like -ldl) HDF5 needs.

Any relevant clue / information would be appreciated. If what I observe is logical, I would just like to understand why, and how / when it is possible to get performance out of P-HDF5.

If you are using GPFS, there is one optimization that goes a long way toward improving performance: aligning writes to file system block boundaries. See this email from a few weeks ago:

http://mail.lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2014-July/007963.html
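
For instance, a hedged sketch of setting the alignment on the file access property list (the 4 MiB block size here is an assumption; query your GPFS configuration for the real value):

```c
/* Sketch: align file allocations to the GPFS block size. The 4 MiB
 * block size below is an assumption -- check your file system's
 * actual block size before using it. */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
H5Pset_alignment(fapl, 1048576, 4194304); /* threshold 1 MiB, align 4 MiB */
hid_t file = H5Fcreate("tstIO.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
```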

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

After a lot of testing, my understanding is that:
1. if the data (belonging to each MPI proc.) must be interleaved in the file, then P-HDF5 (and MPI-IO) can significantly reduce the elapsed time spent on I/O;
2. if not (independent data written independently by each MPI proc.), then the P-HDF5 / MPI-IO / sequential approaches are equivalent.

A posteriori, this seems logical to me. Are there other situations where HDF5 may improve the I/O speed-up (reduce elapsed time)?

Franck


After a lot of testing, my understanding is that:
1. if the data (belonging to each MPI proc.) must be interleaved in the file, then P-HDF5 (and MPI-IO) can significantly reduce the elapsed time spent on I/O;
2. if not (independent data written independently by each MPI proc.), then the P-HDF5 / MPI-IO / sequential approaches are equivalent.

A posteriori, this seems logical to me. Are there other situations where HDF5 may improve the I/O speed-up (reduce elapsed time)?

Yes. Consider a system like Blue Gene, with very many MPI processes and not very many I/O servers.

Collective I/O (P-HDF5) will give you two additional benefits, even if the data is non-interleaved:

- Coalescing requests down to a subset of processes (the I/O aggregators). Instead of a quarter million MPI clients hitting the file system, maybe that is reduced to a thousand, say.

- Some file-system-aware optimizations. For GPFS, writes should be aligned to the file system block boundary. For Lustre, writes should be done in a group-cyclic distribution so that an MPI I/O aggregator only ever speaks to one I/O server (but remember there are going to be on the order of a hundred or a thousand of these aggregators, so overall there is parallelism and improved observed bandwidth).
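
If you want to experiment with the aggregator count yourself, one hedged sketch is to pass ROMIO's "cb_nodes" hint through the MPI_Info object that HDF5 forwards to the MPI-IO layer (whether a given hint is honored depends on your MPI library):

```c
/* Sketch: cap the number of collective-buffering aggregators via a
 * ROMIO hint. Hint support varies across MPI implementations. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "8"); /* use 8 I/O aggregators */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
/* ... open/create the file with this fapl ... */
MPI_Info_free(&info);
```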

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA