Tracing pHDF5's MPI-IO calls

Hello,

I would like to trace the I/O calls of my application, which uses pHDF5. I'm
looking for a way to retrieve the set of (rank,file,offset,size) quadruplets
representing the list of elementary accesses resulting from the I/O phase of
the application. Is it possible to configure pHDF5 so that it provides such a
trace?
Thank you,

Matthieu Dorier

···

--
Matthieu Dorier
ENS Cachan, antenne de Bretagne
Département informatique et télécommunication
http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.
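
If relinking is awkward, the LD_PRELOAD route is roughly the following (the
library path below is a placeholder, and whether the variable reaches remote
ranks depends on your MPI launcher; with Open MPI you can forward it with -x):

# assumes IPM was built with shared libraries enabled; paths are illustrative
export LD_PRELOAD=/path/to/ipm/lib/libipm.so
mpirun -np 64 ./myapp
# e.g. with Open MPI, to be explicit about forwarding the variable:
# mpirun -x LD_PRELOAD=/path/to/ipm/lib/libipm.so -np 64 ./myapp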

It may also be possible to use MPE to trace the MPI-IO calls, but I've
never tried that route.

Mark

···

On Fri, Mar 4, 2011 at 9:20 AM, Matthieu Dorier <Matthieu.Dorier@eleves.bretagne.ens-cachan.fr> wrote:

Hello,

I would like to trace the I/O calls of my application, which uses pHDF5. I'm
looking for a way to retrieve the set of (rank,file,offset,size) quadruplets
representing the list of elementary accesses resulting from the I/O phase of
the application. Is it possible to configure pHDF5 so that it provides such a
trace?
Thank you,

Matthieu Dorier

--
Matthieu Dorier
ENS Cachan, antenne de Bretagne
Département informatique et télécommunication
http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/


It takes some time to get used to the Jumpshot MPE viewer, but it's
pretty cool. You won't get (rank,file,offset,size), though. You get,
essentially, (rank,call,time,duration).
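
For reference, the MPE route looks roughly like this (assuming MPICH2 with
MPE installed; the exact wrapper flag and file names vary by installation):

mpicc -mpe=mpilog myapp.c -o myapp    # link in the MPI logging wrappers
mpiexec -n 16 ./myapp                 # writes a myapp.clog2 trace at exit
clog2TOslog2 myapp.clog2              # convert to the slog2 format
jumpshot myapp.slog2                  # browse the timeline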

==rob

···

On Fri, Mar 04, 2011 at 10:09:14AM -0500, Mark Howison wrote:

It may also be possible to use MPE to trace the MPI-IO calls, but I've
never tried that route.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Matthieu Dorier was asking for a tuple of (rank,file,offset,size).

I guess this really belongs on the ipm-hpc-help list, but IPM doesn't
actually give you the offset information. It wraps fseek(3) but HDF5
using MPI-IO is probably going to call lseek(2), lseek64(2), or some other
seek-like system call.

IPM is pretty close, giving the file, size, and a timestamp all tucked
into a file-per-rank.
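
If you really do need the offsets, one low-tech fallback is to run the job
under strace and reconstruct (file,offset,size) from the per-process logs.
A rough sketch (the launcher name depends on your MPI; the output files are
keyed by PID rather than rank, so you have to map them back yourself, and
strace adds real overhead of its own):

mpiexec -n 4 strace -ff -tt -o iotrace -e trace=file,desc ./myapp
# produces iotrace.<pid> files listing each process's open/lseek/read/write
# calls with their arguments, return values, and timestamps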

==rob

···

On Fri, Mar 04, 2011 at 10:09:14AM -0500, Mark Howison wrote:

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Thank you, I didn't know IPM v2 was released. It will be very helpful.

Matthieu

···

2011/3/4 Mark Howison <mark.howison@gmail.com>

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.

It may also be possible to use MPE to trace the MPI-IO calls, but I've
never tried that route.

Mark

On Fri, Mar 4, 2011 at 9:20 AM, Matthieu Dorier <Matthieu.Dorier@eleves.bretagne.ens-cachan.fr> wrote:
> Hello,
>
> I would like to trace the I/O calls of my application, which uses pHDF5. I'm
> looking for a way to retrieve the set of (rank,file,offset,size) quadruplets
> representing the list of elementary accesses resulting from the I/O phase of
> the application. Is it possible to configure pHDF5 so that it provides such a
> trace?
> Thank you,
>
> Matthieu Dorier
>
> --
> Matthieu Dorier
> ENS Cachan, antenne de Bretagne
> Département informatique et télécommunication
> http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/

--
Matthieu Dorier
ENS Cachan, antenne de Bretagne
Département informatique et télécommunication
http://perso.eleves.bretagne.ens-cachan.fr/~mdori307/wiki/

Hi all,

···

On Mar 4, 2011, at 2:22 PM, Rob Latham wrote:

On Fri, Mar 04, 2011 at 10:09:14AM -0500, Mark Howison wrote:

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.

Matthieu Dorier was asking for a tuple of (rank,file,offset,size).

I guess this really belongs on the ipm-hpc-help list, but IPM doesn't
actually give you the offset information. It wraps fseek(3) but HDF5
using MPI-IO is probably going to call lseek(2), lseek64(2), or some other
seek-like system call.

IPM is pretty close, giving the file, size, and a timestamp all tucked
into a file-per-rank.

  We've got a small project currently in the works that gives a minimal amount of information back to the application: whether a collective I/O read/write operation completed as a collective, or was broken down into an independent operation (or some combination of the two, for chunked datasets). That should help some. I don't think we've got direct funding for more effort in this direction currently, but I'd sure like to roll it into a new set of funding (or work with someone who feels like submitting a patch for this idea).
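
  As a rough sketch of what the application side of that could look like (the
query routine and enum names below are only illustrative placeholders for the
feedback I'm describing, not a shipping API):

#include <hdf5.h>
#include <stdio.h>

/* Hypothetical sketch: after a collective H5Dwrite() that used dxpl_id, ask
 * the library what the MPI-IO layer actually did.  The routine and enum
 * names are placeholders, not something you can call today. */
static void report_io_mode(hid_t dxpl_id)
{
    H5D_mpio_actual_io_mode_t mode;
    if (H5Pget_mpio_actual_io_mode(dxpl_id, &mode) < 0)
        return;
    if (mode == H5D_MPIO_NO_COLLECTIVE)
        fprintf(stderr, "collective request fell back to independent I/O\n");
    else if (mode == H5D_MPIO_CHUNK_MIXED)
        fprintf(stderr, "chunked dataset: mixed collective/independent I/O\n");
}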

  Quincey

Hi Rob, it is true that you can't easily see the offset with IPM,
unless you follow the sequence of writes or look for an fseek command.
I did have an earlier beta of IPM that wrapped lseek and lseek64, but
it looks like those did not make their way into the 2.0 beta. Also, an
important step I forgot to mention in my original email is that you
also have to specify a list of "wraps" when you link your application,
like this:

WRAPS = -Wl,-wrap,fopen,-wrap,fdopen,-wrap,freopen,-wrap,fclose,-wrap,fflush,-wrap,fread,-wrap,fwrite,-wrap,fseek,-wrap,ftell,-wrap,rewind,-wrap,fgetpos,-wrap,fsetpos,-wrap,fgetc,-wrap,getc,-wrap,ungetc,-wrap,read,-wrap,write,-wrap,open,-wrap,open64,-wrap,creat,-wrap,close,-wrap,truncate,-wrap,ftruncate,-wrap,truncate64,-wrap,ftruncate64

(link line ...) -lipm $(WRAPS)

The wraps show the full set of POSIX functions that IPM will trace.
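
In a Makefile, the link step then ends up looking something like this
(IPM_ROOT and OBJS are placeholders for whatever your build uses, and the
recipe line needs a leading tab):

# illustrative link rule; IPM_ROOT points at wherever IPM was installed
myapp: $(OBJS)
	mpicc -o $@ $(OBJS) -L$(IPM_ROOT)/lib -lipm $(WRAPS)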

Mark

···

On Fri, Mar 4, 2011 at 3:22 PM, Rob Latham <robl@mcs.anl.gov> wrote:

On Fri, Mar 04, 2011 at 10:09:14AM -0500, Mark Howison wrote:

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.

Matthieu Dorier was asking for a tuple of (rank,file,offset,size).

I guess this really belongs on the ipm-hpc-help list, but IPM doesn't
actually give you the offset information. It wraps fseek(3) but HDF5
using MPI-IO is probably going to call lseek(2), lseek64(2), or some other
seek-like system call.

IPM is pretty close, giving the file, size, and a timestamp all tucked
into a file-per-rank.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Is it possible that having 30,000 text files being written could actually
affect timings when trying to ascertain what's going on with I/O? If so, is
there any way around this?

Leigh

···

On Fri, Mar 4, 2011 at 8:09 AM, Mark Howison <mark.howison@gmail.com> wrote:

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Quincey, a tracing feature in HDF5 would be pretty helpful, and
easier to use than IPM. There is a similar feature available in the
Cray MPI-IO library, where you can set the environment variable

MPICH_MPIIO_XSTATS

to 1 or 2 to get detailed output of how the data is aggregated and
written from the CB (collective buffering) nodes (there's more information available from
the Cray document here: http://docs.cray.com/books/S-0013-10/).

But this wouldn't report on chunking or independent I/O through HDF5.
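
In a Cray batch script that looks roughly like this (assuming the Cray MPT
environment and the aprun launcher; where exactly the report lands may depend
on the MPT version, so it's simplest to capture both output streams):

export MPICH_MPIIO_XSTATS=2          # 1 = summary, 2 = more detail
aprun -n 1024 ./myapp > app_out.txt 2> app_err.txt   # stats appear in the job output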

Mark

···

On Mon, Mar 7, 2011 at 11:07 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi all,

On Mar 4, 2011, at 2:22 PM, Rob Latham wrote:

On Fri, Mar 04, 2011 at 10:09:14AM -0500, Mark Howison wrote:

Hi Matthieu,

The Integrated Performance Monitor (IPM) v2 beta has a POSIX I/O
tracing feature. This will give you detailed output of the underlying
POSIX calls (such as open, write and read) made by your application
(through the pHDF5 layer). You can download it here:

http://tools.pub.lab.nm.ifi.lmu.de/web/ipm/

To enable I/O tracing, you have to configure with

./configure --enable-posixio CFLAGS=-DHAVE_POSIXIO_TRACE

You have to relink your application against the libipm.a that this
produces (or you can enable the shared library and do an LD_PRELOAD).
After your application runs, you'll have a text file for each MPI rank
with the POSIX calls and their arguments.

Matthieu Dorier was asking for a tuple of (rank,file,offset,size).

I guess this really belongs on the ipm-hpc-help list, but IPM doesn't
actually give you the offset information. It wraps fseek(3) but HDF5
using MPI-IO is probably going to call lseek(2), lseek64(2), or some other
seek-like system call.

IPM is pretty close, giving the file, size, and a timestamp all tucked
into a file-per-rank.

   We've got a small project currently in the works that gives a minimal amount of information back to the application: whether a collective I/O read/write operation completed as a collective, or was broken down into an independent operation (or some combination of the two, for chunked datasets). That should help some. I don't think we've got direct funding for more effort in this direction currently, but I'd sure like to roll it into a new set of funding (or work with someone who feels like submitting a patch for this idea).

   Quincey


It's the classic tradeoff: you can have a lightweight tracing approach
that generates summaries of the behavior or you can record every
operation (and potentially perturb the results).

The Argonne 'Darshan' project might give enough of a big-picture
summary, but it was designed foremost to be lightweight, not
exhaustive:

http://press.mcs.anl.gov/darshan/
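
Roughly, you preload (or relink against) the Darshan library, run the job,
and post-process the single compressed log it writes per job. The paths
below are placeholders, and your site may already have Darshan wired into
the MPI compiler wrappers:

export LD_PRELOAD=/path/to/darshan/lib/libdarshan.so
mpiexec -n 1024 ./myapp
darshan-parser /path/to/darshan-logs/username_myapp_*.darshan.gz > counters.txt
darshan-job-summary.pl /path/to/darshan-logs/username_myapp_*.darshan.gz   # graphical per-job summary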

···

On Wed, Apr 06, 2011 at 02:01:17PM -0600, Leigh Orf wrote:

> You have to relink your application against the libipm.a that this
> produces (or you can enable the shared library and do an LD_PRELOAD).
> After your application runs, you'll have a text file for each MPI rank
> with the POSIX calls and their arguments.

Is it possible that having 30,000 text files being written could actually
affect timings when trying to ascertain what's going on with I/O? If so, is
there any way around this?

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

> > You have to relink your application against the libipm.a that this
> > produces (or you can enable the shared library and do an LD_PRELOAD).
> > After your application runs, you'll have a text file for each MPI rank
> > with the POSIX calls and their arguments.
>
> Is it possible that having 30,000 text files being written could actually
> affect timings when trying to ascertain what's going on with I/O? If so, is
> there any way around this?

It's the classic tradeoff: you can have a lightweight tracing approach
that generates summaries of the behavior or you can record every
operation (and potentially perturb the results).

I was hoping perhaps that writes were buffered, and since the files are
small, performance might not be impacted beyond opening the file and
flushing at the end. So far as I know, there is no way to profile the
profiling software with the profiling software!

The Argonne 'Darshan' project might give enough of a big-picture
summary, but it was designed foremost to be lightweight, not
exhaustive:

http://press.mcs.anl.gov/darshan/

Thank you, I will check it out.

Leigh

···

On Wed, Apr 6, 2011 at 2:39 PM, Rob Latham <robl@mcs.anl.gov> wrote:

On Wed, Apr 06, 2011 at 02:01:17PM -0600, Leigh Orf wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

I've found that the overhead from writing to the trace files isn't
usually noticeable unless you have a pathological case where there are
many read/write operations with small amounts of data. For instance,
if you have a case where you intend to do 1MB writes, but they get
broken down into 4KB writes (and 256 times as many of them), the overhead is bad.

There could also be some overhead associated with opening 30K files,
but this should occur during MPI_Init, so you can easily exclude it
from any timings you are doing by starting your timer after MPI_Init
(which you would have to do anyway if you are using MPI_Wtime).
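
The timing pattern itself is just the usual one; a minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* any trace-file opens happen in or before here */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    /* ... the I/O phase being measured goes here ... */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("I/O phase took %f s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}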

Mark

···

On Wed, Apr 6, 2011 at 6:49 PM, Leigh Orf <leigh.orf@gmail.com> wrote:

On Wed, Apr 6, 2011 at 2:39 PM, Rob Latham <robl@mcs.anl.gov> wrote:

On Wed, Apr 06, 2011 at 02:01:17PM -0600, Leigh Orf wrote:
> > You have to relink your application against the libipm.a that this
> > produces (or you can enable the shared library and do an LD_PRELOAD).
> > After your application runs, you'll have a text file for each MPI rank
> > with the POSIX calls and their arguments.
>
> Is it possible that having 30,000 text files being written could actually
> affect timings when trying to ascertain what's going on with I/O? If so, is
> there any way around this?

It's the classic tradeoff: you can have a lightweight tracing approach
that generates summaries of the behavior or you can record every
operation (and potentially perturb the results).

I was hoping perhaps that writes were buffered, and since the files are
small, performance might not be impacted beyond opening the file and
flushing at the end. So far as I know, there is no way to profile the
profiling software with the profiling software!

The Argonne 'Darshan' project might give enough of a big-picture
summary, but it was designed foremost to be lightweight, not
exhaustive:

http://press.mcs.anl.gov/darshan/

Thank you, I will check it out.

Leigh

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200
