Collective IO question

Please redirect me to a more appropriate list/forum if this question is not of interest...

When writing data using collective IO, I have N processes participating, but occasionally one or more processes have an empty dataset and nothing to write.

Due to the way the code is structured, empty processes do not know the type of the data that other processes are writing. It may be a list of float/double/int/long etc.; usually the processes check the type using C++ RTTI and then do the write.

Processes with no data cannot query this information, and so can't simply pass a NULL pointer with zero size, because they don't know which routine is being called (i.e. which datatype).

By way of example, we have a series of templates, which use macros so that for any datatype, we end up here ...

// A Convenience Macro which does what we want for any dataset type
// note that we use a dummy parameter and the ## macro operator to delay expansion
// of the T2 parameter which causes problems if we don't
#define WriteDataArray(null, T2, f, name, dataarray) \
  sprintf(typestring, "%s", #T2); \
  dataset = H5Dcreate(f->timegroup, name.c_str(), null##T2, f->shape, H5P_DEFAULT); \
  if (dataset<0) { \
    ErrorMacro(<<"Dataset create failed for " << name.c_str() \
    << " Timestep " << f->timestep \
    << " Shape " << f->shape \
    << " Data Type " << #T2); \
    r = -1; \
  } else { \
    void *dataptr = dataarray->GetVoidPointer(0); \
    dataptr = dataptr ? dataptr : &buffer[0]; \
    r = H5Dwrite(dataset, T2, memshape, diskshape, H5P_DEFAULT, dataptr); \
  } \
  H5Dclose(dataset);

And when calling H5Dcreate, we do not know what datatype we have ended up with (a result of the ## macro expansion).

I realize that this is a bit strange, but it does work - providing all processes have something to write!

Is there any way for a process to simply skip being part of the IO when it has nothing? I don't want to create a new MPI_Group or anything, but tell proc N (or rather the controller that is waiting for something from it) that proc N is not sending anything.

When using independent IO, I do not need to worry, but collective causes trouble.

Apologies if I have made a mistake or explained badly, I'm revisiting some old code after a bit of time away ...

thanks

JB

···

--
John Biddiscombe, email:biddisco @ cscs.ch

CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82


When writing data using collective IO, I have N processes participating,
but occasionally one or more processes have an empty dataset and nothing
to write.

Due to the way the code is structured, empty processes do not know the
type of the data that other processes are writing. It may be a list of
float/double/int/long etc.; usually the processes check the type
using C++ RTTI and then do the write.

Processes with no data cannot query this information, and so can't simply
pass a NULL pointer with zero size, because they don't know which routine
is being called (i.e. which datatype).

OK, up to this point I was all set to tell you about H5Sselect_none(),
but I guess that won't work for you...
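For reference, a minimal sketch of what the H5Sselect_none() approach would look like: the rank with nothing to write still makes the collective H5Dwrite() call, just with an empty selection in both dataspaces. It assumes the empty rank can still name a datatype, which is exactly the sticking point here; dataset, memshape and diskshape are the handles from John's macro, while local_count, xfer_plist, start, count and dataptr are illustrative names.

/* Sketch, assuming a collective transfer property list has been set up with
 * H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE); H5T_NATIVE_FLOAT stands
 * in for whatever datatype the non-empty ranks resolved via RTTI. */
if (local_count == 0) {
    H5Sselect_none(memshape);   /* this rank contributes zero elements in memory */
    H5Sselect_none(diskshape);  /* ... and selects nothing in the file           */
    float dummy = 0;            /* buffer is unused for an empty selection, but must be valid */
    r = H5Dwrite(dataset, H5T_NATIVE_FLOAT, memshape, diskshape, xfer_plist, &dummy);
} else {
    H5Sselect_hyperslab(diskshape, H5S_SELECT_SET, start, NULL, count, NULL);
    r = H5Dwrite(dataset, H5T_NATIVE_FLOAT, memshape, diskshape, xfer_plist, dataptr);
}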

By way of example, we have a series of templates, which use macros so
that for any datatype, we end up here ...

// A Convenience Macro which does what we want for any dataset type
// note that we use a dummy parameter and the ## macro operator to delay expansion
// of the T2 parameter which causes problems if we don't
#define WriteDataArray(null, T2, f, name, dataarray) \
sprintf(typestring, "%s", #T2); \
dataset = H5Dcreate(f->timegroup, name.c_str(), null##T2, f->shape, H5P_DEFAULT); \
if (dataset<0) { \
   ErrorMacro(<<"Dataset create failed for " << name.c_str() \
   << " Timestep " << f->timestep \
   << " Shape " << f->shape \
   << " Data Type " << #T2); \
   r = -1; \
} else { \
   void *dataptr = dataarray->GetVoidPointer(0); \
   dataptr = dataptr ? dataptr : &buffer[0]; \
   r = H5Dwrite(dataset, T2, memshape, diskshape, H5P_DEFAULT, dataptr); \
} \
H5Dclose(dataset);

Usually when people use collective I/O and HDF5, there is one big
global dataset, and the processes doing I/O define a hyperslab so that
each process is reading/writing a particular (maybe even overlapping)
region of that dataset.
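By way of illustration, a minimal sketch of that pattern (the names and sizes are made up, and the 1.6-style H5Dcreate matches the macro above): one shared dataset is created collectively, each rank selects its own hyperslab of the file dataspace, and the write uses a collective transfer property list.

hsize_t dims[1]   = { nglobal };               /* global dataset size           */
hsize_t count[1]  = { nlocal };                /* this rank's element count     */
hsize_t offset[1] = { my_offset };             /* where this rank's slab starts */

hid_t filespace = H5Screate_simple(1, dims, NULL);
hid_t memspace  = H5Screate_simple(1, count, NULL);
hid_t dset      = H5Dcreate(file_id, "data", H5T_NATIVE_DOUBLE, filespace, H5P_DEFAULT);

hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE); /* request collective I/O */

H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, local_data);

H5Pclose(xfer);
H5Dclose(dset);
H5Sclose(memspace);
H5Sclose(filespace);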

In your case.. you are having each processor create a dataset for its
data. Do I understand correctly?

And when calling H5Dcreate, we do not know what datatype we have ended
up with (a result of the ## macro expansion).

I realize that this is a bit strange, but it does work - providing all
processes have something to write!

I don't doubt that this works, but I don't think you're actually
seeing any collective I/O benefit with this workload. One of the
largest benefits of collective I/O is that when you write out a small
region of a multidimensional array (a noncontiguous access), the
MPI-I/O library will rearrange the accesses among all processors to
make the actual I/O more friendly to the I/O subsystem.

If you are running on a BlueGene system, though, then collective I/O
is always a win; it's just less of a win with this sort of workload.

Is there any way for a process to simply skip being part of the IO when
it has nothing? I don't want to create a new MPI_Group or anything, but
tell proc N (or rather the controller that is waiting for something from
it) that proc N is not sending anything.

it's actually more complicated than that.. you'd have to create a
whole new communicator, not just a group :>

When using independent IO, I do not need to worry, but collective causes
trouble.

Apologies if I have made a mistake or explained badly, I'm revisiting
some old code after a bit of time away ...

If your total amount of I/O is very small relative to overall runtime,
then I guess my advice would be to just stick with independent I/O. To
take full advantage of collective I/O, you really are going to have to
re-work your code a bit, I think, so that you move from a "one dataset
per process" model to a "shared dataset" model.

Perhaps Quincey can confirm my hunch -- I don't know the HDF5
internals that well and it could very well be doing something
super-clever in your case.

==rob

···

On Mon, Apr 06, 2009 at 04:23:11PM +0200, John Biddiscombe wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


When writing data using collective IO, I have N processes participating,
but occasionally one or more processes have an empty dataset and nothing
to write.

Due to the way the code is structured, empty processes do not know the
type of the data that other processes are writing. It may be a list of
float/double/int/long etc.; usually the processes check the type
using C++ RTTI and then do the write.

Processes with no data cannot query this information, and so can't simply
pass a NULL pointer with zero size, because they don't know which routine
is being called (i.e. which datatype).

OK, up to this point I was all set to tell you about H5Sselect_none(),
but I guess that won't work for you...

  Ditto. :-)

By way of example, we have a series of templates, which use macros so
that for any datatype, we end up here ...

// A Convenience Macro which does what we want for any dataset type
// note that we use a dummy parameter and the ## macro operator to delay expansion
// of the T2 parameter which causes problems if we don't
#define WriteDataArray(null, T2, f, name, dataarray) \
sprintf(typestring, "%s", #T2); \
dataset = H5Dcreate(f->timegroup, name.c_str(), null##T2, f->shape, H5P_DEFAULT); \
if (dataset<0) { \
  ErrorMacro(<<"Dataset create failed for " << name.c_str() \
  << " Timestep " << f->timestep \
  << " Shape " << f->shape \
  << " Data Type " << #T2); \
  r = -1; \
} else { \
  void *dataptr = dataarray->GetVoidPointer(0); \
  dataptr = dataptr ? dataptr : &buffer[0]; \
  r = H5Dwrite(dataset, T2, memshape, diskshape, H5P_DEFAULT, dataptr); \
} \
H5Dclose(dataset);

Usually when people use collective I/O and HDF5, there is one big
global dataset, and the processes doing I/O define a hyperslab so that
each process is reading/writing a particular (maybe even overlapping)
region of that dataset.

In your case.. you are having each processor create a dataset for its
data. Do I understand correctly?

  Hmm, since dataset creation must be collective, either all the processes are participating in the H5Dwrite(), or something very strange is happening in this code.

And when calling H5Dcreate, we do not know what datatype we have ended
up with (a result of the ## macro expansion).

I realize that this is a bit strange, but it does work - providing all
processes have something to write!

I don't doubt that this works, but I don't think you're actually
seeing any collective I/O benefit with this workload. One of the
largest benefits of collective I/O is that when you write out a small
region of a multidimensional array (a noncontiguous access), the
MPI-I/O library will rearrange the accesses among all processors to
make the actual I/O more friendly to the I/O subsystem.

If you are running on a BlueGene system, though, then collective I/O
is always a win; it's just less of a win with this sort of workload.

Is there any way for a process to simply skip being part of the IO when
it has nothing? I don't want to create a new MPI_Group or anything, but
tell proc N (or rather the controller that is waiting for something from
it) that proc N is not sending anything.

it's actually more complicated than that.. you'd have to create a
whole new communicator, not just a group :>

When using independent IO, I do not need to worry, but collective causes
trouble.

Apologies if I have made a mistake or explained badly, I'm revisiting
some old code after a bit of time away ...

If your total amount of I/O is very small relative to overall runtime,
then I guess my advice would be to just stick with independent I/O. To
take full advantage of collective I/O, you really are going to have to
re-work your code a bit, I think, so that you move from a "one dataset
per process" model to a "shared dataset" model.

  I suppose it may be possible for us to create a "come along for the ride in some I/O on an unknown dataset" type of API call, but that seems awkward (at best :-).

Perhaps Quincey can confirm my hunch -- I don't know the HDF5
internals that well and it could very well be doing something
super-clever in your case.

  No, I tend to agree with you here. More details are needed...

    Quincey

···

On Apr 6, 2009, at 10:29 AM, Rob Latham wrote:

On Mon, Apr 06, 2009 at 04:23:11PM +0200, John Biddiscombe wrote:

Quincey, Rob,

Usually when people use collective I/O and HDF5, there is one big
global dataset, and the processes doing I/O define a hyperslab so that
each process is reading/writing a particular (maybe even overlapping)
region of that dataset.

In your case.. you are having each processor create a dataset for its
data. Do I understand correctly?

    Hmm, since dataset creation must be collective, either all the processes are participating in the H5Dwrite(), or something very strange is happening in this code.

Errr, yes. After reading the first reply I thought "Oh dear, I've been doing it all wrong", but I did understand that all processes have to be involved in the Create call.
Each process selects a hyperslab, which is a subset of the data, but all perform the Create/Write calls.

it's actually more complicated than that.. you'd have to create a
whole new communicator, not just a group :>

yup. That's what I meant. (honest!).

If your total amount of I/O is very small relative to overall runtime,
then I guess my advice would be to just stick with independent I/O.

The difference in speed can be significant, but yes. I have tried
a) sticking to independent
b) I do an MPI send between processes right before writing, to exchange datatype. No doubt this hammers performance too. I'll have to try to find some way of knowing 'in advance' what the datatype was. I had hoped someone would tell me I could send a NULL pointer with H5T_NATIVE_ANYTHING or some handy flag that would let me get away with it.

    I suppose it may be possible for us to create a "come along for the ride in some I/O on an unknown dataset" type of API call, but that seems awkward (at best :-).

Awkward is good for me - as long as you are volunteering :-)

Perhaps Quincey can confirm my hunch -- I don't know the HDF5
internals that well and it could very well be doing something
super-clever in your case.

    No, I tend to agree with you here. More details are needed...

I'll rethink my initial communication and see if I can exchange types earlier on.

thanks for taking the time.

JB

···

--
John Biddiscombe, email:biddisco @ cscs.ch

CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

Hi John,

Quincey, Rob,

Usually when people use collective I/O and HDF5, there is one big
global dataset, and the processes doing I/O define a hyperslab so that
each process is reading/writing a particular (maybe even overlapping)
region of that dataset.

In your case.. you are having each processor create a dataset for its
data. Do I understand correctly?

   Hmm, since dataset creation must be collective, either all the processes are participating in the H5Dwrite(), or something very strange is happening in this code.

Errr, yes. After reading the first reply I thought "Oh dear, I've been doing it all wrong", but I did understand that all processes have to be involved in the Create call.
Each process selects a hyperslab, which is a subset of the data, but all perform the Create/Write calls.

  As long as each process can participate in the H5Dwrite(), you can perform that call collectively.

it's actually more complicated than that.. you'd have to create a
whole new communicator, not just a group :>

yup. That's what I meant. (honest!).

If your total amount of I/O is very small relative to overall runtime,
then I guess my advice would be to just stick with independent I/O.

The difference in speed can be significant, but yes. I have tried
a) sticking to independent
b) I do an MPI send between processes right before writing, to exchange datatype. No doubt this hammers performance too. I'll have to try to find some way of knowing 'in advance' what the datatype was. I had hoped someone would tell me I could send a NULL pointer with H5T_NATIVE_ANYTHING or some handy flag that would let me get away with it.

   I suppose it may be possible for us to create a "come along for the ride in some I/O on an unknown dataset" type of API call, but that seems awkward (at best :-).

Awkward is good for me - as long as you are volunteering :-)

  I wasn't actually volunteering, just thinking out loud. :-) If you'd like to hack around in the code, I can point you in the correct direction. BTW, do the processes with nothing to write have the dataset ID being operated on (probably not, given your comments) or just the file ID?

Perhaps Quincey can confirm my hunch -- I don't know the HDF5
internals that well and it could very well be doing something
super-clever in your case.

   No, I tend to agree with you here. More details are needed...

I'll rethink my initial communication and see if I can exchange types earlier on.

  That would probably work also.

    Quincey

···

On Apr 6, 2009, at 2:09 PM, John Biddiscombe wrote:

The difference in speed can be significant, but yes. I have tried

...

b) I do an MPI send between processes right before writing, to exchange
datatype. No doubt this hammers performance too. I'll have to try to
find some way of knowing 'in advance' what the datatype was. I had hoped
someone would tell me I could send a NULL pointer with
H5T_NATIVE_ANYTHING or some handy flag that would let me get away with
it.

I imagine you could send the type with one byte's worth of data. I
know of no system where that would incur significant overhead relative
to I/O. Plus, if you're about to enter a collective I/O call, there
will be a bunch of communication anyway.
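As one hedged example of the kind of exchange Rob describes (type_code_for() and the code values are made up for illustration): every rank contributes a tiny integer type code, and a single allreduce tells the empty ranks which datatype, and therefore which WriteDataArray variant, to use.

#include <mpi.h>

/* Illustrative codes: 0 = no data, 1 = float, 2 = double, 3 = int, ...
 * type_code_for() is a hypothetical helper mapping the RTTI type of a
 * non-empty dataarray onto one of these codes. */
int my_code  = have_data ? type_code_for(dataarray) : 0;
int the_code = 0;
MPI_Allreduce(&my_code, &the_code, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
/* every rank, empty or not, now knows the datatype before the collective write */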

==rob

···

On Mon, Apr 06, 2009 at 09:09:13PM +0200, John Biddiscombe wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

There are also H5Tencode/H5Tdecode (as well as H5Sencode/H5Sdecode) routines in 1.8.0.
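Roughly, shipping a datatype that way might look like this (a sketch only; rank, root, comm and type_id are illustrative names, and error checking is omitted):

#include <stdlib.h>
#include <mpi.h>
#include "hdf5.h"

/* Serialize the datatype on a rank that knows it, broadcast the bytes,
 * and reconstruct it everywhere else (H5Tencode/H5Tdecode need 1.8.0+). */
size_t         nalloc = 0;
unsigned char *buf    = NULL;
if (rank == root) {
    H5Tencode(type_id, NULL, &nalloc);   /* first call: query encoded size  */
    buf = malloc(nalloc);
    H5Tencode(type_id, buf, &nalloc);    /* second call: serialize the type */
}
unsigned long n = (unsigned long)nalloc;
MPI_Bcast(&n, 1, MPI_UNSIGNED_LONG, root, comm);
if (rank != root)
    buf = malloc((size_t)n);
MPI_Bcast(buf, (int)n, MPI_BYTE, root, comm);
hid_t received_type = H5Tdecode(buf);    /* usable on every rank */
free(buf);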

    Quincey

···

On Apr 6, 2009, at 2:39 PM, Rob Latham wrote:

On Mon, Apr 06, 2009 at 09:09:13PM +0200, John Biddiscombe wrote:

The difference in speed can be significant, but yes. I have tried

...

b) I do an MPI send between processes right before writing, to exchange
datatype. No doubt this hammers performance too. I'll have to try to
find some way of knowing 'in advance' what the datatype was. I had hoped
someone would tell me I could send a NULL pointer with
H5T_NATIVE_ANYTHING or some handy flag that would let me get away with
it.

I imagine you could send the type with one byte's worth of data. I
know of no system where that would incur significant overhead relative
to I/O. Plus, if you're about to enter a collective I/O call, there
will be a bunch of communication anyway.

Rob

Plus, if you're about to enter a collective I/O call, there
will be a bunch of communication anyway.

Can you by any chance tell me where I can find out exactly what goes on as a precursor to a collective IO operation? I'd like to understand how the decision is made about who sends what to whom...

Is there a document anywhere which outlines the sequence of calls made and how decisions are reached?

thanks

JB

···

--
John Biddiscombe, email:biddisco @ cscs.ch


CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

Ah, sorry, I see how I confused the issue a bit. I just meant that in
a common application pattern, all processes do some work (simulate a
timestep, say) then write out the result (render a movie frame,
perhaps?).

There will likely be communication either before or after the
computational phase, so if the I/O library does some small amount of
communication in order to optimize I/O, it typically doesn't introduce
a lot of overhead.

I guess if code is embarrassingly parallel and writes out large
contiguous blocks of I/O, then the collective I/O communication might
introduce overhead. If either of those two things is false,
collective I/O is likely to be a win, but of course the best way is to
try out both approaches with your application and see.

==rob

···

On Sun, Apr 19, 2009 at 10:27:36AM +0200, John Biddiscombe wrote:

Plus, if you're about to enter a collective I/O call, there
will be a bunch of communication anyway.

Can you by any chance tell me where I can find out exactly what goes on
as a precursor to a collective IO operation? I'd like to understand how
the decision is made about who sends what to whom...

Is there a document anywhere which outlines the sequence of calls made
and how decisions are reached?

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA