data layout / collective I/O question

I made a little ASCII art picture to try to help describe what I want to do:

file          proc #1     proc #2

+----+----+   +------+    +------+
|1111|2222|   |......|    |......|
|1111|2222|   |.1111.|    |.2222.|
+----+----+   |.1111.|    |.2222.|
|3333|4444|   |......|    |......|
|3333|4444|   +------+    +------+
+----+----+   |......|
              |.4444.|
              |.4444.|
              |......|
              +------+
              |......|
              |.3333.|
              |.3333.|
              |......|
              +------+

I want to write a dataset with a 3D global array (2D in the above, but you
get the picture), where individual blocks come from different processors
(and have ghost points, indicated by '.'). Currently I can do this with a
bunch of collective calls and hyperslabs: in the example above, I'd do one
write for blocks 1 and 2, and then two more writes where proc 1 would have
blocks 4 and 3, respectively, while the 2nd proc would use H5S_NULL.
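
To make that concrete, here is a minimal sketch of one such round -- one
collective H5Dwrite per set of blocks, with ranks that have no block making
an empty selection via H5Sselect_none() (a H5S_NULL dataspace serves the same
purpose). The helper name write_round(), the ghost width NG and the offsets
are made up for illustration, and error checking is omitted:

#include <hdf5.h>

#define NG 1    /* ghost-cell width, assumed for this sketch */

/* One collective write: each rank contributes at most one block. */
static void
write_round(hid_t dset, hid_t filespace, const double *local_buf,
            const hsize_t local_dims[2],  /* local array size incl. ghosts */
            const hsize_t file_off[2],    /* where the block goes in the file */
            const hsize_t block_dims[2],  /* interior block size */
            int have_block)               /* 0: participate with empty selection */
{
    hid_t memspace = H5Screate_simple(2, local_dims, NULL);
    hid_t dxpl     = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    if (have_block) {
        hsize_t mem_off[2] = { NG, NG };  /* skip the ghost frame */
        H5Sselect_hyperslab(memspace, H5S_SELECT_SET, mem_off, NULL,
                            block_dims, NULL);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, file_off, NULL,
                            block_dims, NULL);
    } else {
        /* no block this round, but the collective call still needs every
           rank, so select nothing in both spaces */
        H5Sselect_none(memspace);
        H5Sselect_none(filespace);
    }

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local_buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
}

In the example above, round 1 would write blocks 1 and 2, and rounds 2 and 3
would write blocks 4 and 3 from proc 1 while proc 2 calls write_round() with
have_block == 0.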

However, this in general ends up being extremely slow, because lots and lots
of small writes happen, and even collective buffering within ROMIO can't fix
those: not all of the data needed to do larger contiguous writes is, in
general, available within any given collective write. (In the example above
it would actually happen to work, because blocks 1 and 2 are written within
the same collective write.)

I suppose the particular case above would benefit hugely from using chunks
of the block size (though I haven't tried), since then all block writes would
end up being contiguous. But I'm looking for a more general solution, i.e., I
want to do one single collective write call and leave it to the lower layers
to figure out how to do the write efficiently -- and they don't have a
chance when using separate writes as described above. In particular, it
happens that not all blocks have the same size, in which case the chunk
method cannot resolve the issue anyway.
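
(For reference, the chunking idea would just amount to something like the
following sketch at dataset-creation time -- the dimensions, the "field"
name and the already-open file id are invented here, and it only lines up
if all blocks really have the same size:)

hsize_t global_dims[2] = { 8, 8 };   /* whole array (made up) */
hsize_t chunk_dims[2]  = { 4, 4 };   /* one block per chunk (made up) */

hid_t filespace = H5Screate_simple(2, global_dims, NULL);
hid_t dcpl      = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 2, chunk_dims);   /* chunk size == block size */

/* 'file' is an already-open (parallel) HDF5 file id */
hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, filespace,
                        H5P_DEFAULT, dcpl, H5P_DEFAULT);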

I'm basically looking for a way to do something similar to the "union of
hyperslabs", but as an ordered sequence of hyperslabs, in both the file and
memory spaces. Is there any possibility in HDF5 to achieve that while
maintaining a hope of performance? (I suppose I could define the pattern
point by point instead of using hyperslabs right now, but that just doesn't
seem right.)

So, on proc 1 I want to say:

append hyperslab(filespace, <block1>)
append hyperslab(memspace, <block1>)
append hyperslab(filespace, <block4>)
append hyperslab(memspace, <block4>)
append hyperslab(filespace, <block3>)
append hyperslab(memspace, <block3>)

on proc 2:

append hyperslab(filespace, <block2>)
append hyperslab(memspace, <block2>)

and then do one collective write with those spaces.

I read something on this list about a more general "virtual dataset" scheme,
which would in some sense extend the chunking idea and sounds like it could
work very well for this case, but I take it that's not something that will
be available anytime soon (if ever).

Thanks,
--Kai

Is H5Sselect_hyperslab with H5S_SELECT_OR not sufficient?
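
For concreteness, building such a union in a single dataspace looks roughly
like the sketch below; the offsets and counts are made up for illustration,
and 'filespace' is assumed to be the dataset's file dataspace:

hsize_t start1[2] = { 0, 0 }, start4[2] = { 4, 4 };   /* made-up offsets */
hsize_t count[2]  = { 4, 4 };                         /* made-up block size */

H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start1, NULL, count, NULL);
H5Sselect_hyperslab(filespace, H5S_SELECT_OR,  start4, NULL, count, NULL);
/* ... same idea on the memory space, then a single collective H5Dwrite ... */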

==rob


On Tue, Apr 05, 2011 at 01:55:36PM -0400, Kai Germaschewski wrote:

> So, on proc 1 I want to say:
>
> append hyperslab(filespace, <block1>)
> append hyperslab(memspace, <block1>)
> append hyperslab(filespace, <block4>)
> append hyperslab(memspace, <block4>)
> append hyperslab(filespace, <block3>)
> append hyperslab(memspace, <block3>)

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Rob,

thanks for your reply.

On Wed, Apr 6, 2011 at 1:39 PM, Rob Latham <robl@mcs.anl.gov> wrote:

> On Tue, Apr 05, 2011 at 01:55:36PM -0400, Kai Germaschewski wrote:
> > So, on proc 1 I want to say:
> >
> > append hyperslab(filespace, <block1>)
> > append hyperslab(memspace, <block1>)
> > append hyperslab(filespace, <block4>)
> > append hyperslab(memspace, <block4>)
> > append hyperslab(filespace, <block3>)
> > append hyperslab(memspace, <block3>)
>
> Is H5Sselect_hyperslab with H5S_SELECT_OR not sufficient?

Unfortunately, no, because that does not preserve the ordering I need. (I
actually tried it, even, but as I suspected, "OR" is commutative but for me
the order matters.)

Actually, I think there's even a non-parallel case where the current
interface is too limiting for me (well, there might well be another way of
doing it, like defining a datatype for an entire block and then doing a list
of such block "points"):

file          proc #1

+----+----+   +------+
|1111|2222|   |......|
|1111|2222|   |.1111.|
+----+----+   |.1111.|
              |......|
              +------+
              |......|
              |.2222.|
              |.2222.|
              |......|
              +------+

In this case, it doesn't even matter whether one selects the filespace with
two ORs or just selects everything in the first place: the order is going to
be row-by-row, "11112222", in the file, and obviously that's not the case in
the memory space, where you first get all of one block, then the other. (I'm
a C programmer, but I maintain my fields in Fortran order; if that confuses
you, just transpose the global field in the file space and the blocks in
memory.)

If the two blocks lived on separate procs, everything would be fine, but
handling the case of more than one block per processor needs something like
a concept of a list of hyperslabs as opposed to a union of hyperslabs.
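
To spell out why the OR'd union cannot express this, here is a small sketch
(sizes invented, ghost cells and the Fortran-order detail left out for
brevity) of the two selections and how HDF5 would pair up their elements:

/* File: a 2 x 8 global array; block 1 = columns 0-3, block 2 = columns 4-7.
   Memory on proc #1: both blocks in one buffer, block 1 stored completely
   before block 2 (16 values total). */
hsize_t file_dims[2] = { 2, 8 };
hsize_t mem_dims[1]  = { 16 };

hid_t filespace = H5Screate_simple(2, file_dims, NULL);
hid_t memspace  = H5Screate_simple(1, mem_dims, NULL);

/* file selection: block 1's region OR block 2's region */
hsize_t f_start1[2] = { 0, 0 }, f_start2[2] = { 0, 4 }, f_count[2] = { 2, 4 };
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, f_start1, NULL, f_count, NULL);
H5Sselect_hyperslab(filespace, H5S_SELECT_OR,  f_start2, NULL, f_count, NULL);

/* memory selection: block 1's 8 values OR block 2's 8 values */
hsize_t m_start1[1] = { 0 }, m_start2[1] = { 8 }, m_count[1] = { 8 };
H5Sselect_hyperslab(memspace, H5S_SELECT_SET, m_start1, NULL, m_count, NULL);
H5Sselect_hyperslab(memspace, H5S_SELECT_OR,  m_start2, NULL, m_count, NULL);

/* H5Dwrite pairs the k-th selected element in memory with the k-th selected
   element in the file, each walked in the dataspace's canonical (row-major)
   order; the order in which the ORs were issued is not retained.  The file
   selection is therefore walked row by row, "11112222", while the memory
   selection is walked as all of block 1 first, then all of block 2 -- so
   block 1's second row (memory elements 4-7) would land in block 2's part
   of the first file row. */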

--Kai

--
Kai Germaschewski
Assistant Professor, Dept of Physics / Space Science Center
University of New Hampshire, Durham, NH 03824
office: Morse Hall 245E
phone: +1-603-862-2912
fax: +1-603-862-2771