Flexible parallel HDF5, Set Aside Process question

In the SAP document, dated Nov 2003, by Cheng, Koziol and Wendling, the merits of using a set-aside process (SAP) for managing metadata are explained, but there is a limitation mentioned -

"If you have a group, creating and/or deleting multiple objects inside that group cannot be done independently. This needs to be synchronized"

This statement puzzled me. If the metadata is managed by the SAP, and every process checks in with the SAP when it reads metadata, to check for modifications, then why is there a restriction for separate groups? Is there something particular about the way groups are created/stored which needs further management?

I'd welcome any explanation.

Thanks

JB

···

--
John Biddiscombe, email:biddisco @ cscs.ch

CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

Hi John,

···

On Aug 3, 2009, at 3:30 AM, John Biddiscombe wrote:

In the SAP document, dated Nov 2003 by Cheng, Koziol and Wendling, the merits of using a Set aside process for managing metadata are explained, but there is a limitation mentioned -

"If you have a group, creating and/or deleting multiple objects inside that group cannot be done independently. This needs to be synchronized"

This statement puzzled me. If the metadata is managed by the SAP, and every process checks in with the SAP when it reads metadata, to check for modifications, then why is there a restriction for separate groups. Is there something particular about the way groups are created/stored which needs further management.

I'd welcome any explanation.

  I believe that the document is referring to the problem of having race conditions when creating multiple objects of the same name from independent MPI processes. It's been a few years and my memory is a bit fuzzy at this point, but I think that was the primary issue... :-)

  Quincey

There's another complication to the SAP in that it creates a
surprising condition for consumers of parallel HDF5. We talked about
this a lot back in 2003/2004, and I too have only rough memories of the
discussion, but I think the biggest surprise goes like this:

- Consumer creates an MPI communicator of N procs and passes it to HDF5
- Consumer surprised when HDF5 does I/O with N-1 procs.

The folks at ANL strongly advise against a set aside process for HDF5
and any other similar library.
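
To make the surprise concrete, here is a minimal sketch of the kind of communicator split a set-aside-process design implies. It is illustrative only (plain MPI, not actual HDF5 code), and the choice of which rank gets set aside is an assumption made purely for the example:

    // Sketch only: a library-internal split in which the last rank is "set
    // aside" to serve metadata, so collective I/O runs on N-1 ranks.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        MPI_Comm app_comm = MPI_COMM_WORLD;  // what the consumer hands the library
        int rank, size;
        MPI_Comm_rank(app_comm, &rank);
        MPI_Comm_size(app_comm, &size);

        // Hypothetical internal split: the last rank becomes the metadata server.
        int is_sap = (rank == size - 1);
        MPI_Comm io_comm;
        MPI_Comm_split(app_comm, is_sap ? 1 : 0, rank, &io_comm);

        if (!is_sap) {
            int io_size;
            MPI_Comm_size(io_comm, &io_size);
            if (rank == 0)
                std::printf("application passed %d ranks, I/O uses %d\n", size, io_size);
            // ... collective reads/writes would run on io_comm here ...
        } else {
            // ... the set-aside process would loop here servicing metadata requests ...
        }

        MPI_Comm_free(&io_comm);
        MPI_Finalize();
        return 0;
    }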

==rob

···

On Mon, Aug 03, 2009 at 10:30:57AM +0200, John Biddiscombe wrote:

In the SAP document, dated Nov 2003 by Cheng, Koziol and Wendling, the
merits of using a Set aside process for managing metadata are explained,
but there is a limitation mentioned -

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Rob,

···

On Aug 12, 2009, at 10:19 AM, Rob Latham wrote:

On Mon, Aug 03, 2009 at 10:30:57AM +0200, John Biddiscombe wrote:

In the SAP document, dated Nov 2003 by Cheng, Koziol and Wendling, the
merits of using a Set aside process for managing metadata are explained,
but there is a limitation mentioned -

There's another complication to the SAP in that it creates a
surprising condition for consumers of parallel HDF5. We talked about
this a lot back in 2003/2004 and I too have only rough memories of the
discussion but I think the biggest surprise is like this:

- Consumer creates an MPI communicator of N procs and passes it to HDF5
- Consumer surprised when HDF5 does I/O with N-1 procs.

The folks at ANL strongly advise against a set aside process for HDF5
and any other similar library.

  Yep. That was a core problem with the SAP model. Right now, I'm exploring the idea of using MPI RMA routines in the MPI3 era to manipulate a "shared" data structure in a way that doesn't require a SAP. We'll see...
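
As a rough illustration of the direction mentioned here (and nothing more than that - it is not HDF5 code, and the counter below is just a stand-in for some piece of shared file state such as the next free offset), MPI-3 passive-target RMA lets any rank update a shared value atomically without dedicating a process to own it:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Rank 0 hosts one long long in an RMA window; other ranks expose nothing.
        long long* base = nullptr;
        MPI_Win win;
        MPI_Win_allocate((rank == 0) ? sizeof(long long) : 0, sizeof(long long),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
        if (rank == 0) {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            *base = 0;                        // shared state starts at zero
            MPI_Win_unlock(0, win);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        // Any rank can atomically fetch-and-add, with no server process involved.
        long long request = 1024 * (rank + 1);   // pretend allocation size
        long long my_offset = 0;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&request, &my_offset, MPI_LONG_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);

        std::printf("rank %d: offset %lld for %lld bytes\n", rank, my_offset, request);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }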

  Quincey

Quincey, Rob,

I was pondering an alternative (messy?) way of generating the multiple
group structure that we wanted and I thought I’d float this idea …
It goes like this.

Each process wants to create some arbitrary group/dataset, but in
general, each process is going to create a different group/dataset,
though it may be useful to have N processes in one group, or some other
combination (no need to be overly restrictive)

[nb. I’m approaching this from the vtkCompositeDataStructure - and in
particular vtkCompositeDataIterator - viewpoint if you’re familiar with
these structures]

Each process (collectively) calls HDF5CompositeCollectiveCreate(blah)
and passes in a tree object. The tree object contains nodes and leaves
which describe (in a format of my choosing at the moment) what groups,
subgroups, datasets (dimensions etc.) and everything else it wants to
create.

Because the call is collective, all processes participate, and
internally, the tree is populated by exchanging all-to-all information
about the nodes and leaves so that when complete, each process contains
the full tree, and not just the part it wanted to create.
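
A minimal sketch of how that all-to-all step could look, assuming the tree nodes can be serialized to path strings; the helper name and the newline-separated encoding are illustrative assumptions, not part of any existing API:

    // Sketch: gather every rank's '\n'-separated list of object paths onto all
    // ranks, so each process ends up holding the full, identically ordered list.
    #include <mpi.h>
    #include <string>
    #include <vector>

    std::vector<std::string> exchange_paths(const std::vector<std::string>& local,
                                            MPI_Comm comm) {
        std::string packed;
        for (const auto& p : local) packed += p + '\n';

        int nranks;
        MPI_Comm_size(comm, &nranks);

        // First share how many bytes each rank contributes.
        int mylen = static_cast<int>(packed.size());
        std::vector<int> lengths(nranks), displs(nranks, 0);
        MPI_Allgather(&mylen, 1, MPI_INT, lengths.data(), 1, MPI_INT, comm);
        for (int r = 1; r < nranks; ++r) displs[r] = displs[r - 1] + lengths[r - 1];

        // Then gather all the path strings themselves.
        std::string all(displs.back() + lengths.back(), '\0');
        MPI_Allgatherv(packed.data(), mylen, MPI_CHAR,
                       &all[0], lengths.data(), displs.data(), MPI_CHAR, comm);

        // Split back into individual paths; the order is the same on every rank.
        std::vector<std::string> result;
        std::string::size_type start = 0, nl;
        while ((nl = all.find('\n', start)) != std::string::npos) {
            result.push_back(all.substr(start, nl - start));
            start = nl + 1;
        }
        return result;
    }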

The function returns an iterator object which each process can now use
to collectively

while not iterator->done
  Create group/dataset etc
endwhile

Thus each process has knowledge about who is creating what, and all
create calls are done in the same order on each process, with all
required information so they return without blocking.

An iterator object (C++ only in my implementation I suppose) would
allow me to then do

iterator->local leaves only
loop
  do stuff
endloop

when iterating over local datasets and not wanting to do stuff with
remote ones. (loop over all at start for create, loop over all at end
for close).
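
A rough sketch of the usage pattern described above, with everything except the plain HDF5 calls (H5Dcreate2, H5Dwrite, ...) treated as hypothetical: the "tree" is flattened here to a list of dataset descriptions, and the iterator is just a loop over it. The point is the pattern - every rank issues the same create calls in the same order (so they remain collective), and only the owning rank writes its data:

    #include <hdf5.h>
    #include <string>
    #include <vector>

    struct LeafDesc {              // one leaf of the exchanged "tree"
        std::string path;          // e.g. "/block_3/pressure"
        hsize_t     dim;           // dataset size (1-D here for brevity)
        int         owner_rank;    // which rank will write it
    };

    void create_and_write(hid_t file_id, const std::vector<LeafDesc>& full_tree,
                          int my_rank, const double* my_data) {
        std::vector<hid_t> dsets(full_tree.size());

        // Allow "/group/dataset" paths to create intermediate groups as needed.
        hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
        H5Pset_create_intermediate_group(lcpl, 1);

        // Pass 1: every rank creates every dataset, in the same order (collective).
        for (size_t i = 0; i < full_tree.size(); ++i) {
            hid_t space = H5Screate_simple(1, &full_tree[i].dim, NULL);
            dsets[i] = H5Dcreate2(file_id, full_tree[i].path.c_str(),
                                  H5T_NATIVE_DOUBLE, space,
                                  lcpl, H5P_DEFAULT, H5P_DEFAULT);
            H5Sclose(space);
        }

        // Pass 2: only the "local leaves" are written (independent I/O; my_data is
        // assumed to match the size of each dataset this rank owns).
        for (size_t i = 0; i < full_tree.size(); ++i)
            if (full_tree[i].owner_rank == my_rank)
                H5Dwrite(dsets[i], H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                         H5P_DEFAULT, my_data);

        // Pass 3: everyone closes everything.
        for (size_t i = 0; i < full_tree.size(); ++i)
            H5Dclose(dsets[i]);
        H5Pclose(lcpl);
    }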

I am planning (probably) on implementing something along these lines as
a wrapper function on top of the core HDF5, as I believe it will
(elegantly?) solve my own problem - but I wonder if it is of interest
beyond my application, or if it has merits/flaws that provoke
discussion?

JB

···

--
John Biddiscombe, email:biddisco @ cscs.ch
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

http://www.cscs.ch/

Hello JB,

I see your salutation was to Quincey and Rob but I just couldn't resist
responding myself. Apologies if the following comments are unwanted.

Overall, I think the idea is a good one: have a means to 'build up'
knowledge of a bunch of HDF5 file objects (e.g. file-metadata-changing
kinds of object creation) an individual processor is going to want to
create, locally, and then do a single, collective 'sync' so that all
processors can, in one (hopefully optimized and well placed) collective
operation, update their file metadata.

If I understand you (let me know if my understanding is completely off
base), you propose a means here where the HDF5 client (your application)
creates an 'in-memory' representation for an HDF5 file group/dataset
object tree. Each processor can do that work independently and without
communication. Then, a 'sync' function is called with this tree as input
on each processor and at that point the tree is used to drive the actual
creation of real HDF5 file objects in the file.

My view is, why isn't the HDF5 library proper able to maintain this
'in-memory' representation for you? It certainly already has the
infrastructure necessary to do it. And, why isn't the HDF5 library
itself able to do the 'sync' operation for you? If HDF5 were allowed to
operate in a mode where it 'held off' on collective metadata sync
operations until the client specifically requested them by making a
collective 'CompositeCreate' call as you suggest, your client would
really have to do very little additional work.

And, finally, is it possible for the parallel case, where a bunch of
processors want to do independent object creation (e.g. file-metadata-changing
operations), to look as much as possible like garden-variety, serial HDF5?
I think all of these are possible. The approach I described last week I
called 'deferred object creation', where nothing in the way your HDF5
client interacts with HDF5 needs to change except the inclusion of one
additional 'sync' call. In almost all other respects, each processor's
interaction with the file proceeds independently and completely similarly
to a serial HDF5 interaction. In fact, it would be possible to take an
existing serial code and, with very little work, adjust it so that it
works in a parallel setting where different processors want to
independently create different objects in the file.
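
A sketch of what application code might look like under such a mode. To be clear, H5Xsync_metadata() below is not a real HDF5 call - it is a stand-in for the single collective 'sync' being described, stubbed out so the example compiles; everything else is ordinary, serial-looking HDF5:

    #include <hdf5.h>
    #include <cstdio>

    // Hypothetical: collectively flush/exchange all deferred metadata changes.
    // (Stub only - in the proposed mode the library would provide this.)
    static herr_t H5Xsync_metadata(hid_t /*file_id*/) { return 0; }

    void write_my_block(hid_t file_id, int my_rank, const double* data, hsize_t n) {
        char name[64];
        std::snprintf(name, sizeof(name), "/block_%d/values", my_rank);

        // Each rank creates *different* objects with no coordination - exactly
        // what parallel HDF5 disallows today, since object creation is collective.
        hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
        H5Pset_create_intermediate_group(lcpl, 1);
        hid_t space = H5Screate_simple(1, &n, NULL);
        hid_t dset  = H5Dcreate2(file_id, name, H5T_NATIVE_DOUBLE, space,
                                 lcpl, H5P_DEFAULT, H5P_DEFAULT);

        // The one added collective call: all deferred creations become real and
        // every rank's view of the file metadata is made consistent.
        H5Xsync_metadata(file_id);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Sclose(space);
        H5Pclose(lcpl);
    }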

Mark

···

On Tue, 2009-08-18 at 17:30 +0200, John Biddiscombe wrote:

Quincey, Rob,

I was pondering an alternative (messy?) way of generating the multiple
group structure that we wanted and I thought I'd float this idea ....
It goes like this.

Each process wants to create some arbitrary group/dataset, but in
general, each process is going to create a different group/dataset,
though it may be useful to have N processes in one group, or some
other combination (no need to be overly restrictive)

[nb. I'm approaching this from the vtkCompositeDataStructure - and in
particular vtkCompositeDataIterator - viewpoint if you're familiar
with these structures]

Each process (collectively) calls HDF5CompositeCollectiveCreate(blah)
and passes in a tree object. The tree object contains nodes and leaves
which describe (in a format of my choosing at the moment) what groups,
subgroups datasets (dimensions etc) and everything else it wants to
create.
Because the call is collective, all processes participate, and
internally, the tree is populated by exchanging all-to-all information
about the nodes and leaves so that when complete, each process
contains the full tree, and not just the part it wanted to create.
The function returns an iterator object which each process can now use
to collectively
while not iterator->done
  Create group/dataset etc
endwhile

Thus each process has knowledge about who is creating what, and all
create calls are done in the same order on each process, with all
required information so they return without blocking.

An iterator object (C++ only in my implementation I suppose) would
allow me to then do
iterator->local leaves only
loop
  do stuff
endloop

when iterating over local datasets and not wanting to do stuff with
remote ones. (loop over all at start for create, loop over all at end
for close).

I am planning (probably) on implementing something along these lines
as a wrapper function on top of the core HDF5 as I believe it will
(elegantly?) solve my own problem. but I wonder if it is of interest
beyond my application - or if it has merits/flaws that provoke
discussion?

JB

--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

Mark

My view is, why isn't the HDF5 library proper able to maintain this 'in-
memory' representation for you? It certainly already has the
infrastructure necessary to do it.
  

I'm sure it is! But creates need to be collective (as the library currently stands). I hoped that by introducing the tree structure + iterator, an explicit, user-driven mode of independent (but not really independent) creation would be possible. (A halfway house.)

Process 0 :
  Add Group to tree
  Add Dataset Size (a0.b0.c0) to tree
  Add Dataset Size (a2.b2.c2) to tree

Process 1 :
  Add Different Group to tree
  Add another different Dataset Size (a1.b1.c1) to tree
...
Some processes will not be creating anything at all, so they add nothing to the (local) tree but everyone calls

HDF5CompositeCollectiveCreate(my local tree)

iterate over returned tree iterator
create stuff
end

And, why isn't the HDF5 library
itself able to do the 'sync' operation for you? If HDF5 would be allowed
to operate in a mode where it would 'hold off' on collective metadata
sync operations until the client specifically requested them by making a
collective 'CompositeCreate' call as you suggest, your client really has
to do very little additional work.
  

OK, I can see how this would work. The reason I propose the above is that I can implement it entirely as a wrapper on top of HDF5 without modifying core library code - and it solves my immediate problem. A new mode in HDF5 itself which delayed metadata operations but allowed sync calls to be made would achieve the same thing (race conditions on dataset creation? - a sync would have to be forced before any call to write). Of course, the user would need to put sync calls in the appropriate places ... which my method also requires - except that they are forced to be in one place.

Below - 'deferred object creation' - I did not read the threads last week. I will go through the mailing list and find them, and respond again accordingly.

JB

···

And, finally, is it possible for the parallel case where a bunch of
processors want to do independent object creation (e.g. file metadata
changing operations) to look as much as possible like green-garden
variety, serial, HDF5? I think all are possible. The approach I
described last week I called 'deferred object creation', where nothing
in the way your HDF5 client interacts with HDF5 needs to change except
the inclusion of one additional, 'sync' call. In almost all other
respects, each processor's interaction with the file proceeds
independently and completely similarly to a serial HDF5 interaction. In
fact, it would be possible to take an existing serial code and with very
little work adjust it so that it works in a parallel setting where
different processors want to independently create different objects in
the file.

Mark

On Tue, 2009-08-18 at 17:30 +0200, John Biddiscombe wrote:
  

Quincey, Rob,

I was pondering an alternative (messy?) way of generating the multiple
group structure that we wanted and I thought I'd float this idea ....
It goes like this.

Each process wants to create some arbitrary group/dataset, but in
general, each process is going to create a different group/dataset,
though it may be useful to have N processes in one group, or some
other combination (no need to be overly restrictive)

[nb. I'm approaching this from the vtkCompositeDataStructure - and in
particular vtkCompositeDataIterator - viewpoint if you're familiar
with these structures]

Each process (collectively) calls HDF5CompositeCollectiveCreate(blah)
and passes in a tree object. The tree object contains nodes and leaves
which describe (in a format of my choosing at the moment) what groups,
subgroups datasets (dimensions etc) and everything else it wants to
create.
Because the call is collective, all processes participate, and
internally, the tree is populated by exchanging all-to-all information
about the nodes and leaves so that when complete, each process
contains the full tree, and not just the part it wanted to create. The function returns an iterator object which each process can now use
to collectively
while not iterator->done
  Create group/dataset etc
endwhile

Thus each process has knowledge about who is creating what, and all
create calls are done in the same order on each process, with all
required information so they return without blocking.

An iterator object (C++ only in my implementation I suppose) would
allow me to then do
iterator->local leaves only
loop
  do stuff
endloop

when iterating over local datasets and not wanting to do stuff with
remote ones. (loop over all at start for create, loop over all at end
for close).

I am planning (probably) on implementing something along these lines
as a wrapper function on top of the core HDF5 as I believe it will
(elegantly?) solve my own problem. but I wonder if it is of interest
beyond my application - or if it has merits/flaws that provoke
discussion?

JB

--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82
    
--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

Hi all,

Mark

My view is, why isn't the HDF5 library proper able to maintain this 'in-
memory' representation for you? It certainly already has the
infrastructure necessary to do it.

I'm sure it is! But creates needs to be collective (as the library currently stands). I hoped that by introducing the tree structure + iterator, an explicit user driven mode of independent (but not really independent) would be possible. (Half way house).

Process 0 :
Add Group to tree
Add Dataset Size (a0.b0.c0) to tree
Add Dataset Size (a2.b2.c2) to tree

Process 1 :
Add Different Group to tree
Add another different Dataset Size (a1.b1.c1) to tree
...
Some processes will not be creating anything at all, so they add nothing to the (local) tree but everyone calls

HDF5CompositeCollectiveCreate(my local tree)

iterate over returned tree iterator
create stuff
end

And, why isn't the HDF5 library
itself able to do the 'sync' operation for you? If HDF5 would be allowed
to operate in a mode where it would 'hold off' on collective metadata
sync operations until the client specifically requested them by making a
collective 'CompositeCreate' call as you suggest, your client really has
to do very little additional work.

OK, I can see how this would work. The reason I propose the above is because I can implement it entirely as a wrapper on top of HDF5 without modifying core library code - and it solves my immediate problem. A new mode in HDF itself which delayed metadata, but allowed sync calls to be made would achieve the same thing (race conditions on dataset creation? - a sync would have to be forced before any call to write). Of course, the user would need to put sync calls in the appropriate places ... which my method also requires - (except that they are forced to be in one place).

  Mark describes one good way to attack the problem (the "collective create" method) but the problem isn't truly technical - Mark and I (and also internally at The HDF Group) have designed several different solutions to this problem. At this point, it has come down to a matter of funding the work to get to a general [-enough] solution that the community can use. Without that funding or a contribution of working, tested code, I don't think we're going to make progress on this issue. I'd _love_ to do the work, but we don't have spare money internally to fund it. Sorry to rain on the party, but if anyone has any ideas for digging up funding, I would jump on any reasonable opportunities that were presented. :-)

  Quincey

···

On Aug 18, 2009, at 11:28 AM, John Biddiscombe wrote:

Below - 'deferred object creation' - I did not read threads last week. I will go through the mailing list and find them. I'll respond again accordingly.

JB

And, finally, is it possible for the parallel case where a bunch of
processors want to do independent object creation (e.g. file metadata
changing operations) to look as much as possible like green-garden
variety, serial, HDF5? I think all are possible. The approach I
described last week I called 'deferred object creation', where nothing
in the way your HDF5 client interacts with HDF5 needs to change except
the inclusion of one additional, 'sync' call. In almost all other
respects, each processor's interaction with the file proceeds
independently and completely similarly to a serial HDF5 interaction. In
fact, it would be possible to take an existing serial code and with very
little work adjust it so that it works in a parallel setting where
different processors want to independently create different objects in
the file.

Mark

On Tue, 2009-08-18 at 17:30 +0200, John Biddiscombe wrote:

Quincey, Rob,

I was pondering an alternative (messy?) way of generating the multiple
group structure that we wanted and I thought I'd float this idea ....
It goes like this.

Each process wants to create some arbitrary group/dataset, but in
general, each process is going to create a different group/dataset,
though it may be useful to have N processes in one group, or some
other combination (no need to be overly restrictive)

[nb. I'm approaching this from the vtkCompositeDataStructure - and in
particular vtkCompositeDataIterator - viewpoint if you're familiar
with these structures]

Each process (collectively) calls HDF5CompositeCollectiveCreate(blah)
and passes in a tree object. The tree object contains nodes and leaves
which describe (in a format of my choosing at the moment) what groups,
subgroups datasets (dimensions etc) and everything else it wants to
create.
Because the call is collective, all processes participate, and
internally, the tree is populated by exchanging all-to-all information
about the nodes and leaves so that when complete, each process
contains the full tree, and not just the part it wanted to create. The function returns an iterator object which each process can now use
to collectively
while not iterator->done
Create group/dataset etc
endwhile

Thus each process has knowledge about who is creating what, and all
create calls are done in the same order on each process, with all
required information so they return without blocking.
An iterator object (C++ only in my implementation I suppose) would
allow me to then do
iterator->local leaves only
loop
do stuff
endloop

when iterating over local datasets and not wanting to do stuff with
remote ones. (loop over all at start for create, loop over all at end
for close).

I am planning (probably) on implementing something along these lines
as a wrapper function on top of the core HDF5 as I believe it will
(elegantly?) solve my own problem. but I wonder if it is of interest
beyond my application - or if it has merits/flaws that provoke
discussion?

JB

--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82


Hi Quincey,

   I sense a wee bit of frustration in this response from you. I guess I
can understand that. It most certainly is difficult to make progress in
these directions without funding. At the same time, I have to say I am a
bit stunned that enhancing HDF5's parallel interface is or has been such
a hard sell for you guys. Why is it so difficult to get funding for this
kind of thing? I would think almost all of the major computing centers
in the US and around the world would be interested in enhancing HDF5's
parallel interface. Basically, it has remained in the same form as when
it was initially funded by ASCI Tri-Lab efforts more than 7 years
ago. I no longer operate in those circles and so I have no idea where we
might be able to shake some money loose within DOE/NNSA for this. I can
certainly poke around LLNL a bit for you. But, I think I have a lot more
confidence in the '...contribution of working, tested code,...' route. I
would be interested in doing the work myself. At the same time, I am
totally unfamiliar with HDF5 internals and the little bit I've looked at
has overwhelmed me in what I think I need to learn before I can make a
productive contribution of code. What kind of 'help' might I be able to
get if I embarked on such an effort myself?

Mark

···

On Tue, 2009-08-18 at 12:35 -0500, Quincey Koziol wrote:

  Mark describes one good way to attack the problem (the "collective
create" method) but the problem isn't truly technical - Mark and I
(and also internally at The HDF Group) have designed several different
solutions to this problem. At this point, it has come down to a
matter of funding the work to get to a general [-enough] solution that
the community can use. Without that funding or a contribution of
working, tested code, I don't think we're going to make progress on
this issue. I'd _love_ to do the work, but we don't have spare money
internally to fund it. Sorry to rain on the party, but if anyone has
any ideas for digging up funding, I would jump on any reasonable
opportunities that were presented. :slight_smile:

  Quincey

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

Hi Mark,

You brought up really good issues that go far beyond this particular feature.

Hi Quincey,

  I sense a wee bit of frustration in this response from you. I guess I
can understand that. It most certainly is difficult to make progress in
these directions without funding. At the same time, I have to say I am a
bit stunned that enhancing HDF5's parallel interface is or has been such
a hard sell for you guys.

It is not a hard sell at all. Many of us here would be very excited to work on enhancements to the parallel library and make all
users in the world happy :-)

Why is it so difficult to get funding for this
kind of thing?

Good question. There are many reasons. The lack of resources in our group to "hunt" for funding, which is a full-time job, is one of them.
But funding for the feature is only part of the issue. How do we fund its future maintenance? That is even harder than funding the development!

I would think almost all of the major computing centers
in the US and around the world would be interested in enhancing HDF5's
parallel interface.
Basically, it has remained in the same form it was
when it was initially funded by ASCI Tri-Lab efforts more than 7 years
ago.

True, except that there have been many improvements to the parallel library (and to HDF5 in general).

I no longer operate in those circles and so I no idea where we
might be able to shake some money loose within DOE/NNSA for this. I can
certainly poke around LLNL a bit for you. But, I think I have a lot more
confidence in the '...contribution of working, tested code,...' route.

I think many of us agree with you on this route (it is open-source software!), but there are questions that have to be answered first: how to maintain the contributed code? how to distribute it? should it be part of the main library? (and probably many others...)
Just an example: let's assume you give us perfectly written, working code tested on N platforms with M combinations of compilers and different versions of MPI, file systems, etc. In two or three years all of those compilers and platforms become obsolete, MPI moves to a new standard, and your code doesn't even compile. Who will be fixing it? Our group may not have the resources (and maybe not even the expertise) to do it.

We (The HDF Group) are trying to figure out the answers to those questions. It is not an easy task, and maybe we are moving too slowly. Our first step was to set up hdf-forum to help shape an HDF5 community that can share knowledge, work on problems like the one discussed in this thread, find solutions, and make them available to HDF5 users. Support for third-party filters was another step. Managing contributions is on our radar, but I think it cannot be done without help from the HDF5 community.

I
would be interested in doing the work myself. At the same time, I am
totally unfamiliar with HDF5 internals and the little bit I've looked at
has overwhelmed me in what I think I need to learn before I can make a
productive contribution of code. What kind of 'help' might I be able to
get if I embarked on such an effort myself?

HDF-forum will be your best option. We cannot handle those questions through our helpdesk. (Sorry to disappoint you.) But I am sure Quincey will chime in :-)

Elena

···

On Aug 18, 2009, at 5:43 PM, Mark Miller wrote:

Mark

On Tue, 2009-08-18 at 12:35 -0500, Quincey Koziol wrote:

  Mark describes one good way to attack the problem (the "collective
create" method) but the problem isn't truly technical - Mark and I
(and also internally at The HDF Group) have designed several different
solutions to this problem. At this point, it has come down to a
matter of funding the work to get to a general [-enough] solution that
the community can use. Without that funding or a contribution of
working, tested code, I don't think we're going to make progress on
this issue. I'd _love_ to do the work, but we don't have spare money
internally to fund it. Sorry to rain on the party, but if anyone has
any ideas for digging up funding, I would jump on any reasonable
opportunities that were presented. :slight_smile:

  Quincey

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)


Hi Mark,

Hi Quincey,

  I sense a wee bit of frustration in this response from you. I guess I
can understand that.

  I'm not really that frustrated, I just wanted the community to know that we'd thought about solutions to this problem and were ready to make some progress, if we could.

It most certainly is difficult to make progress in
these directions without funding. At the same time, I have to say I am a
bit stunned that enhancing HDF5's parallel interface is or has been such
a hard sell for you guys. Why is it so difficult to get funding for this
kind of thing? I would think almost all of the major computing centers
in the US and around the world would be interested in enhancing HDF5's
parallel interface. Basically, it has remained in the same form it was
when it was initially funded by ASCI Tri-Lab efforts more than 7 years
ago. I no longer operate in those circles and so I no idea where we
might be able to shake some money loose within DOE/NNSA for this. I can
certainly poke around LLNL a bit for you.

  I'm not certain about the underlying issue really... I'm guessing it's a bit of a chicken-and-egg problem - since HDF5 doesn't have all the features that HPC users want, they don't really want to fully embrace it; and since they don't fully embrace it, it's hard to get them to pay for enhancements to fix problems. :-/ So, I'm trying to work with people who are willing to work with us and address the problems that cause them pain, which hopefully will cause more of the HPC community to use HDF5 and get the snowball rolling faster. Poking around for funding would be much appreciated also!

But, I think I have a lot more
confidence in the '...contribution of working, tested code,...' route. I
would be interested in doing the work myself. At the same time, I am
totally unfamiliar with HDF5 internals and the little bit I've looked at
has overwhelmed me in what I think I need to learn before I can make a
productive contribution of code. What kind of 'help' might I be able to
get if I embarked on such an effort myself?

  I'm certainly willing to answer questions and provide guidance, either here on the forum or privately. I can also test out code changes on a wide variety of machines. I can't really take much time to code or debug directly though, since my time is pretty booked with other tasks.

  Quincey

···

On Aug 18, 2009, at 5:43 PM, Mark Miller wrote:

Mark

On Tue, 2009-08-18 at 12:35 -0500, Quincey Koziol wrote:

  Mark describes one good way to attack the problem (the "collective
create" method) but the problem isn't truly technical - Mark and I
(and also internally at The HDF Group) have designed several different
solutions to this problem. At this point, it has come down to a
matter of funding the work to get to a general [-enough] solution that
the community can use. Without that funding or a contribution of
working, tested code, I don't think we're going to make progress on
this issue. I'd _love_ to do the work, but we don't have spare money
internally to fund it. Sorry to rain on the party, but if anyone has
any ideas for digging up funding, I would jump on any reasonable
opportunities that were presented. :slight_smile:

  Quincey

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)


Mark et al,

The approach I described last week I called
’deferred object creation’,

I managed to find time to review what Mark and others wrote on the
subject, both in this thread and in the one discussing the Poor Man’s
Parallel IO. I find myself in complete agreement with all that Mark
wrote … and am also in the category of “Would really like to help and
contribute code, but getting up to speed on the entire HDF5 internals
is a full-time job in its own right”.

You (Quincey) also wrote …

Mark describes one good way to attack the
problem (the “collective
create” method) but the problem isn’t truly technical - Mark and I (and
also internally at The HDF Group) have designed several different
solutions to this problem
In order to further my understanding of the internals, I’d be very
interested in discovering more about the possible solutions. Are there
any other resources out there which I ought to be reading or
investigating (other than the already referenced SAP stuff)?
Particularly with regard to the problem of synchronized, but not
necessarily collective, operations when the number of nodes starts to
become very large.

Thanks

JB

···

--
John Biddiscombe, email:biddisco @ cscs.ch
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

http://www.cscs.ch/

Hi Elena,

Thanks for your responses. Let me clarify one or two points...

Hi Mark,

You brought up really good issues that go far beyond this particular
feature.

> Hi Quincey,
>
> I sense a wee bit of frustration in this response from you. I
> guess I
> can understand that. It most certainly is difficult to make progress
> in
> these directions without funding. At the same time, I have to say I
> am a
> bit stunned that enhancing HDF5's parallel interface is or has been
> such
> a hard sell for you guys.

It is not a hard sell at all. Many of us here will be very excited to
work on the enhancements to the parallel library and make all
users in the world happy :slight_smile:

When I asked about it being a 'hard sell', I was talking about HDF
Group's ability to 'sell' it to potential users and attract funding, not
about HDF Group's internal interest in pursuing it. I do believe HDF
Group is interested in pursuing it.

> Why is it so difficult to get funding for this
> kind of thing?

Good question. There are many reasons. The lack of resources in our
group to "hunt" for funding, which is a full time job, is one of them.
But funding for the feature is only part of the issue. How to fund its
future maintenance? It is even harder than to fund the development!

> I would think almost all of the major computing centers
> in the US and around the world would be interested in enhancing HDF5's
> parallel interface.
> Basically, it has remained in the same form it was
> when it was initially funded by ASCI Tri-Lab efforts more than 7 years
> ago.
True except there were many improvements to the parallel library (and
to HDF5 in general).

My apologies. I did not mean to sound so critical. But let me mention a
concern I had then that may still be relevant. Back then, I was
concerned that the ONLY 'real' users of parallel HDF5 on anything but
'toy' scalability studies and proofs of concept were ASCI Tri-Lab -- and
I am talking about production computing here, using hundreds to thousands
of CPUs to interact with HDF5 files via the parallel interface. If that
was/is indeed the case, then my concern is that HDF Group may not see a
real 'market' for anything more than what you already have in the way of
a parallel interface. If that is so, I and LLNL certainly are NOT helping
that situation any by continuing to use what I've called 'poor man's
parallel I/O' (e.g. a suite of 'coordinated' serial HDF5 files).

Does the HDF Group see a real 'market' for a 'better' parallel
interface? Before you answer, let me say that I think the scientific
computing community is indeed entering a new 'era' of computing, with
apps that run on 100,000+ CPUs. Of course, we've heard all this 'hype'
in the past, but now I think it's actually true. LLNL has done some
recent simulations and visualizations on 128,000 and 64,000 CPUs,
respectively. And that was on only a fraction of the full machine, Dawn,
which we expect to have fully delivered by 2011. We expect to run
1,000,000-CPU runs by then. ANL is building similar systems. I think
'poor man's parallel I/O' can scale to 100,000 CPUs 'ok'. Above that, I
think we are going to actually have to start thinking harder about how
to do parallel I/O 'right'. And an improved interface to parallel HDF5,
I think, would be critical to success there. That's my story to my
management, at least. I am looking into the possibility of turning that
into real $$$ for The HDF Group. We'll see.

···

On Wed, 2009-08-19 at 00:37 -0500, Elena Pourmal wrote:

On Aug 18, 2009, at 5:43 PM, Mark Miller wrote:

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

Hi John,

Mark et al,

The approach I described last week I called 'deferred object creation',

I managed to find time to review what Mark and others wrote on the subject, both in this thread, and the one discussing the Poor Man's Parallel IO. I find myself in complete agreement with all that Mark wrote ... and am also in the the category of "Would really like to help and contribute code, but getting up to speed on the entire HDF5 internals is a full time job in it's own right".

  As I said, I'm happy to advise, etc. but don't have any time for actual coding work on this right now. Soon I am going to be assembling an overview of how HDF5 uses MPI internally, which I'm happy to post to our web-site as a resource for anyone who wants to work on improving any aspect of parallel I/O with HDF5. (I'm guessing that this will be ready in ~1 month).
  If there's enough interest, I think we could set up a branch in our subversion repository to support those working on it. Can anyone interested in writing code toward this end reply to me directly (koziol@hdfgroup.org) and I'll see what I can pull together?

You (Quincey) also wrote ...

    Mark describes one good way to attack the problem (the "collective create" method) but the problem isn't truly technical - Mark and I (and also internally at The HDF Group) have designed several different solutions to this problem

In order to further my understanding of the internals, I'd be very interesting in discovering more about the possible solutions. Are there any other resources out there which I ought to be reading or investigating? (other than the already referenced SAP stuff) Particularly with regard to the problem of synchronized, but not necessarily collective operations when the number of nodes starts to become very large.

  Hmm, I don't have a very formal document describing the solutions I've proposed, but most scalable solutions to the problem boil down to devising a mechanism that allows "single resources" in HDF5 files to be modified by only one rank of an MPI application at a time, while making the changes visible to all the other MPI ranks. Some single resources in a file include:
  - The free space in an HDF5 file. (This is similar to creating a thread-safe malloc library, only using MPI instead of POSIX threads)
  - The contents of a group (i.e. the links) in an HDF5 file. (This is similar to creating a thread-safe associative array, with MPI - see the sketch after this list)
  - The current dimensions of a dataset with unlimited dimensions in an HDF5 file. (This is probably similar to a thread-safe hash table or B-tree, with MPI)
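
A very crude sketch of the flavor of the second item, reducing "the links in a group" to a fixed-size table of integer ids hosted on one rank, which any rank can insert into atomically via MPI-3 one-sided compare-and-swap with linear probing. None of this is HDF5 code; the table size, the hashing, and the encoding of links as integers are all simplifications for illustration:

    #include <mpi.h>
    #include <cstdio>

    static const int TABLE_SIZE = 64;
    static const int EMPTY = -1;

    // Atomically claim a slot for 'link_id' in the table hosted on rank 0.
    int insert_link(int link_id, MPI_Win win) {
        int slot = link_id % TABLE_SIZE;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        for (;; slot = (slot + 1) % TABLE_SIZE) {
            int desired = link_id, expected = EMPTY, previous = 0;
            MPI_Compare_and_swap(&desired, &expected, &previous, MPI_INT,
                                 0 /*target rank*/, slot /*displacement*/, win);
            MPI_Win_flush(0, win);       // complete the op so 'previous' is valid
            if (previous == EMPTY || previous == link_id)
                break;                   // claimed a free slot (or already present)
        }
        MPI_Win_unlock(0, win);
        return slot;
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int* table = nullptr;
        MPI_Win win;
        MPI_Win_allocate((rank == 0) ? TABLE_SIZE * sizeof(int) : 0, sizeof(int),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &table, &win);
        if (rank == 0) {                 // initialize the shared table
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            for (int i = 0; i < TABLE_SIZE; ++i) table[i] = EMPTY;
            MPI_Win_unlock(0, win);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        // Every rank "adds a link" to the same shared group, independently.
        int slot = insert_link(100 + rank, win);
        std::printf("rank %d inserted link %d at slot %d\n", rank, 100 + rank, slot);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }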

  Various other approaches (like the "poor man's parallel HDF5" approach that Mark has proposed, as well as some other ideas I've floated before) have been tendered, but I'm concerned that they don't solve the problem in a really scalable way. They may very well be "good enough" and be the right choice if a completely scalable solution is too much effort...

  Quincey

···

On Sep 10, 2009, at 3:30 AM, John Biddiscombe wrote:

Thanks

JB

Hi Quincey,

   I sense a wee bit of frustration in this response from you. I guess I
can understand that. It most certainly is difficult to make progress in
these directions without funding. At the same time, I have to say I am a
bit stunned that enhancing HDF5's parallel interface is or has been such
a hard sell for you guys. Why is it so difficult to get funding for this
kind of thing? I would think almost all of the major computing centers
in the US and around the world would be interested in enhancing HDF5's
parallel interface. Basically, it has remained in the same form it was
when it was initially funded by ASCI Tri-Lab efforts more than 7 years
ago. I no longer operate in those circles and so I no idea where we
might be able to shake some money loose within DOE/NNSA for this. I can
certainly poke around LLNL a bit for you. But, I think I have a lot more
confidence in the '...contribution of working, tested code,...' route. I
would be interested in doing the work myself. At the same time, I am
totally unfamiliar with HDF5 internals and the little bit I've looked at
has overwhelmed me in what I think I need to learn before I can make a
productive contribution of code. What kind of 'help' might I be able to
get if I embarked on such an effort myself?

Mark

On Tue, 2009-08-18 at 12:35 -0500, Quincey Koziol wrote:

  Mark describes one good way to attack the problem (the "collective
create" method) but the problem isn't truly technical - Mark and I
(and also internally at The HDF Group) have designed several different
solutions to this problem. At this point, it has come down to a
matter of funding the work to get to a general [-enough] solution that
the community can use. Without that funding or a contribution of
working, tested code, I don't think we're going to make progress on
this issue. I'd _love_ to do the work, but we don't have spare money
internally to fund it. Sorry to rain on the party, but if anyone has
any ideas for digging up funding, I would jump on any reasonable
opportunities that were presented. :slight_smile:

  Quincey

--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

Hi Mark,

Hi Elena,

Thanks for your responses. Let me clarify one or two points...

Hi Mark,

You brought up really good issues that go far beyond this particular
feature.

Hi Quincey,

I sense a wee bit of frustration in this response from you. I
guess I
can understand that. It most certainly is difficult to make progress
in
these directions without funding. At the same time, I have to say I
am a
bit stunned that enhancing HDF5's parallel interface is or has been
such
a hard sell for you guys.

It is not a hard sell at all. Many of us here will be very excited to
work on the enhancements to the parallel library and make all
users in the world happy :slight_smile:

When I asked about it being a 'hard sell', I was talking about HDF
Group's ability to 'sell' it to potential users and attract funding not
about internally, within HDF Group's interest in pursuing it. I do
believe HDF Group is interested in pursuing it.

Why is it so difficult to get funding for this
kind of thing?

Good question. There are many reasons. The lack of resources in our
group to "hunt" for funding, which is a full time job, is one of them.
But funding for the feature is only part of the issue. How to fund its
future maintenance? It is even harder than to fund the development!

I would think almost all of the major computing centers
in the US and around the world would be interested in enhancing HDF5's
parallel interface.
Basically, it has remained in the same form it was
when it was initially funded by ASCI Tri-Lab efforts more than 7 years
ago.

True except there were many improvements to the parallel library (and
to HDF5 in general).

My apologies. I did not mean to sound so critical. But let me mention a
concern I had then that may still be relevant. Back then, I was
concerned that the ONLY 'real' users of parallel HDF5 on anything but
'toy' scalability studies and proofs of concept were ASCI Tri-Lab -- so,
I am talking production computing here using 100's to 1000's of cpus to
interact with HDF5 files via parallel interface. If that was/is indeed
the case, then my concern is that HDF Group may not see a real 'market'
for anything more than what you already have in the way of parallel
interface. If that is so, I and LLNL certainly are NOT helping that
situation any by continuing to use what I've called 'poor mans parallel
I/O' (e.g. a suite of 'coordinated' serial HDF5 files).

Does the HDF Group see a real 'market' for a 'better' parallel
interface? Before you answer, let me say that I think the scientific
computing community is indeed entering a new 'era' of computing with
apps that run on 100,000+ cpus. Of course, we've heard all this 'hype'
in the past but now I think its actually true. LLNL has done some recent
simulations and visualizations on 128,000 and 64,000 cpus respectively.
And, that was on only a fraction of the full machine, Dawn, we expect to
have fully delivered by 2011. We expect to run 1,000,000 cpu runs by
then. ANL is building similar systems. I think 'poor mans parallel I/O'
can scale to 100,000 cpus 'ok'. Above that, I think we are going to
actually have to start thinking harder about how to do parallel I/O
'right'? And, an improved interface to parallel HDF5 I think would be
critical to success there. Thats my story to my management at least. I
am looking into possibility if that can turn into real $$$ to HDF Group.
We'll see.

  I do think there is a real market for a better parallel HDF5, even if it's strictly funded by grants from US government agencies for the next 1-5 years. More and more communities are embracing parallel computing and are also using HDF5 for a wide variety of their storage needs, and they will eventually need the capabilities that the DOE labs need now. Part of our effort has been to develop closer ties with the MPI community, particularly the MPICH developers at ANL, so we can leverage the gains that layer is making. Another part of the effort needs to happen within the HDF5 library itself, and I think it's very possible, given a reasonable level of funding, to drive the performance and features toward what the application developers running on 100K-1M+ core machines need.

  It may be "poor man's parallel" today and something based on MPI3 features in the future, along with other improvements, but we'd very much like to be part of that story. :-)

  Quincey

···

On Aug 19, 2009, at 11:05 AM, Mark Miller wrote:

On Wed, 2009-08-19 at 00:37 -0500, Elena Pourmal wrote:

On Aug 18, 2009, at 5:43 PM, Mark Miller wrote: