Merging of .h5 files / parallel access

gosh, I guess it shows how life at ANL has warped my brain but it
never crossed my mind that anybody would ever want to use serial HDF5
for simultaneous access.

If there are several machines participating in this process, maybe
there are more areas than just HDF5 that would benefit from being an
MPI program.

==rob

···

On Thu, Aug 06, 2009 at 11:56:31AM -0700, Mark Miller wrote:

> > As for the second option - is it possible to have several machines
> > contributing data to an HDF5 file at the same time? How does one
> > manage this?
>
> Concurrent I/O to the same file via HDF5 is going to REQUIRE the use of
> HDF5's parallel driver as well as the various limitations (collectivity
> in various metadata-changing operations such as H5Dcreate) that come
> with it. If multiple serial HDF5 clients open the SAME HDF5 file and try
> to write to it, that'll create problems.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Rob,

Hmm. Not sure whether you were poking fun or downright insulted by my
comment, quoted below.

But, two follow-ups...

First, the problem I see with HDF5's parallel interface is the
REQUIREMENT for collective dataset creation. That means no one processor
can decide to create its own dataset without getting all the other
processors in the MPI communicator being used to interact with the file
involved. That's just way too over-constraining for most applications
that have anything other than trivial I/O patterns.
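
To make the collectivity constraint concrete, here is a minimal sketch
(not from the thread; file and dataset names are arbitrary) of dataset
creation against a parallel HDF5 build. Every rank must issue the same
H5Dcreate call, even if only one rank will ever touch the dataset:

    /* Sketch: collective dataset creation in parallel HDF5.
     * Build against a parallel HDF5: mpicc sketch.c -lhdf5 */
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Open one shared file through the MPI-IO driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Collective metadata operation: ALL ranks must call H5Dcreate
         * with identical arguments, which is the constraint at issue. */
        hsize_t dims[1] = {1024};
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Raw-data writes, by contrast, may be independent or collective. */

        H5Dclose(dset); H5Sclose(space); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }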

FWIW, we use serial HDF5 for parallel applications very successfully and
scalably using an approach I like to call 'Poor Man's Parallel I/O'.
It's scalable, it's flexible in terms of what each processor wants to do
with the HDF5 file, and it's frankly very easy to code up; much easier
than collective-parallel in an application where different processors
wind up managing wholly different and independent HDF5 objects.

Anyhow, sorry if my comment offended.

Mark

···

On Wed, 2009-08-12 at 10:24 -0500, Rob Latham wrote:

On Thu, Aug 06, 2009 at 11:56:31AM -0700, Mark Miller wrote:
> > As for the second option - is it possible to have several machines
> > contributing data to an HDF5 file at the same time? How does one
> > manage this?
>
> Concurrent I/O to the same file via HDF5 is going to REQUIRE the use of
> HDF5's parallel driver as well as the various limitations (collectivity
> in various metadata-changing operations such as H5Dcreate) that come
> with it. If multiple serial HDF5 clients open the SAME HDF5 file and try
> to write to it, that'll create problems.

gosh, I guess it shows how life at ANL has warped my brain but it
never crossed my mind that anybody would ever want to use serial HDF5
for simultaneous access.

If there are several machines participating in this process, maybe
there are more areas than just HDF5 that would benefit from being an
MPI program.

==rob

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

Ha, well, there's a first for everything, right?

The HDF5 solution that's being developed is (unfortunately) being deployed
on a very tight schedule, so expanding to use MPI would probably push
beyond the timeframe that's allocated for this.

Quincey - thanks for your support. Your recommendations helped push me
in the right direction.

···

On Wed, Aug 12, 2009 at 11:24 AM, Rob Latham <robl@mcs.anl.gov> wrote:

On Thu, Aug 06, 2009 at 11:56:31AM -0700, Mark Miller wrote:
> > As for the second option - is it possible to have several machines
> > contributing data to an HDF5 file at the same time? How does one
> > manage this?
>
> Concurrent I/O to the same file via HDF5 is going to REQUIRE the use of
> HDF5's parallel driver as well as the various limitations (collectivity
> in various metadata-changing operations such as H5Dcreate) that come
> with it. If multiple serial HDF5 clients open the SAME HDF5 file and try
> to write to it, that'll create problems.

gosh, I guess it shows how life at ANL has warped my brain but it
never crossed my mind that anybody would ever want to use serial HDF5
for simultaneous access.

If there are several machines participating in this process, maybe
there are more areas than just HDF5 that would benefit from being an
MPI program.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


--
Stefan Novak
Sent from Greenbelt, Maryland, United States

> Hmm. Not sure whether you were poking fun or downright insulted by my
> comment, quoted below.

Hey no offense at all. I was indeed poking fun, but at myself.
Clearly this approach works for a large class of problems.

This discussion, though, is actually pretty exciting to me, and it
parallels a frequent discussion in the file system domain: is it
better to write one file per process, or to get all the processes to
coordinate access to a single file?

> First, the problem I see with HDF5's parallel interface is the
> REQUIREMENT for collective dataset creation. That means no one processor
> can decide to create its own dataset without getting all the other
> processors in the MPI communicator being used to interact with the file
> involved. That's just way too over-constraining for most applications
> that have anything other than trivial I/O patterns.

This is very interesting to me. What I usually see in applications
that want to do this is to create one HDF5 file per process, which
you've determined is not good (definitely not a controversial
assessment).

> FWIW, we use serial HDF5 for parallel applications very successfully
> and scalably using an approach I like to call 'Poor Man's Parallel
> I/O'. It's scalable, it's flexible in terms of what each processor
> wants to do with the HDF5 file, and it's frankly very easy to code up;
> much easier than collective-parallel in an application where different
> processors wind up managing wholly different and independent HDF5
> objects.

Here's what I meant by ANL warping my brain: all of the HDF5 apps I
see work with a single dataset and decompose that dataset into, say,
3d subcubes (the dataset is a big ball of gas, and each process stores
its data in a region, but the whole hdf5 file contains one dataset).
A minimal sketch of this pattern follows the list below.

In our environment, single-dataset operation has a lot of advantages:

- makes checkpoint/restart easier (the single dataset is independent of
  the number of processors, so a run can be restarted on a smaller or
  larger allocation, or an entirely different machine if need be)

- makes post-processing and analysis easier: tools can expect one
  dataset and no matter how it was produced, that's what they'll get.
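
As mentioned above, here is a minimal sketch of the shared-dataset
pattern (not from the thread; 1-D rather than 3-D subcubes for brevity,
names arbitrary), assuming a parallel HDF5 build:

    /* Sketch: every rank writes its own region of ONE dataset via a
     * hyperslab selection; the file layout is independent of the
     * number of ranks. */
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("gas.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One dataset sized for the whole problem. */
        hsize_t gdims[1] = {(hsize_t)size * 16};
        hid_t fspace = H5Screate_simple(1, gdims, NULL);
        hid_t dset   = H5Dcreate(file, "gas", H5T_NATIVE_DOUBLE, fspace,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects its region and writes collectively. */
        hsize_t start[1] = {(hsize_t)rank * 16}, count[1] = {16};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(1, count, NULL);
        double buf[16] = {0};
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }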

> Anyhow, sorry if my comment offended.

I can't stress enough how un-offended I was. Clearly my sense of
humor in my sleep-deprived state is making itself manifest in bizarre
ways.

==rob

···

On Wed, Aug 12, 2009 at 09:11:49AM -0700, Mark Miller wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

> > Hmm. Not sure whether you were poking fun or downright insulted by my
> > comment, quoted below.

> Hey no offense at all. I was indeed poking fun, but at myself.
> Clearly this approach works for a large class of problems.

Thanks for clarification.

> This discussion, though, is actually pretty exciting to me, and it
> parallels a frequent discussion in the file system domain: is it
> better to write one file per process, or to get all the processes to
> coordinate access to a single file?

So, there is an all-or-nothing assumption here that has become
commonplace but leads to a completely artificial constraint: the
choice is either one file per process or one file for all processes.

Poor Man's Parallel I/O (PMPIO) provides you with a knob to 'dial in'
the number of files that are written, COMPLETELY INDEPENDENTLY of the
number of processors doing the writing. We typically use values of 32,
64, 128 or 256 files, depending upon how many real I/O channels we have
from the compute nodes to the file system. But we'll write to these
numbers of files from tens of thousands of cpus. I think so far we've
scaled to 256,000 cpus writing to 1024 files.

If you are interested, I have attached two source code files,
pmpio_hdf5_test.c and pmpio.h. If you build a SERIAL HDF5 and then
compile pmpio_hdf5_test.c and link it with MPI, you can run it and see
an example of a VERY SIMPLE Poor Man's Parallel I/O test client. It is
really unceremoniously simple, but it demonstrates the basic approach.
The PMPIO routines defined in pmpio.h can be easily integrated into any
application currently using a file-per-process approach.
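
The attachments themselves are not reproduced here; the following is
only a minimal sketch of the baton-passing idea using plain MPI
point-to-point messages and serial HDF5 (it does not use the attached
pmpio.h API, and the names and striping scheme are illustrative):

    /* Sketch of Poor Man's Parallel I/O: ranks are striped across
     * nFiles groups; within a group, each rank writes in turn and then
     * passes a baton, so at most nFiles ranks do I/O at once. Build
     * with a SERIAL HDF5: mpicc sketch.c -lhdf5 */
    #include <mpi.h>
    #include <hdf5.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, nFiles = 4, token = 0;   /* the 'dial-in' knob */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char fname[64];
        snprintf(fname, sizeof fname, "pmpio_%03d.h5", rank % nFiles);

        /* Wait for the baton from the previous rank in my group. */
        if (rank >= nFiles)
            MPI_Recv(&token, 1, MPI_INT, rank - nFiles, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* First rank in each group creates the file; later ranks append. */
        hid_t file = (rank < nFiles)
            ? H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)
            : H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);

        /* Each rank is free to create wholly independent objects. */
        char dname[32];
        snprintf(dname, sizeof dname, "rank_%05d", rank);
        hsize_t dims[1] = {16};
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate(file, dname, H5T_NATIVE_INT, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dclose(dset); H5Sclose(space); H5Fclose(file);

        /* Hand the baton to the next rank in my group. */
        if (rank + nFiles < size)
            MPI_Send(&token, 1, MPI_INT, rank + nFiles, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }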

The HDF5 team has worked on the SAP (Set-Aside Processor) approach. I
liked the SAP idea on the surface, but a single set-aside processor
simply isn't going to scale well to 10,000+ cpus. So it probably needs
to be at least a SAPs approach, as in 'Set-Aside ProcessorS', and then
you still have the issue of concurrent metadata distributed across the
SAPs. In addition, I don't like the idea of an application having to
take into account an increased processor allocation for the SAPs.

So, a long while back a colleague of mine and I worked on a 'deferred
object creation' strategy where each processor is required to segregate
'HDF5 file metadata-changing' operations into specific regions of
execution. All processors are interacting with a single HDF5 file.
However, in these segregated regions of execution, processors make
requests for HDF5 object creation (H5Dcreate, H5Gcreate, etc.) that they
intend to use shortly thereafter. The requests are queued locally, and
the object creation is actually deferred; until the sync described
below, attempts to operate on the requested objects will fail. Then,
collectively, after all processors have completed their 'metadata
changing' operations, they call a 'sync-my-pending-requests-with-hdf5'
function. It is a collective function. Upon return, each processor can
then, again, proceed independently, operating on the objects it created.

The actual implementation would involve adding a new property to various
object creation property lists indicating 'deferred creation' is being
requested. This enables the calls to H5<whatever>create to return
immediately and simply queue the request. The object ids returned would
contain information indicating they are 'not yet created'. A new call
such as H5FcreateComplete() would have to be called to sync everything
across all processors. Calling this function would 'clear' the 'not yet
created' info in all the pending objects. I think such an approach would
be a substantial improvement over the existing collective interface and
avoid the problems with the SAP approach. I think we implemented some of
this in a layer on top of HDF5 back in 2003/04, but it never had enough
interest to make it into HDF5 proper.
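
For concreteness, a sketch of how the proposed interface might look in
use. None of this exists in HDF5 proper: H5Pset_deferred_create is a
hypothetical stand-in for the new creation property described above, and
H5FcreateComplete() is the proposed (never shipped) collective sync call;
file, space, my_rank, and buf are assumed to be set up as usual:

    /* HYPOTHETICAL sketch of the proposed deferred-creation interface;
     * neither H5Pset_deferred_create nor H5FcreateComplete exists in
     * any released HDF5. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_deferred_create(dcpl, 1);   /* hypothetical: queue, don't create */

    /* Each rank INDEPENDENTLY requests the objects it needs; the call
     * returns immediately with an id marked 'not yet created'. */
    char name[32];
    snprintf(name, sizeof name, "rank_%05d", my_rank);
    hid_t dset = H5Dcreate(file, name, H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Operating on dset here would fail: creation is still pending. */

    /* COLLECTIVE sync: replays all queued requests on every rank and
     * clears the 'not yet created' state in the pending ids. */
    H5FcreateComplete(file);           /* hypothetical collective call */

    /* From here on, ranks proceed independently on their objects. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);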

Mark

pmpio_hdf5_test.c (5.13 KB)

pmpio.h (19.5 KB)

···

On Wed, 2009-08-12 at 13:46 -0500, Rob Latham wrote:

On Wed, Aug 12, 2009 at 09:11:49AM -0700, Mark Miller wrote:

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

> So, a long while back a colleague of mine and I worked on a 'deferred
> object creation' strategy where each processor is required to segregate
> 'HDF5 file metadata-changing' operations into specific regions of
> execution.

I think you have essentially described netcdf define mode. pnetcdf
demands that all processors define the same variables and dimensions,
but if instead it took the union of each process's define mode, then
we'd have what you've described. The define/data modality of pnetcdf
drives some people batty, but it does let pnetcdf sidestep a lot of
complexity -- we've never had to worry about a free-block allocator,
for example.
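
For reference, a minimal sketch of the pnetcdf define/data modality
being described (not from the thread; file and variable names are
arbitrary):

    /* Sketch: all ranks collectively make identical definitions, then
     * ncmpi_enddef fixes the file layout and switches to data mode.
     * Build against pnetcdf: mpicc sketch.c -lpnetcdf */
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int ncid, dimid, varid;
        MPI_Init(&argc, &argv);

        ncmpi_create(MPI_COMM_WORLD, "demo.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);

        /* Define mode: every rank makes the SAME definitions. */
        ncmpi_def_dim(ncid, "n", 1024, &dimid);
        ncmpi_def_var(ncid, "data", NC_DOUBLE, 1, &dimid, &varid);

        /* Collective switch to data mode; the layout is now fixed, so
         * no free-block allocator is ever needed. */
        ncmpi_enddef(ncid);

        /* ... collective ncmpi_put_vara_double_all() writes here ... */

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }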

> I think such an approach would be a substantial improvement over
> the existing collective interface and avoid the problems with the SAP
> approach. I think we implemented some of this in a layer on top of
> HDF5 back in 2003/04, but it never had enough interest to make it
> into HDF5 proper.

I agree with you that this approach could be a big help to parallel
hdf5 applications.

I'm going to take a closer look at your PMPIO stuff, too. Thanks for
sharing.

==rob

···

On Wed, Aug 12, 2009 at 12:32:08PM -0700, Mark Miller wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

I am not really too familiar with netcdf, but I think: maybe so,
maybe not. I recall that, due to the netcdf format (bytes on disk),
re-entering define mode in netcdf a second or third time was
'painful', so there was a lot of motivation to only ever 'define' once.

In the HDF5 approach I outline, there would be no such penalty for
entering the 'deferred object creation' phase multiple times. It's just
that things would have to be 'synced up' upon completion of each such
phase. Also, I don't think there is any implied requirement that all
metadata reside on all cpus after it's synced. I think objects that were
requested for creation by one processor could be 'paged out' on all
other processors. That way, there are no metadata storage scalability
issues.

Mark

···

On Wed, 2009-08-12 at 15:35 -0500, Rob Latham wrote:

On Wed, Aug 12, 2009 at 12:32:08PM -0700, Mark Miller wrote:
> So, a long while back a colleague of mine and I worked on a 'deferred
> object creation' strategy where each processor is required to segregate
> 'HDF5 file metadata-changing' operations into specific regions of
> execution.

I think you have essentially described netcdf define mode.

--
Mark C. Miller, Lawrence Livermore National Laboratory
email: miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)