Parallel HDF5/FUSE experiences?

Hi,

We are in the process of writing a proposal for creating a cloud computing
environment for the Seventh Framework Programme of the European Commission.

Our plan is to set up a cloud using already existing infrastructure that
supports parallel grid computing (including MPI-IO), and I was wondering
whether we could make use of the parallel version of HDF5 so that people using
Virtual Machines (VMs) in the cloud could make use of parallel I/O.

Of course, the answer should be yes if what we have is a *parallel*
application that makes use of the *parallel* API of *parallel* HDF5.
However, this has to be an easy-to-use service, so I don't like the idea of
users having to deal with so much explicit parallelism in their apps.

So I'm toying with the idea of using FUSE (http://fuse.sourceforge.net/) so as
to mount an existing HDF5 file as if it were a real filesystem. With this, the
user would be able to perform many different actions on the file. For
example, let us suppose that the file 'data.h5' is mounted at '/tmp/data'.
The user could then access the contents of the file like this:

# listing all the datasets in 'data' filesystem
$ ls /tmp/data
/ (RootGroup) ''
/array_1 (Array(3,)) 'Signed short array'
/array_f (Array(20, 3, 2)) '3-D float array'
/array_s (Array()) 'Scalar signed short array'
/Events (Group) ''
/Events/TEvent1 (Table(257,)) 'Events: TEvent1'
/Events/TEvent2 (Table(257,)) 'Events: TEvent2'
/Events/TEvent3 (Table(257,)) 'Events: TEvent3'

# ascii data dump for dataset 'array_1'
$ cat /tmp/data/array_1
[0] -1
[1] 2
[2] 4

# data dump of *part* of 'array_f'
$ cat /tmp/data/array_f[10]
[10] [[ 60. 61.]
[ 62. 63.]
[ 64. 65.]]

# data selection via complex queries:
$ cat /tmp/data/Events/TEvent1[(xcoord >= 62500) & (xcoord < 64000)]
[250] (500, 250, 'Event: 250', 62500.0, 3906249984.0)
[251] (502, 251, 'Event: 251', 63001.0, 3969125888.0)
[252] (504, 252, 'Event: 252', 63504.0, 4032758016.0)

[Perhaps characters like '[' or '(' cannot be used, but I hope you get
the idea]
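
Something like this minimal sketch is what I have in mind for the read-only
part (untested, and assuming the FUSE 2.x and HDF5 1.8 C APIs; the
'mounted_file' global and the 'h5fs_' names are invented for illustration).
FUSE paths map directly onto HDF5 paths, so listing a directory is little
more than an H5Literate over the corresponding group:

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <hdf5.h>
#include <errno.h>

static hid_t mounted_file;  /* handle opened with H5Fopen() at mount time */

struct fill_ctx { fuse_fill_dir_t filler; void *buf; };

/* H5Literate callback: emit one directory entry per link in the group */
static herr_t fill_entry(hid_t g_id, const char *name,
                         const H5L_info_t *info, void *op_data)
{
    struct fill_ctx *ctx = op_data;
    ctx->filler(ctx->buf, name, NULL, 0);
    return 0;
}

static int h5fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                        off_t offset, struct fuse_file_info *fi)
{
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);

    /* The FUSE path doubles as the HDF5 group path; datasets would show
     * up as plain files (the getattr handler, not shown, decides that) */
    hid_t grp = H5Gopen2(mounted_file, path, H5P_DEFAULT);
    if (grp < 0)
        return -ENOENT;
    struct fill_ctx ctx = { filler, buf };
    H5Literate(grp, H5_INDEX_NAME, H5_ITER_NATIVE, NULL, fill_entry, &ctx);
    H5Gclose(grp);
    return 0;
}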

For writing, people could create new groups or datasets easily:

$ mkdir /tmp/data/new_group
$ touch /tmp/data/new_group/new_dataset

but I'm not sure how to feed FUSE with meta-information (datatypes,
dimensionality, compression...) about the newly created datasets. One
possibility is to use the standard UNIX 'chattr' command, but I'm not sure
whether FUSE would support this at all. Another possibility is to support only
1-D datasets of bytes (just like a regular binary file).
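
Thinking about it a bit more: 'chattr' actually toggles ext2/ext3 inode
flags; the closer fit would be extended attributes, which FUSE does expose
through setxattr/getxattr handlers and which the standard 'setfattr' tool can
set. A hypothetical sketch (continuing the code above, plus <string.h>;
'pending_meta_set' is an invented helper that remembers the metadata until
the dataset is actually created):

static int h5fs_setxattr(const char *path, const char *name,
                         const char *value, size_t size, int flags)
{
    /* Only accept our own namespace, e.g.
     *   $ setfattr -n user.h5.dtype -v float64 /tmp/data/new_group/new_dataset
     *   $ setfattr -n user.h5.shape -v 20,3,2  /tmp/data/new_group/new_dataset
     */
    if (strncmp(name, "user.h5.", 8) != 0)
        return -ENOTSUP;
    /* Stash the attribute; the real HDF5 dataset would be created lazily,
     * on the first write, once datatype/shape/compression are all known.
     * pending_meta_set() is a hypothetical helper around a hash table. */
    pending_meta_set(path, name + 8, value, size);
    return 0;
}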

I know that there are people on this list who have been playing with
FUSE/HDF5, so I'd be glad to have feedback from them (did it work as expected?
how much overhead does FUSE introduce?).

Of course, the idea is to have a parallel HDF5 application running under FUSE,
so that the VM user can take advantage of improved I/O speed. Would that be
possible, or am I dreaming too much? :wink:

Thanks,

···

--
Francesc Alted

Hi Francesc,

> Of course, the answer should be yes if what we have is a *parallel*
> application that makes use of the *parallel* API of *parallel* HDF5.
> However, this has to be an easy-to-use service, so I don't like the idea of
> users having to deal with so much explicit parallelism in their apps.

Others here probably have much more experience than I do with parallel
HDF5, but one thing that's likely to affect your use of HDF5 at the
filesystem layer is collective calls. As far as I know, all
operations which affect the file tree (creating and closing groups and
datasets, modifying attributes) are collective operations that
require the participation of all processes to succeed. In this sense
it's hard to make parallel HDF5 "transparent", as MPI-IO seems to
insist that all processes accessing the same file participate
explicitly.
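
To make that concrete, here's a minimal (untested) fragment using the
standard parallel HDF5 C API; every rank has to execute the creation and
close calls below with identical arguments, even the ranks that never write
a byte themselves:

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* All ranks open the file collectively through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Dataset creation is also collective: every rank must make this call */
    hsize_t dims[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/array_1", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);   /* closing the file is collective too */
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}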

Andrew

For read-only consistency, this would work just fine. Of course, you could accomplish the same goal by having each client open the same HDF5 file using the serial interface. For writes, things get a lot more complicated. The FUSE filesystem will not manage datafile consistency for multiple concurrent writes (it wouldn't help with managing write consistency for any file format, for that matter).

-john

···


Hi Andrew,

On Wednesday 07 October 2009 22:42:06, Andrew Collette wrote:

> Hi Francesc,
>
> > Of course, the answer should be yes if what we have is a
> > *parallel* application that makes use of the *parallel* API of
> > *parallel* HDF5. However, this has to be an easy-to-use service, so I
> > don't like the idea of users having to deal with so much explicit
> > parallelism in their apps.
>
> Others here probably have much more experience than I do with parallel
> HDF5, but one thing that's likely to affect your use of HDF5 at the
> filesystem layer is collective calls. As far as I know, all
> operations which affect the file tree (creating and closing groups and
> datasets, modifying attributes) are collective operations that
> require the participation of all processes to succeed. In this sense
> it's hard to make parallel HDF5 "transparent", as MPI-IO seems to
> insist that all processes accessing the same file participate
> explicitly.

Yeah, but my idea is that the parallel app behind FUSE should take
care of the parallelism issues. The FUSE user would only perceive a faster
filesystem (because it goes parallel under the hood).
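
For instance, the FUSE read handler could translate every (offset, size)
request into an HDF5 hyperslab read, and behind that call the dataset
transfer property list could just as well ask for MPI-IO. A rough sketch for
the 1-D byte datasets I mentioned before ('open_dataset_for' is an invented
helper that maps a FUSE path to an open hid_t; a real handler would also
clamp the range to the dataset extent):

static int h5fs_read(const char *path, char *buf, size_t size,
                     off_t offset, struct fuse_file_info *fi)
{
    hid_t dset = open_dataset_for(path);           /* hypothetical helper */
    if (dset < 0)
        return -ENOENT;

    /* Select the requested byte range as a 1-D hyperslab in the file */
    hsize_t start = offset, count = size;
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mspace = H5Screate_simple(1, &count, NULL);

    /* H5P_DEFAULT here could be swapped for an MPI-IO transfer plist */
    herr_t status = H5Dread(dset, H5T_NATIVE_UCHAR, mspace, fspace,
                            H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    return status < 0 ? -EIO : (int)size;
}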

Mmh, as always, the best way to know is to have a try, I guess...

···

--
Francesc Alted

Hi John,

On Wednesday 07 October 2009 23:29:35, John Shalf wrote:

> For read-only consistency, this would work just fine. Of course, you
> could accomplish the same goal by having each client open the same
> HDF5 file using the serial interface.

Uh, but what I want to achieve is for a *serial* application to take
advantage of the improved throughput of parallel HDF5. The FUSE app is just
the middleware that would make this possible.

> For writes, things get a lot
> more complicated. The FUSE filesystem will not manage datafile
> consistency for multiple concurrent writes (it wouldn't help with
> managing write consistency for any file format, for that matter).

But, from what I understand of the FUSE docs, it is your FUSE application that
should take care of data consistency issues, not FUSE itself. So, if the app
behind FUSE starts several (parallel) processes when the HDF5 file is mounted,
then all write/read operations on this HDF5 file could be done safely via
parallel HDF5.

Well, I think I need a proof-of-concept before proceeding further. Maybe I'm
missing something big. Or it may just be that the FUSE layer would prevent any
speed-up at all :-/

Thanks,

···

--
Francesc Alted

Hi Francesc,

I don't know if it's entirely related to what you're after, but you may want to take a look at PLFS:

http://institute.lanl.gov/plfs

This puts a layer on top of FUSE and they have impressive results with some HDF5 parallel I/O benchmarks.

Matt

···

--
_______________________________________________________________________
Matt Street MSc MBCS
Parallel Technology Support Tel: +44 (0) 118 982 4528
High Performance Computing Group AWE, Aldermaston, Reading, RG7 4PR. UK.


On Thursday 08 October 2009 11:05:25, Matthew.Street@awe.co.uk wrote:

> Hi Francesc,
>
> I don't know if it's entirely related to what you're after, but you may
> want to take a look at PLFS:
>
> http://institute.lanl.gov/plfs
>
> This puts a layer on top of FUSE and they have impressive results with some
> HDF5 parallel I/O benchmarks.

Good point, Matthew. I'm especially impressed that the overhead of the FUSE
layer in their benchmarks is just 20% (which is really great when you are
talking about throughput figures between 1 GB/s and 5 GB/s). That's much
better performance than I thought!

···

--
Francesc Alted