Hi,
We are in the process of writing a proposal for creating a cloud computing
environment for the Seventh Framework Programme of the European Commission.
Our plan is to set up a cloud on top of already existing infrastructure that
supports parallel grid computing (including MPI-IO), and I was wondering
whether we could make use of the parallel version of HDF5 so that people using
virtual machines (VMs) in the cloud could benefit from parallel I/O.
Of course, the answer should be affirmative if what we have is a *parallel*
application that makes use of the *parallel* API of the *parallel* HDF5.
However, this has to be an easy-to-use service, and I don't like the idea of
users having to deal with so much explicit parallelism in their apps.
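Just to give an idea of the explicit parallelism I would like to hide from
users, here is a rough (untested) sketch of what a parallel write looks like
today through h5py and mpi4py, assuming an MPI-enabled h5py build:

# run with e.g.: mpirun -np 4 python parallel_write.py
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
# open the file collectively with the MPI-IO driver
with h5py.File('data.h5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('array_1', (comm.size,), dtype='i4')
    # each MPI rank writes its own slice of the dataset
    dset[comm.rank] = 2 * comm.rank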
So I'm toying with the idea of using FUSE (http://fuse.sourceforge.net/) to
mount an existing HDF5 file as if it were a real filesystem. With this, the
user would be able to perform many different operations on the file. For
example, let us suppose that the file 'data.h5' is mounted on '/tmp/data'.
Then the user could access the contents of the file like this:
# listing all the datasets in 'data' filesystem
$ ls /tmp/data
/ (RootGroup) ''
/array_1 (Array(3,)) 'Signed short array'
/array_f (Array(20, 3, 2)) '3-D float array'
/array_s (Array()) 'Scalar signed short array'
/Events (Group) ''
/Events/TEvent1 (Table(257,)) 'Events: TEvent1'
/Events/TEvent2 (Table(257,)) 'Events: TEvent2'
/Events/TEvent3 (Table(257,)) 'Events: TEvent3'
# ascii data dump for dataset 'array_1'
$ cat /tmp/data/array_1
[0] -1
[1] 2
[2] 4
# data dump of *part* of 'array_f'
$ cat /tmp/data/array_f[10]
[10] [[ 60. 61.]
[ 62. 63.]
[ 64. 65.]]
# data selection via complex queries:
$ cat /tmp/data/Events/TEvent1[(xcoord >= 62500) & (xcoord < 64000)]
[250] (500, 250, 'Event: 250', 62500.0, 3906249984.0)
[251] (502, 251, 'Event: 251', 63001.0, 3969125888.0)
[252] (504, 252, 'Event: 252', 63504.0, 4032758016.0)
[Perhaps characters like '[' or '(' could not be used in paths, but I hope you
get the idea.]
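In case it helps to see the shape of this, here is a minimal read-only sketch
of the FUSE side in Python, assuming the fusepy and h5py packages (the names
and the ASCII dump format are purely illustrative, not a finished design):

import errno, stat
import h5py
from fuse import FUSE, FuseOSError, Operations

class HDF5FS(Operations):
    # expose the groups/datasets of an HDF5 file as directories/files
    def __init__(self, h5file):
        self.f = h5py.File(h5file, 'r')

    def _node(self, path):
        try:
            return self.f['/' if path == '/' else path]
        except KeyError:
            raise FuseOSError(errno.ENOENT)

    def getattr(self, path, fh=None):
        node = self._node(path)
        if isinstance(node, h5py.Group):
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        # datasets appear as read-only files holding an ASCII dump
        return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                    st_size=len(self._dump(node)))

    def readdir(self, path, fh):
        return ['.', '..'] + list(self._node(path).keys())

    def read(self, path, size, offset, fh):
        return self._dump(self._node(path))[offset:offset + size]

    def _dump(self, dset):
        # naive: materializes the whole dataset as text on every call;
        # a real service would cache and stream this
        return repr(dset[...]).encode()

if __name__ == '__main__':
    # mount 'data.h5' read-only on '/tmp/data'
    FUSE(HDF5FS('data.h5'), '/tmp/data', foreground=True, ro=True)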
For writing, people could create new groups or datasets easily:
$ mkdir /tmp/data/new_group
$ touch /tmp/data/new_group/new_dataset
but I'm not sure how to feed FUSE with meta-information (datatype,
dimensionality, compression...) about newly created datasets. One possibility
would be to use extended attributes (the standard 'setfattr'/'getfattr'
commands), but I'm not sure how well FUSE would support this. Another
possibility is to support only 1-D datasets of bytes (just like a regular
binary file).
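If extended attributes turned out to work, feeding the metadata could look
roughly like this (again an untested sketch; the attribute names are made up,
and the FUSE layer would have to implement the setxattr hook to translate them
into HDF5 dataset properties):

import os

path = '/tmp/data/new_group/new_dataset'
open(path, 'w').close()                        # same effect as 'touch'
# illustrative metadata carried as user extended attributes
os.setxattr(path, 'user.h5.dtype', b'float64')
os.setxattr(path, 'user.h5.shape', b'20,3,2')
os.setxattr(path, 'user.h5.compression', b'gzip:5')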
I know there are people on this list who have been playing with FUSE/HDF5, so
I'd be glad to get feedback from them (did it work as expected? how much
overhead does FUSE introduce?).
Of course, the idea is to have a parallel HDF5 application running underneath
FUSE, so that the VM user can take advantage of improved I/O speed. Would that
be possible, or am I dreaming too much?
Thanks,
--
Francesc Alted