HDF5 across multiple nodes/disks

I am looking for design pointers on how people handle extremely large
datasets, e.g. 10TB - 5000TB, using HDF5. We would like to be able to do
parallel reads and writes from multiple MPI ranks, but want these ranks to
be on different nodes, with each node able to use its own local disk.
That is, instead of, say, 50TB of data sitting in a single file on a network
file system, we might want to arrange for ~3TB to live on each of 20 nodes'
local file systems. Clearly, this is no longer just one file but many, and
I wonder whether there are patterns or libraries that help with this sort of
staging, so that it can be handled by one set of tools and then be relatively
invisible to the MPI-based analysis programs that wish to work with it.
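
For concreteness, the access pattern we would like the analysis codes to
keep using is plain parallel HDF5 over MPI-IO: one logical file, every rank
writing its own slab collectively. A minimal sketch of what I mean (the file
and dataset names are made up):

```c
#include <hdf5.h>
#include <mpi.h>

/* Sketch: each rank writes its own contiguous slab of one logical
 * dataset in a single shared file. Names are hypothetical. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Open one logical file through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("survey.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One 1-D dataset, partitioned into per-rank slabs. */
    hsize_t per_rank = 1024, total = per_rank * (hsize_t)nranks;
    hid_t filespace = H5Screate_simple(1, &total, NULL);
    hid_t dset = H5Dcreate(file, "samples", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start = per_rank * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL,
                        &per_rank, NULL);
    hid_t memspace = H5Screate_simple(1, &per_rank, NULL);

    double buf[1024];
    for (hsize_t i = 0; i < per_rank; i++) buf[i] = (double)(start + i);

    /* Collective write: all ranks participate in one I/O call. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```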

You need to be careful here. You are asking for both a distributed
file system and a specific set of consistency semantics. You can get
all kinds of distributed file systems but they will not provide the
sort of consistency you require.

Install something like PVFS on these nodes, if you really want to use
node-local storage this way. Make each node both a client and a
server in the PVFS environment. Presto: a fairly large file system
volume with the kinds of consistency semantics you require for a
parallel I/O workload.
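
To make that concrete: once the PVFS volume is mounted (say at
/mnt/pvfs2; the path, stripe count, and stripe size below are just
assumptions for illustration), your HDF5 code runs unchanged. At most you
pass ROMIO striping hints through the file access property list so the
single logical file is spread across the node-local servers. A sketch:

```c
/* Sketch: same open as a normal parallel-HDF5 program, but with ROMIO
 * hints so the one logical file is striped across all 20 node-local
 * servers. Mount point and hint values are assumptions. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "20");    /* one server per node */
MPI_Info_set(info, "striping_unit", "4194304"); /* 4 MiB stripes */

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
hid_t file = H5Fcreate("/mnt/pvfs2/survey.h5", H5F_ACC_TRUNC,
                       H5P_DEFAULT, fapl);
/* ... dataset creation and collective I/O exactly as before ... */
H5Fclose(file);
H5Pclose(fapl);
MPI_Info_free(&info);
```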

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Appreciate the pointers. We anticipate that much of the load will be
read-only, so we may be able to tackle the consistency problem another way.
I'll go look at PVFS!
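
For the read-only case, what I have in mind is roughly the following: every
rank opens the file read-only and reads its own slab independently, so write
consistency never comes into play. A sketch, reusing the hypothetical names
from my first message:

```c
/* Sketch: read-only access from every rank; with no writers, the
 * consistency semantics of the underlying file system matter much
 * less. File and dataset names are hypothetical. */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file = H5Fopen("/mnt/pvfs2/survey.h5", H5F_ACC_RDONLY, fapl);
hid_t dset = H5Dopen(file, "samples", H5P_DEFAULT);

/* Each rank selects and reads its own slab; the default transfer
 * property list makes this an independent (non-collective) read. */
hsize_t per_rank = 1024, start = per_rank * (hsize_t)rank;
hid_t filespace = H5Dget_space(dset);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL,
                    &per_rank, NULL);
hid_t memspace = H5Screate_simple(1, &per_rank, NULL);

double buf[1024];
H5Dread(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, buf);

H5Sclose(memspace); H5Sclose(filespace);
H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
```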


--
*Sebastian Good*