data layout for parallel read-only access

I have thousands of data sets totaling on the order of 10s to 100s of GB. I have an MPI-based program in which each node reads a non-overlapping subset of the data sets. Only a small fraction of the total data will be read by the program.

Originally, I partitioned the data sets manually into a number of files equal to the number of nodes, used my own custom file format, and used memory-mapped I/O to access the data.

This seemed rather inflexible, so I started to rewrite my program to use HDF5 and dump everything into one file. This turned out to be a bad idea. If I use the Unix time command to measure the runtime of my new HDF5-enabled program, I notice that the *user* time is somewhat larger than before (which is okay), but the *system* time is much larger now (in the old program, system time was almost negligible). I'm guessing that multiple processes trying to access the same file induce a lot of extra overhead. I'm not using Parallel HDF5 at this point.

I'm planning to run this program on an 8-core machine with a couple of local disks, as well as on a cluster with a fast shared Lustre file system. What is the best way to maximize performance? Split every data set into a separate file? Or keep everything in one file and use Parallel HDF5?

Mark Moll

···

--
http://www.cs.rice.edu/~mmoll
Rice University, MS 132, PO Box 1892, Houston, TX 77251
713-348-5934 (phone), 713-348-5930 (fax)


Hi Mark,

Just a few clarifications:
- The reading of data sets is mostly asynchronous; each node reads its "own" data sets.
- File-per-node is indeed a problem with varying concurrency. That's why I was thinking of a data-set-per-file organization. Each file needs to be accessed by only one node. Each node would have to read many files.

So the question is whether there is a performance difference between parallel asynchronous reading of many files vs. one large file. I'm guessing HDF5 does some preemptive fetching, which would work in favor of one large file.

I'll look into the MPI-POSIX VFD. Thanks for the pointer.

···

On Jun 19, 2009, at 10:55 AM, Mark Howison wrote:

Hi Mark,

I would recommend using parallel HDF5 since you are already using
MPI. We've seen a number of performance issues with the Lustre
filesystem on our Cray XT, which you may run into as well depending on
the size of your cluster and your Lustre configuration. You may have
difficulty finding an MPI-IO implementation that is Lustre-aware.
Lustre performs much better with stripe-aligned access patterns, and
unless the MPI-IO implementation has built-in logic to maintain stripe
alignment, it will probably create excessive lock contention among
stripes. You might want to check with the ROMIO developers, though,
because I know they were working on incorporating better Lustre
support and it may be available in a recent release.

The alternative to MPI-IO is to use the MPI-POSIX VFD, which uses MPI
for synchronization and POSIX calls for the low-level IO operations.
If you use the MPI-POSIX VFD and also set the chunking and alignment
properties for your dataset, you can force your access pattern to align
better, assuming it already has some degree of non-scattered access.
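
For what it's worth, a minimal sketch of that setup (untested, and
assuming an HDF5 1.8 build with parallel support; the 1 MiB values are
purely illustrative, e.g. to match a Lustre stripe size):

#include <hdf5.h>
#include <mpi.h>

/* Sketch: create a file through the MPI-POSIX VFD, asking the library to
   align objects of 1 MiB or more on 1 MiB boundaries. Chunking would be
   set separately, per dataset, via H5Pset_chunk on a dataset creation
   property list. */
hid_t create_aligned(const char *fname, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpiposix(fapl, comm, 0);       /* MPI sync, POSIX IO */
    H5Pset_alignment(fapl, 1048576, 1048576);  /* threshold, alignment */
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}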

Finally, file-per-node sometimes yields better performance and is
usually OK for low-concurrency programs, but if you plan on scaling
the program up, you may want to write to a single shared file. One
problem with fixing the number of files to the number of nodes: what
happens if you decide to run at a different concurrency later on?

Mark Howison

Visualization Group
Lawrence Berkeley National Lab
mhowison@lbl.gov


Hi Mark,

Just a few clarifications:
- The reading of data sets is mostly asynchronous; each node reads its "own"
data sets.

Hmm, so each dataset belongs to only one node? It sounds like you
might not want to go the parallel HDF5 route. If the access is
asynchronous then you don't want to be using synchronized collective
calls.

- File-per-node is indeed a problem with varying concurrency. That's why I
was thinking of a data-set-per-file organization. Each file needs to be
accessed by only one node. Each node would have to read many files.

The disadvantage to dataset-per-file is that it could lead to lots of
files. We have had problems in Lustre with using basic filesystem
commands (ls, mv, cp, etc.) on large collections of small files (in a
recent example, 400K files totaling 230TB). Those commands will fail
with an error like "argument list too long...", although this may be a
limitation of those commands rather than something specific to Lustre.

So the question is whether there is a performance difference between
parallel asynchronous reading of many files vs. one large file. I'm
guessing HDF5 does some preemptive fetching, which would work in favor of
one large file.

I don't think that HDF5 will do any speculative read-aheads within a
file, if that is what you mean by preemptive fetching. It will only
read the regions you specify with a hyperslab selection.

Collective/parallel access only makes sense if you have many nodes
selecting different hyperslabs within the same dataset. If each node
is loading a different dataset, I think that collective access will
lead to unnecessary MPI communication, which will only become worse as
you scale up.
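
To make the distinction concrete, here is a rough sketch (the dataset
name and slab layout are placeholders; it assumes the file was opened
with the MPIO driver, otherwise just pass H5P_DEFAULT as the transfer
property list):

#include <hdf5.h>

/* Sketch: one rank independently reads its own slab of a shared 1-D
   integer dataset. Independent mode does not synchronize with other
   ranks; H5FD_MPIO_COLLECTIVE would. */
void read_my_slab(hid_t file, int rank, hsize_t count, int *buf)
{
    hsize_t offset[1] = { (hsize_t) rank * count };
    hsize_t dims[1]   = { count };

    hid_t dset   = H5Dopen(file, "/some/dataset", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, NULL, dims, NULL);
    hid_t mspace = H5Screate_simple(1, dims, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);
    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
}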

Mark


Hi Mark,

Just a few clarifications:
- The reading of data sets is mostly asynchronous; each node reads its "own"
data sets.

Hmm, so each dataset belongs to only one node? It sounds like you
might not want to go the parallel HDF5 route. If the access is
asynchronous then you don't want to be using synchronized collective
calls.

  Yes, this was my thought also. If the file is read-only, can each process open it independently?

  Quincey


Hi Mark,


  Yes, this was my thought also. If the file is read-only, can each process open it independently?

Right, this is what I did. However, multiple processes trying to read the same file seems to introduce significantly more system overhead than multiple processes reading different files.

  Are you opening the file with MPI or with the default HDF5 file driver?

  It does sound more like a file system issue than something with HDF5...

  Quincey

···

On Jun 23, 2009, at 9:02 AM, Mark Moll wrote:


I'm now experimenting with the data-set-per-file approach. I have an index file that contains (among other things) the file location of each data set. The data sets are stored in an automatically created directory hierarchy, so that no directory will have too many files.
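
A minimal sketch of that kind of layout (the hash and the 256-way fanout
here are illustrative assumptions, not necessarily the exact scheme):

// Spread per-dataset files over a two-level, 256-way directory tree so
// that no single directory grows too large. The djb2 string hash and the
// ".h5" naming are made up for illustration.
#include <sstream>
#include <string>

std::string dataset_path(const std::string& name)
{
    unsigned h = 5381;
    for (std::string::size_type i = 0; i < name.size(); ++i)
        h = 33 * h + static_cast<unsigned char>(name[i]);
    std::ostringstream path;
    path << "data/" << (h & 0xff) << '/' << ((h >> 8) & 0xff)
         << '/' << name << ".h5";
    return path.str();
}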


Thanks for clarifying. That was my impression, but I wanted to make sure before giving up on that route.

--
Mark



I am using the default HDF5 file driver. I also think it is a file system issue. I was just hoping that HDF5 could magically solve it :-).


[Just to summarize the discussion below: I have an MPI-based program that needs parallel, asynchronous, read-only access to different data sets. No two processes need access to the same data sets. In an old implementation without HDF5, this is done by dividing the data sets into a number of files that is a multiple of the number of processors and using mmap to access the data in each file. This is rather inflexible, so I started to look at HDF5. With HDF5 the program runs much slower.]

Initially, I thought the HDF5 version was much slower because all the data sets were stored in one file, but now I see the same thing happening when I put every data set into a separate file (# data sets >> # processors). I am trying to figure out what's going on with Shark and ThreadViewer on OS X. The profiles of the two versions of the program (with and without HDF5) look similar, except for some system calls in the HDF5 version. I actually hear the program hammering the disk in the HDF5 version, so I thought I'd use ThreadViewer to see whether the program was blocked. The picture below shows the difference between a thread in the HDF5 version (top) and the mmap-based version (bottom). Light green means the thread is running, dark means uninterruptible (usually a system call), and yellow means recently running. Clearly, there's a huge difference, but I don't know how to dig any deeper to diagnose this issue.

<Picture 1.png>

I use OpenMPI 1.3.2, HDF5 1.8.3, and gcc 4.3.3 on OS X 10.5.7 (although I have observed the same behavior on Ubuntu Linux and RHEL with Intel and PGI compilers). I have also run the program through valgrind (on Linux and OS X) to see if that would turn up any suspicious activity, but to no avail.

Hi Mark,

Can you give us some idea of the size of these files, the HDF5 VFD you
are using, the dimensionality of the datasets, and the storage method
(contig vs. chunk)? Also, I assume the slowdown you are reporting here
is not on Lustre, but on whatever filesystem you are using in OS X,
probably HFS+?

My guess is that mmap is performing large, contiguous IO operations,
while for some reason HDF5 isn't (possibly why you are hearing disk
thrashing).
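
If small scattered reads turn out to be the culprit, one knob that might
be worth a try (just a sketch; the 4 MiB value and the file name are
arbitrary placeholders, not a recommendation) is HDF5's data sieve
buffer, which controls how aggressively small raw-data reads are
coalesced into larger low-level reads:

#include <hdf5.h>

/* Sketch: enlarge the data sieve buffer (default 64 KB) on the file
   access list before opening the file read-only. */
hid_t open_with_big_sieve(const char *fname)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_sieve_buf_size(fapl, 4 * 1024 * 1024);
    hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}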

Mark


Each file is a couple of MB with ~1500 small datasets, of which only a small number will be read. The datasets are either 1- or 2-dimensional. Each dataset has at most a couple of thousand elements (mostly ints, some floats or chars). I use the default VFD (sec2, right?). The datasets are contiguous. The specific results reported were on OS X, but I've seen a similar slowdown on Lustre as well (although not with this exact revision of the code).

Here are my own "hdf5_hl" macros to read/write STL vector data:

/* Create or open a group, forwarding default property lists. */
#define HDF5CREATE_GROUP(loc_id, name) \
         H5Gcreate(loc_id, name, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT)
#define HDF5OPEN_GROUP(loc_id, name) \
         H5Gopen(loc_id, name, H5P_DEFAULT)
/* Write an STL vector as a new contiguous dataset of the given extent. */
#define HDF5WRITE(group, ndims, dims, name, type, data) { \
         hid_t dataspace = H5Screate_simple(ndims, dims, NULL), \
                 dataset = H5Dcreate(group, name, type, dataspace, \
                         H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT); \
         H5Dwrite(dataset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, &(data)[0]); \
         H5Dclose(dataset); H5Sclose(dataspace); }
/* Read a 1-D or 2-D dataset, resizing the vector to fit its extent. */
#define HDF5READ(group, ndims, name, type, data) { \
         hid_t dataset = H5Dopen(group, name, H5P_DEFAULT), \
                 dataspace = H5Dget_space(dataset); \
         hsize_t dims[2]; \
         H5Sget_simple_extent_dims(dataspace, dims, NULL); \
         data.resize(ndims==1 ? dims[0] : dims[0]*dims[1]); \
         H5Dread(dataset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, &(data)[0]); \
         H5Dclose(dataset); H5Sclose(dataspace); }
/* Read into an already-sized buffer, skipping the dataspace query. */
#define HDF5READ1(group, name, type, data) { \
         hid_t dataset = H5Dopen(group, name, H5P_DEFAULT); \
         H5Dread(dataset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, &(data)[0]); \
         H5Dclose(dataset); }
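
For context, typical use looks something like this (the file, group, and
dataset names here are hypothetical):

#include <vector>
#include <hdf5.h>

void example()
{
    std::vector<int> v(1000, 42);
    hsize_t dims[1] = { (hsize_t) v.size() };

    /* write a vector to /grp/vec */
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = HDF5CREATE_GROUP(file, "grp");
    HDF5WRITE(group, 1, dims, "vec", H5T_NATIVE_INT, v);
    H5Gclose(group); H5Fclose(file);

    /* read it back, letting HDF5READ size the vector */
    std::vector<int> w;
    file  = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    group = HDF5OPEN_GROUP(file, "grp");
    HDF5READ(group, 1, "vec", H5T_NATIVE_INT, w);
    H5Gclose(group); H5Fclose(file);
}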


Since the mmap-based I/O seemed to work well in the old implementation, I was wondering if it is possible to "trick" HDF5 into using mmap. That is, use mmap to pretend a large HDF5 file resides in memory and somehow use the H5FD_CORE driver. Is this at all feasible? Does it have a chance of success? I'm really starting to grasp at straws now...
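
For concreteness, selecting the core driver looks something like this (a
sketch; note the core VFD read()s the whole file into a heap buffer on
open rather than mmap'ing it, and the file name and increment are
placeholders):

#include <hdf5.h>

/* Sketch: the core VFD loads the entire file into memory on open, so all
   later accesses are served from RAM. It only helps if the file fits in
   memory; without a backing store, nothing is written back to disk. */
hid_t open_in_core(const char *fname)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, 1024 * 1024 /* increment */, 0 /* no backing store */);
    hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}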
