Round-robin (not parallel) access to a single HDF5 file

I am playing with different I/O strategies for massively parallel multicore
systems. Let's say I have an MPI communicator which represents a subset of
cores (such as the cores on an SMP chip), and I want the root rank on that
communicator to create the file, but I want all of the other cores to write
to that file in a sequential (round-robin) manner.

Something like (fortran pseudocode)

if (myrank.eq.0) then
   call h5open_f
   call h5fcreate_f (file_id)
   ! (write some metadata and 1D arrays)
endif

do irank = 0, num_cores_per_chip-1
   if (myrank.eq.irank) then
      call h5dwrite_f(dset_id, my_3d_array, dims)
   endif
enddo

if (myrank.eq.0) call h5close_f

My question is: can I do the round-robin write part without having to open
and close (h5fopen_f / h5fclose_f) the file for each core? It seems that
file_id (and the other id's like dset_id, dspace_id etc.) carries with it
something tangible that links that id to the file itself. I don't think I
can get away with doing MPI_BCAST of the file_id (and the other id
variables) to the other cores and use those handles to magically access the
file created on the root core... right? I am trying to avoid the overhead of
opening, seeking, closing for each core.

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

I am playing with different I/O strategies for massively parallel multicore
systems. Let's say I have a MPI communicator which represents a subset of
cores (such as the cores on a SMP chip) and I want the root rank on that
communicator to create the file, but I want all of the other cores to write
to that file in a sequential (round-robin) manner.

It sounds to me like you are maybe over-thinking this problem. You've
got an MPI program, so you've got an MPI library, and that MPI library
contains an MPI-IO implementation. It's likely that MPI-IO
implementation will take care of these problems for you.

...

My question is: can I do the round-robin write part without having to open
and close (h5fopen_f / h5fclose_f) the file for each core?

No, not at the HDF5 level. If you are on a file system where that is
possible, the MPI-IO library will do that for you.

It seems that file_id (and the other id's like dset_id, dspace_id
etc.) carries with it something tangible that links that id to the
file itself. I don't think I can get away with doing MPI_BCAST of
the file_id (and the other id variables) to the other cores and use
those handles to magically access the file created on the root
core... right? I am trying to avoid the overhead of opening,
seeking, closing for each core.

Here's what I'd suggest: structure your code for parallel HDF5 with
MPI-IO support and collective I/O enabled. Then the library,
whenever possible, will basically do the sorts of optimizations you're
thinking about. You do have to, via property lists, explicitly enable
MPI-IO support and collective I/O.
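
For reference, here is a minimal C sketch of what Rob describes (the
communicator, file name, and dataset details are placeholders, and error
checking is omitted):

#include <hdf5.h>
#include <mpi.h>

/* Open a file through the MPI-IO VFD and request collective transfers.
 * "comm" and "fname" stand in for whatever your application uses. */
void open_for_collective_io(MPI_Comm comm, const char *fname)
{
    /* file access property list: route all file I/O through MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* dataset transfer property list: ask for collective writes */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* ... create the dataspace and dataset, select each rank's hyperslab,
     * then pass dxpl to H5Dwrite() so the MPI-IO layer can aggregate ... */

    H5Pclose(dxpl);
    H5Pclose(fapl);
    H5Fclose(file);
}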

==rob

···

On Wed, Dec 08, 2010 at 03:38:42PM -0700, Leigh Orf wrote:

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Quincey, is H5Fcreate/open already using an MPI_BCAST under the covers?

If not, there is another optimization that I think was reported in a
paper on PLFS or Adios about passing a flag to the fopen call on each
MPI task that tells it not to update the creation/modification time
except on the root task. This can greatly reduce the load on the
metadata server for a parallel file system.

Mark

···

On Wed, Dec 8, 2010 at 5:38 PM, Leigh Orf <leigh.orf@gmail.com> wrote:

My question is: can I do the round-robin write part without having to open
and close (h5fopen_f / h5fclose_f) the file for each core? It seems that
file_id (and the other id's like dset_id, dspace_id etc.) carries with it
something tangible that links that id to the file itself. I don't think I
can get away with doing MPI_BCAST of the file_id (and the other id
variables) to the other cores and use those handles to magically access the
file created on the root core... right? I am trying to avoid the overhead of
opening, seeking, closing for each core.

Thanks for the information. After I sent my email I realized I left out some
relevant information. I am not using pHDF5 but regular HDF5, but in a
parallel environment. The only reason I am doing this is because I want the
ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit,
etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot do
compression.

I currently have tried two approaches with compression and HDF5 in a
parallel environment: (1) Each MPI rank writes its own compressed HDF5 file.
(2) I create a new MPI communicator (call it subcomm) which operates on a
sub-block of the entire domain. Each instance of subcomm (which could, for
instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of
subcomm, and that root core does the compression and writes to disk. The
problem with (1) is that there are too many files for large simulations; the
problem with (2) is that rank 0 operates on a lot of data and the compression
code slows things down dramatically: rank 0 cranks away while the other
ranks wait at a barrier. So I am trying a third approach where you still have
subcomm, but instead of doing the MPI_GATHER, each core writes, in a
round-robin fashion, to the file created by rank 0 of subcomm. I am hoping
that I'll get the benefits of compression (being done in parallel) and not
suffer a huge penalty for the round-robin approach.

If there were a way to do compressed pHDF5 I'd just do a hybrid approach
where each subcomm root node wrote (in parallel) to its HDF5 file. In this
case, I would presume that the computationally expensive compression
algorithms would be parallelized efficiently. Our goal is to reduce the
number of compressed hdf5 files. Not all the way to 1 file, but not 1 file
per MPI rank. We are not using OpenMP and probably will not be in the
future.

Leigh

···

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

It is possible to implement the round-robin approach, and in fact this
is something we tried in the H5Part library for writing in MPI-IO
independent mode at high concurrency (>=16000 cores). The reason it
was necessary was to "throttle" the number of write requests hitting
the Lustre filesystem to avoid timeout errors. In our case, we were
using 8 groups of 2000 cores in the round-robin write.

I agree with Rob that you should try collective mode first, but if you
still want to pursue the round-robin approach you can open the file on
all cores with the MPI-IO independent mode VFD, and then use the
pseudo-code you presented below with the loop over MPI rank (note that
you need to include a barrier in the loop, and that you will also
probably need a hyperslab selection based on rank before the write
call).
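
For concreteness, a C sketch of that loop (assuming the file and dataset have
already been opened on every rank through the MPI-IO VFD; all identifiers are
placeholders and error checking is omitted):

#include <hdf5.h>
#include <mpi.h>

/* Each rank writes its hyperslab in turn; the barrier serializes the writes. */
static void round_robin_write(MPI_Comm comm, hid_t dset, hid_t memspace,
                              hid_t filespace, const hsize_t *offset,
                              const hsize_t *count, const float *buf)
{
    int nprocs, myrank;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &myrank);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);   /* independent-mode writes */

    for (int irank = 0; irank < nprocs; irank++) {
        if (myrank == irank) {
            /* select this rank's slab of the file dataspace, then write */
            H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
            H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);
        }
        MPI_Barrier(comm);
    }
    H5Pclose(dxpl);
}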

If you want only one file handle, you could use a hybrid-parallel
approach where there is only one MPI task on the node, and each core
launches a separate pthread or OpenMP thread. Then you would open the
file once per node, and you could loop over thread ID within each MPI
task.

In our case, every core had an MPI task, so we passed a token using
point-to-point send/receives. I pasted some code snippets of this
below.

Mark

···

-----

h5part_int64_t
_H5Part_start_throttle (
        H5PartFile *f
        ) {

        if (f->throttle > 0) {
                int ret;
                int token = 1;
                _H5Part_print_info (
                        "Throttling with factor = %d",
                        f->throttle);
                if (f->myproc / f->throttle > 0) {
                        _H5Part_print_debug_detail (
                                "[%d] throttle: waiting on token from %d",
                                f->myproc, f->myproc - f->throttle);
                        // wait to receive token before continuing with read
                        ret = MPI_Recv(
                                &token, 1, MPI_INT,
                                f->myproc - f->throttle, // receive from previous proc
                                f->myproc, // use this proc id as message tag
                                f->comm,
                                MPI_STATUS_IGNORE
                                );
                        if ( ret != MPI_SUCCESS ) return HANDLE_MPI_SENDRECV_ERR;
                }
                _H5Part_print_debug_detail (
                        "[%d] throttle: received token",
                        f->myproc);
        }
        return H5PART_SUCCESS;
}

h5part_int64_t
_H5Part_end_throttle (
        H5PartFile *f
        ) {

        if (f->throttle > 0) {
                int ret;
                int token;
                if (f->myproc + f->throttle < f->nprocs) {
                        // pass token to next proc
                        _H5Part_print_debug_detail (
                                "[%d] throttle: passing token to %d",
                                f->myproc, f->myproc + f->throttle);
                        ret = MPI_Send(
                                &token, 1, MPI_INT,
                                f->myproc + f->throttle, // send to next proc
                                f->myproc + f->throttle, // use the id of the target as tag
                                f->comm
                                );
                        if ( ret != MPI_SUCCESS ) return HANDLE_MPI_SENDRECV_ERR;
                }
        }
        return H5PART_SUCCESS;
}

#ifdef PARALLEL_IO
        herr = _H5Part_start_throttle ( f );
        if ( herr < 0 ) return herr;
#endif

        herr = H5Dwrite (
                dataset_id,
                type,
                f->memshape,
                f->diskshape,
                f->xfer_prop,
                array );

#ifdef PARALLEL_IO
        herr = _H5Part_end_throttle ( f );
        if ( herr < 0 ) return herr;
#endif

On Thu, Dec 9, 2010 at 10:55 AM, Rob Latham <robl@mcs.anl.gov> wrote:


What if you still did collective writes with parallel-HDF5, but you
did a little additional work in the application. If you compress each
portion of data on each MPI rank, then ask HDF5 to write out that
compressed buffer, blammo, you get parallel compression and parallel
I/O. It's not as seamless as if you asked HDF5 to do the compression
for you: I guess you'd have to find a stream-based compression
algorithm (gzip?) that can work on concatenated blocks, and annotate
the dataset with the compression algorithm you selected.
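
A rough C sketch of that idea, assuming zlib for the per-rank compression
step (the attribute name and the choice of zlib are illustrative, and the
variable-sized compressed blocks still have to be placed in the file, e.g.
after exchanging their sizes between ranks):

#include <hdf5.h>
#include <zlib.h>
#include <string.h>

/* Compress the rank-local block in the application before handing the
 * bytes to HDF5. Returns the compressed size, or 0 on failure. */
static uLongf compress_block(const float *data, size_t n,
                             unsigned char *out, uLongf outcap)
{
    uLongf outlen = outcap;
    if (compress2(out, &outlen, (const Bytef *)data,
                  (uLong)(n * sizeof(float)), 6) != Z_OK)
        return 0;
    return outlen;
}

/* Record how the bytes were compressed so a reader can undo it. */
static void tag_dataset_with_algorithm(hid_t dset, const char *algo)
{
    hid_t atype = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, strlen(algo) + 1);
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate2(dset, "app_compression", atype, aspace,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, atype, algo);
    H5Aclose(attr); H5Sclose(aspace); H5Tclose(atype);
}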

==rob

···

On Thu, Dec 09, 2010 at 10:57:28AM -0700, Leigh Orf wrote:

Thanks for the information. After I sent my email I realized I left out some
relevant information. I am not using pHDF5 but regular HDF5, but in a
parallel environment. The only reason I am doing this is because I want the
ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit,
etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot do
compression.

I currently have tried two approaches with compression and HDF5 in a
parallel environment: (1) Each MPI rank writes its own compressed HDF5 file.
(2) I create a new MPI communicator (call it subcomm) which operates on a
sub-block of the entire domain. Each instance of subcomm (which could, for
instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of
subcomm, and that root core does the compression and writes to disk.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Mark,

My question is: can I do the round-robin write part without having to open
and close (h5fopen_f / h5fclose_f) the file for each core? It seems that
file_id (and the other id's like dset_id, dspace_id etc.) carries with it
something tangible that links that id to the file itself. I don't think I
can get away with doing MPI_BCAST of the file_id (and the other id
variables) to the other cores and use those handles to magically access the
file created on the root core... right? I am trying to avoid the overhead of
opening, seeking, closing for each core.

Quincey, is H5Fcreate/open already using an MPI_BCAST under the covers?

  Yes, in certain circumstances.

If not, there is another optimization that I think was reported in a
paper on PLFS or Adios about passing a flag to the fopen call on each
MPI task that tells it not to update the creation/modification time
except on the root task. This can greatly reduce the load on the
metadata server for a parallel file system.

  Interesting, can you send me a reference for this?

  Quincey

···

On Dec 9, 2010, at 10:13 AM, Mark Howison wrote:

On Wed, Dec 8, 2010 at 5:38 PM, Leigh Orf <leigh.orf@gmail.com> wrote:

Hi Leigh,

Thanks for the information. After I sent my email I realized I left out some relevant information. I am not using pHDF5 but regular HDF5, but in a parallel environment. The only reason I am doing this is because I want the ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit, etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot do compression.

  We are working toward it, but it's going to be about a year away.

I currently have tried two approaches with compression and HDF5 in a parallel environment: (1) Each MPI rank writes its own compressed HDF5 file. (2) I create a new MPI communicator (call it subcomm) which operates on a sub-block of the entire domain. Each instance of subcomm (which could, for instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of subcomm, and that root core does the compression and writes to disk. The problem with (1) is there are too many files with large simulations, the problem with (2) is rank 0 is operating on a lot of data and the compression code slows things down dramatically - rank 0 cranks away while the other ranks are at a barrier. So I am trying a third approach where you still have subcomm, but instead of doing the MPI_GATHER, each core writes, in a round-robin fashion, to the file created by rank 0 of subcomm. I am hoping that I'll get the benefits of compression (being done in parallel) and not suffer a huge penalty for the round-robin approach.

If there were a way to do compressed pHDF5 I'd just do a hybrid approach where each subcomm root node wrote (in parallel) to its HDF5 file. In this case, I would presume that the computationally expensive compression algorithms would be parallelized efficiently. Our goal is to reduce the number of compressed hdf5 files. Not all the way to 1 file, but not 1 file per MPI rank. We are not using OpenMP and probably will not be in the future.

  The primary problem is the space allocation that has to happen when data is compressed. This is particularly a problem when performing independent I/O, since the other processes aren't involved, but [eventually] need to know about space that was allocated. Collective I/O is easier, but still will require changes to HDF5, etc. Are you wanting to use collective or independent I/O for your dataset writing?

  Quincey

···

On Dec 9, 2010, at 11:57 AM, Leigh Orf wrote:

> Thanks for the information. After I sent my email I realized I left out
> some relevant information. I am not using pHDF5 but regular HDF5, but in a
> parallel environment. The only reason I am doing this is because I want the
> ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit,
> etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot
> do compression.

> I currently have tried two approaches with compression and HDF5 in a
> parallel environment: (1) Each MPI rank writes its own compressed HDF5
> file. (2) I create a new MPI communicator (call it subcomm) which operates
> on a sub-block of the entire domain. Each instance of subcomm (which could,
> for instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of
> subcomm, and that root core does the compression and writes to disk.

What if you still did collective writes with parallel-HDF5, but you
did a little additional work in the application. If you compress each
portion of data on each MPI rank, then ask HDF5 to write out that
compressed buffer, blammo, you get parallel compression and parallel
I/O. It's not as seamless as if you asked HDF5 to do the compression
for you: I guess you'd have to find a stream-based compression
algorithm (gzip?) that can work on concatenated blocks, and annotate
the dataset with the compression algorithm you selected.

I'd really like to be able to have HDF5 do the compression because I have
grown quite accustomed to how transparent it is. The filters are just
activated, and regardless of how you compress the data, you 'see' floating
point data when you open the file or run h5dump or whatever.

I could code things up to have each core open the HDF5 file, write its part
of the file, close it, and hand it off to the next guy, but I just have
to believe that's going to be really inefficient. It seems there should be a
way to do this by passing file handles or property lists from one MPI
process to another.

I did find this page called "Collective HDF5 Calls in Parallel" which is
interesting but it is unclear to me whether it applies to pHDF5 or just
plain HDF5.

Leigh

···

On Thu, Dec 9, 2010 at 11:40 AM, Rob Latham <robl@mcs.anl.gov> wrote:

On Thu, Dec 09, 2010 at 10:57:28AM -0700, Leigh Orf wrote:


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Following up my own post.

I am fully convinced now that you can't run HDF5's compression code in
parallel when the number of cores compressing is greater than the number
of files being written. You can only run the compression filters 1:1 with
each compressed file. I wrote up my sequential code and quickly realized that
while I can write it such that each core will compress the data, the
compression itself happens sequentially, not in parallel. So I end up
with essentially the same inefficiency I had before, i.e.

      ! Declarations and module use added for completeness; the filename and
      ! dataset name are placeholders.
      use hdf5
      use mpi
      implicit none
      integer :: error, myid, numprocs, i, rank2
      integer :: NX, NY
      real :: rand
      real, allocatable :: derps(:)
      integer(HID_T) :: file_id, dset_id, subspace_id, fullspace_id, chunk_id
      integer(HSIZE_T) :: dims_full(2), chunkdims(2), count(2)
      integer(HSIZE_T) :: offset_in(2), offset_out(2)
      character(len=*), parameter :: filename = "roundrobin.h5"
      character(len=*), parameter :: dsetname = "derps"

      rank2 = 2
      call MPI_INIT( error )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, error )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, error )
      CALL h5open_f(error)
      NX = 500000
      NY = numprocs
      allocate ( derps(NX) )
      do i = 1, NX
         call random_number(rand)
         derps(i) = 10000 * rand
      enddo
      dims_full(1) = NX
      dims_full(2) = NY
      chunkdims(1) = NX
      chunkdims(2) = NY
      count(1) = NX
      count(2) = 1
      offset_in(1) = 0
      offset_in(2) = 0

      call h5screate_simple_f(rank2, count, subspace_id, error)      ! partial (memory) space
      call h5screate_simple_f(rank2, dims_full, fullspace_id, error) ! full (file) space
      call h5pcreate_f(H5P_DATASET_CREATE_F, chunk_id, error)
      call h5pset_chunk_f(chunk_id, rank2, chunkdims, error)
      call h5pset_deflate_f(chunk_id, 4, error)

! Loop over ranks, each rank writing to the same file

      do i = 0, numprocs-1
         if (myid.eq.i) then
            offset_out(1) = 0
            offset_out(2) = i
            if (myid.eq.0) then
               CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
            else
               CALL h5fopen_f(filename, H5F_ACC_RDWR_F, file_id, error)
            endif
            if (myid.eq.0) then
               CALL h5dcreate_f(file_id, dsetname, H5T_NATIVE_REAL, &
                                fullspace_id, dset_id, error, chunk_id)
            else
               call h5dopen_f(file_id, dsetname, dset_id, error)
            endif
            call h5sselect_hyperslab_f(subspace_id, H5S_SELECT_SET_F, &
                                       offset_in, count, error)
            call h5sselect_hyperslab_f(fullspace_id, H5S_SELECT_SET_F, &
                                       offset_out, count, error)
            call h5dwrite_f(dset_id, H5T_NATIVE_REAL, derps, count, error, &
                            subspace_id, fullspace_id)
            CALL h5dclose_f(dset_id, error)
            CALL h5fclose_f(file_id, error)
         endif
         call mpi_barrier(MPI_COMM_WORLD, error)
      enddo
      CALL h5sclose_f(subspace_id, error)
      CALL h5sclose_f(fullspace_id, error)
      CALL h5pclose_f(chunk_id, error)
      CALL h5close_f(error)
      call mpi_finalize(error)

Even though the h5pset_* calls are outside of the i loop, the compression
filters are apparently triggered during h5dwrite_f, which must be inside the i
loop. I presume there is no way to change that behavior in userspace.

I guess what Rob suggested is the only way to go for doing parallel
compression with HDF5. There just doesn't appear to be a way to have each
MPI process do its compression in parallel. I can live with the writing being
sequential, as that happens pretty fast.

···

On Thu, Dec 9, 2010 at 11:52 AM, Leigh Orf <leigh.orf@gmail.com> wrote:

On Thu, Dec 9, 2010 at 11:40 AM, Rob Latham <robl@mcs.anl.gov> wrote:


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

> Thanks for the information. After I sent my email I realized I left out some
> relevant information. I am not using pHDF5 but regular HDF5, but in a
> parallel environment. The only reason I am doing this is because I want the
> ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit,
> etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot do
> compression.

> I currently have tried two approaches with compression and HDF5 in a
> parallel environment: (1) Each MPI rank writes its own compressed HDF5 file.
> (2) I create a new MPI communicator (call it subcomm) which operates on a
> sub-block of the entire domain. Each instance of subcomm (which could, for
> instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of
> subcomm, and that root core does the compression and writes to disk.

What if you still did collective writes with parallel-HDF5, but you
did a little additional work in the application. If you compress each
portion of data on each MPI rank, then ask HDF5 to write out that
compressed buffer, blammo, you get parallel compression and parallel
I/O. It's not as seamless as if you asked HDF5 to do the compression
for you: I guess you'd have to find a stream-based compression
algorithm (gzip?) that can work on concatenated blocks, and annotate
the dataset with the compression algorithm you selected.

I'd really like to be able to have HDF5 do the compression because I have grown quite accustomed to how transparent it is. The filters are just activated, and regardless of how you compress the data, you 'see' floating point data when you open the file or run h5dump or whatever.

I could code things up to have each core open each hdf5 file, write each part of its file, close it, and hand it off to the next guy, but I just have to believe that's going to be really inefficient. It seems there should be a way to do this by passing file handles or property lists from one MPI process to another.

  We've been working on some changes that allow the metadata cache to be flushed and refreshed without re-opening the whole file. That might apply here, but we weren't thinking about concurrent processes writing to the file, so it might require further work to support that use case. We'd also have to update the file free space information, which would be more work. Hmm... I'll bounce it around with some of the DOE folks who may be able to help fund this, but it would be a good use case for getting some funding to support Blue Waters, also.

I did find this page called "Collective HDF5 Calls in Parallel" which is interesting but it is unclear to me whether it applies to pHDF5 or just plain HDF5.

  That applies to using HDF5 with the MPI-IO or MPI-POSIX VFDs (which implicitly requires a parallel program using MPI).

  Quincey

···

On Dec 9, 2010, at 12:52 PM, Leigh Orf wrote:

On Thu, Dec 9, 2010 at 11:40 AM, Rob Latham <robl@mcs.anl.gov> wrote:
On Thu, Dec 09, 2010 at 10:57:28AM -0700, Leigh Orf wrote:


I've had to deal with the space allocation issue for different reasons
using custom compression filters (Peter Lindstrom's FPZIP and HZIP for
structured meshes of hexes or tets and variables thereon).

I think the HDF5 lib could 'solve' the allocation problem using an approach
I took. However, you do have to 'get comfortable' with the idea that you
might not utilize space in the file 100% optimally.

Here is how it would work. Define a target compression ratio, R:1, that
*must* be achieved for a given dataset. If the dataset is N bytes
uncompressed, it will be NO MORE than N/R bytes compressed. Allocate N/R
bytes in the file for this dataset. If you succeed in compressing by at
least a ratio of R, you're golden. If not, fail the write and return an
'unable to compress to target ratio' error. The caller can decide to try
again with a different target ratio (which will probably require some
collective communication, as all procs will need to know the new size).

If you succeed and compress by MORE than a ratio of R, you waste some
space in the file. So what? Disk is cheap!

Sometimes, you can take a small sample of the dataset (say the first M
bytes, or some bytes from beginning, middle and end), compress it
yourself to get an approximate idea of how 'compressible' it might be
and then set R based on that quick approximation. In addition, if HDF5
returned to you information on how 'well' it was doing relative to
compression targets (something like 'did better than target ratio by
10%' or 'missed target ratio by 3 %'), you can adjust target ratio as
necessary.
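
A small C sketch of that scheme, using zlib to stand in for whatever filter
is actually in play (the ratio R and the budget arithmetic are the point, not
the particular compressor):

#include <stdlib.h>
#include <zlib.h>

/* Reserve nbytes/R for the dataset; fail the write if the data does not
 * compress at least R:1. The caller may retry with a smaller R. */
static int compress_to_budget(const void *data, uLong nbytes, double R,
                              unsigned char **out, uLongf *outlen)
{
    uLongf budget = (uLongf)(nbytes / R);   /* space reserved in the file */
    *out = malloc(budget);
    if (*out == NULL)
        return -1;
    *outlen = budget;
    /* compress2 fails with Z_BUF_ERROR if the result would exceed the budget */
    if (compress2(*out, outlen, (const Bytef *)data, nbytes, 6) != Z_OK) {
        free(*out);
        *out = NULL;
        return -1;   /* "unable to compress to target ratio" */
    }
    return 0;        /* *outlen <= budget; any leftover space is simply wasted */
}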

Mark

···

On Tue, 2010-12-14 at 05:40 -0800, Quincey Koziol wrote:

  The primary problem is the space allocation that has to happen when
data is compressed. This is particularly a problem when performing
independent I/O, since the other processes aren't involved, but
[eventually] need to know about space that was allocated. Collective
I/O is easier, but still will require changes to HDF5, etc. Are you
wanting to use collective or independent I/O for your dataset writing?

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

I'm pretty sure the trick was to use O_NOATIME in the open() call,
except on task 0. (You can find this on the NICS webpage on I/O best
practices.)
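
A tiny C sketch of that trick (O_NOATIME is a GNU extension, so this is
Linux-specific and needs _GNU_SOURCE; on some systems it also requires
owning the file):

#define _GNU_SOURCE          /* for O_NOATIME */
#include <fcntl.h>

/* Skip atime updates on every task except rank 0, so only one task
 * touches that piece of metadata on the parallel file system. */
static int open_for_parallel_read(const char *path, int rank)
{
    int flags = O_RDONLY;
#ifdef O_NOATIME
    if (rank != 0)
        flags |= O_NOATIME;
#endif
    return open(path, flags);
}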

I know I came across it in the lit review for our HDF5/Lustre paper,
but I can't put my fingers on the paper. I vaguely recall
a scaling graph showing how this outperformed a regular open() call,
and I think the text was 1-column wide... I'll keep looking through my
woefully unorganized pile of PDFs.

Mark

···

On Tue, Dec 14, 2010 at 8:08 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

If not, there is another optimization that I think was reported in a
paper on PLFS or Adios about passing a flag to the fopen call on each
MPI task that tells it not to update the creation/modification time
except on the root task. This can greatly reduce the load on the
metadata server for a parallel file system.

   Interesting, can you send me a reference for this?

Hi Leigh,

> Thanks for the information. After I sent my email I realized I left out
> some relevant information. I am not using pHDF5 but regular HDF5, but in a
> parallel environment. The only reason I am doing this is because I want the
> ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit,
> etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot do
> compression.

        We are working toward it, but it's going to be about a year away.

> I currently have tried two approaches with compression and HDF5 in a
> parallel environment: (1) Each MPI rank writes its own compressed HDF5 file.
> (2) I create a new MPI communicator (call it subcomm) which operates on a
> sub-block of the entire domain. Each instance of subcomm (which could, for
> instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of
> subcomm, and that root core does the compression and writes to disk. The
> problem with (1) is there are too many files with large simulations, the
> problem with (2) is rank 0 is operating on a lot of data and the compression
> code slows things down dramatically - rank 0 cranks away while the other
> ranks are at a barrier. So I am trying a third approach where you still have
> subcomm, but instead of doing the MPI_GATHER, each core writes, in a
> round-robin fashion, to the file created by rank 0 of subcomm. I am hoping
> that I'll get the benefits of compression (being done in parallel) and not
> suffer a huge penalty for the round-robin approach.
>
> If there were a way to do compressed pHDF5 I'd just do a hybrid approach
> where each subcomm root node wrote (in parallel) to its HDF5 file. In this
> case, I would presume that the computationally expensive compression
> algorithms would be parallelized efficiently. Our goal is to reduce the
> number of compressed hdf5 files. Not all the way to 1 file, but not 1 file
> per MPI rank. We are not using OpenMP and probably will not be in the
> future.

        The primary problem is the space allocation that has to happen when
data is compressed. This is particularly a problem when performing
independent I/O, since the other processes aren't involved, but [eventually]
need to know about space that was allocated. Collective I/O is easier, but
still will require changes to HDF5, etc. Are you wanting to use collective
or independent I/O for your dataset writing?

Quincey,

Probably a combination of both, namely, an ideal situation would be a group
of MPI ranks collectively writing one compressed HDF5 file. On Blue Waters a
100kcore run with 32 cores/MCM could therefore result in say around 3000
files, which is not unreasonable.

Maybe I'm thinking about this too simply, but couldn't you compress the data
on each MPI rank, save it in a buffer, calculate the space required, and the
write it? I don't know enough about the internal workings of hdf5 to know
whether that would fit in the HDF5 model. In our particular application on
Blue Waters, memory is cheap, so there is lots of space in memory for
buffering data.

Leigh

···

On Tue, Dec 14, 2010 at 6:40 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

On Dec 9, 2010, at 11:57 AM, Leigh Orf wrote:


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

On Friday 10 December 2010 01:13:50, Leigh Orf wrote:

I am fully convinced now that you can't run hdf5's compression code
in parallel, where the number of cores compressing is greater than
the number of files being written. You can only run the compression
filters 1:1 to each compressed file. I wrote up my sequential code
and quickly realized that while I can write it such that each core
will compress the data, the compression itself will happen
sequentially, not in parallel.

In case you want to make use of all your cores, you may want to use
Blosc (http://blosc.pytables.org/trac), which lets you use any number
of cores (the number is configurable) via multithreading. It uses
a static pool of threads, so it is pretty efficient (although it works
best for relatively large chunksizes, typically >= 1 MB).
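
For what it's worth, a minimal C sketch of calling Blosc with its thread pool
(the compression level and shuffle settings here are arbitrary choices):

#include <blosc.h>

/* Compress a rank-local buffer using several threads; returns the
 * compressed size, or <= 0 if the data could not be packed into dest. */
static int blosc_pack(const float *src, size_t nfloats,
                      void *dest, size_t destsize, int nthreads)
{
    blosc_init();
    blosc_set_nthreads(nthreads);            /* size of the static thread pool */
    int csize = blosc_compress(5,            /* compression level */
                               1,            /* byte-shuffle on */
                               sizeof(float),
                               nfloats * sizeof(float),
                               src, dest, destsize);
    blosc_destroy();
    return csize;
}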

···

--
Francesc Alted

I guess I neglected to add...

My comments assume a common use case: that datasets are stored
CONTIGUOUSLY (not blocked). The HDF5 lib doesn't support filters on CONTIG
datasets presently. I'd like to see that change.

For a lot of HPC applications using Poor Man's Parallel I/O, datasets
are written in their entirety in a single H5Dwrite call and DO NOT
require any special 'arranging' in the file to optimize for possible
future partial reads (e.g. blocking) either. They are read back in their
entirety in a single H5Dread call.

Nonetheless, even if datasets are blocked, the allocation problem
(and the pseudo-solution from my earlier message) needs to work on a
block-by-block basis.
Mark

···

On Tue, 2010-12-14 at 08:45 -0800, Mark Miller wrote:


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511

Neat trick. I can definitely see how this might help Lustre. Not sure
if it will help GPFS or PVFS as much. Hope you find that PDF!

This sort of trick definitely belongs in the MPI-IO library, but
O_NOATIME on my system is protected by "#ifdef __USE_GNU".

For the sake of portability, we try to avoid non-standard flags in MPI
land, but this optimization is easy and presumably worthwhile so I'll
ask our hard-nosed "portability guys" how we can use this.

==rob

···

On Tue, Dec 14, 2010 at 12:03:05PM -0500, Mark Howison wrote:

On Tue, Dec 14, 2010 at 8:08 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:
>> If not, there is another optimization that I think was reported in a
>> paper on PLFS or Adios about passing a flag to the fopen call on each
>> MPI task that tells it not to update the creation/modification time
>> except on the root task. This can greatly reduce the load on the
>> metadata server for a parallel file system.
>
> Interesting, can you send me a reference for this?

I'm pretty sure the trick was to use O_NOATIME in the open() call,
except on task 0. (You can find this on NICS webpage on I/O best
practices.)

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Leigh,

···

On Dec 14, 2010, at 4:52 PM, Leigh Orf wrote:

On Tue, Dec 14, 2010 at 6:40 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:
Hi Leigh,

On Dec 9, 2010, at 11:57 AM, Leigh Orf wrote:

> Thanks for the information. After I sent my email I realized I left out some relevant information. I am not using pHDF5 but regular HDF5, but in a parallel environment. The only reason I am doing this is because I want the ability to write compressed HDF5 files (gzip, szip, scale-offset, nbit, etc.). As I understand it, at this point (and maybe forever) pHDF5 cannot do compression.

       We are working toward it, but it's going to be about a year away.

> I currently have tried two approaches with compression and HDF5 in a parallel environment: (1) Each MPI rank writes its own compressed HDF5 file. (2) I create a new MPI communicator (call it subcomm) which operates on a sub-block of the entire domain. Each instance of subcomm (which could, for instance, operate on one multicore chip) does a MPI_GATHER to rank 0 of subcomm, and that root core does the compression and writes to disk. The problem with (1) is there are too many files with large simulations, the problem with (2) is rank 0 is operating on a lot of data and the compression code slows things down dramatically - rank 0 cranks away while the other ranks are at a barrier. So I am trying a third approach where you still have subcomm, but instead of doing the MPI_GATHER, each core writes, in a round-robin fashion, to the file created by rank 0 of subcomm. I am hoping that I'll get the benefits of compression (being done in parallel) and not suffer a huge penalty for the round-robin approach.
>
> If there were a way to do compressed pHDF5 I'd just do a hybrid approach where each subcomm root node wrote (in parallel) to its HDF5 file. In this case, I would presume that the computationally expensive compression algorithms would be parallelized efficiently. Our goal is to reduce the number of compressed hdf5 files. Not all the way to 1 file, but not 1 file per MPI rank. We are not using OpenMP and probably will not be in the future.

       The primary problem is the space allocation that has to happen when data is compressed. This is particularly a problem when performing independent I/O, since the other processes aren't involved, but [eventually] need to know about space that was allocated. Collective I/O is easier, but still will require changes to HDF5, etc. Are you wanting to use collective or independent I/O for your dataset writing?

Quincey,

Probably a combination of both, namely, an ideal situation would be a group of MPI ranks collectively writing one compressed HDF5 file. On Blue Waters a 100kcore run with 32 cores/MCM could therefore result in say around 3000 files, which is not unreasonable.

Maybe I'm thinking about this too simply, but couldn't you compress the data on each MPI rank, save it in a buffer, calculate the space required, and then write it? I don't know enough about the internal workings of hdf5 to know whether that would fit in the HDF5 model. In our particular application on Blue Waters, memory is cheap, so there is lots of space in memory for buffering data.

  What you say above is basically what happens, except that space in the file needs to be allocated for each block of compressed data. Since each block is not the same size, the HDF5 library can't pre-allocate the space or algorithmically determine how much to reserve for each process. In the case of collective I/O, at least it's theoretically possible for all the processes to communicate and work it out, but I'm not certain it's going to be solvable for independent I/O, unless we reserve some processes to either allocate space (like a "free space server") or buffer the "I/O", etc.
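
In the collective case, the size exchange Quincey mentions could look
something like this C sketch (each rank is assumed to have already compressed
its block to my_csize bytes; all names are placeholders):

#include <mpi.h>
#include <stdlib.h>

/* Allgather the per-rank compressed sizes, then compute this rank's byte
 * offset into the compressed region and the total space to allocate. */
static void compressed_layout(MPI_Comm comm, unsigned long my_csize,
                              unsigned long *my_offset, unsigned long *total)
{
    int nprocs, myrank;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &myrank);

    unsigned long *sizes = malloc(nprocs * sizeof(unsigned long));
    MPI_Allgather(&my_csize, 1, MPI_UNSIGNED_LONG,
                  sizes, 1, MPI_UNSIGNED_LONG, comm);

    *my_offset = 0;
    *total = 0;
    for (int i = 0; i < nprocs; i++) {
        if (i < myrank)
            *my_offset += sizes[i];   /* bytes written by lower ranks */
        *total += sizes[i];
    }
    free(sizes);
}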

  Quincey

Hi Leigh,

[snipped for brevity]

Quincey,

Probably a combination of both, namely, an ideal situation would be a group
of MPI ranks collectively writing one compressed HDF5 file. On Blue Waters a
100kcore run with 32 cores/MCM could therefore result in say around 3000
files, which is not unreasonable.

Maybe I'm thinking about this too simply, but couldn't you compress the
data on each MPI rank, save it in a buffer, calculate the space required,
and then write it? I don't know enough about the internal workings of hdf5 to
know whether that would fit in the HDF5 model. In our particular application
on Blue Waters, memory is cheap, so there is lots of space in memory for
buffering data.

What you say above is basically what happens, except that space in the file
needs to be allocated for each block of compressed data. Since each block
is not the same size, the HDF5 library can't pre-allocate the space or
algorithmically determine how much to reserve for each process. In the case
of collective I/O, at least it's theoretically possible for all the
processes to communicate and work it out, but I'm not certain it's going to
be solvable for independent I/O, unless we reserve some processes to either
allocate space (like a "free space server") or buffer the "I/O", etc.

Could you make this work by forcing each core to have some specific chunking
arrangement? For instance, you could have each core's dimensions simply be
the same dimensions as each chunk, which actually works out pretty well in my
application, at least in the horizontal. I typically have nxchunk=nx,
nychunk=ny, and nzchunk set to something like 20 or so. But now that I
think about it, even if that were the case you don't know the size of the
compressed chunks until you've compressed them, and you'd still need to
communicate the size of the compressed chunks amongst the cores writing to an
individual file.

I don't know enough about hdf5 to understand how the preallocation process
works. It sounds like you are allocating a bunch of zeroes (or something) on
disk first, and then doing I/O straight to that space on disk? If this is
the case then I can see how this necessitates some kind of collective
communication if you are splitting up compression amongst MPI ranks.

Personally I am perfectly happy with a bit of overhead which forces all
cores to share amongst themselves what the compressed block size is before
writing, if it means we can do compression. Right now I see my choices as
being (1) compression, but 1 file per MPI rank and lots of files, or (2) no
compression and fewer files, but perhaps compressing later on using h5repack,
calling it in parallel, one h5repack per MPI rank, as a post-processing step
(yuck!).

I'm glad you're working on this, personally I think this is important stuff
for really huge simulations. In talking to other folks who will be using
Blue Waters, compression is not much of an issue with many of them because
of the nature of their data. Cloud data especially tends to compress very
well. It would be a shame to fill terabytes of disk space with zeroes! I am
sure we can still carry out our research objectives without compression, but
the sheer amount of data we will be producing is staggering even with
compression.

Leigh

···

On Tue, Dec 14, 2010 at 5:42 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:


--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200

Hi Leigh,

I guess I am still interested to know whether an approach where you
specify a minimum target compression ratio and then allow HDF5 to
(possibly over-)allocate assuming a max compressed size would work for
you?

Mark

···

On Wed, 2010-12-15 at 10:59, Leigh Orf wrote:


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511