HDF5 - Parallel write - selection using hyperslabs - slow write

Dear HDF Forum users.

In my program I started using HDF5 1.8.2 and hyperslabs a few weeks ago to write distributed data to a single output file.

The data is an N x 3 matrix (N is known at runtime).

What I am doing (see the code below):
    Open a group. Create a dataset for all the data. Each process selects via hyperslabs where (at which positions) it will write its data.
    The hyperslabs selected by two processes may contain overlapping portions (the overlapping parts hold exactly the same data).
    After that I issue a collective write call. It is this call that consumes a lot of time (approx. 10 min for N = 70e3).
Making the hyperslabs non-overlapping beforehand (which takes less than a second) does not change the execution time significantly.

The program is run on a cluster with GPFS available and enabled, using mpirun and ~100 processes,
but it may also be run on other (non-parallel) file systems.

Does anyone have a hint for me?
Maybe I made a simple rookie mistake that I just can't find.

I would not be writing this post if I had found the answer in the manuals, the tutorials or via Google.

Thanks!

Best Regards,
     Markus

Code-snippet:


##################
     group_id = H5Gopen(file_id, group_name.c_str(), H5P_DEFAULT);

     dataspace_id = H5Screate_simple(2, dims, NULL);
     dataset_id = H5Dcreate(group_id, HDFNamingPolicy::coord_data_name.c_str(), H5T_NATIVE_DOUBLE, dataspace_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
     status = H5Sclose(dataspace_id);

     hid_t filespace = H5Dget_space(dataset_id);

     slab_dims[0] = N; slab_dims[1] = 3;
     dataspace_id = H5Screate_simple(2, slab_dims, NULL);

     slab_dims[0] = 1 ;
     for (int j = 0; j < N; j++)
     {
       offset[0] = this->_coord_map[j]; // figure out where to write the slab to
       // write the row as hyperslab to file
       // with partially overlapping data (the overlapping portions contain the same numbers/data).
        if (j==0) status = H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, slab_dims, NULL);
        else status = H5Sselect_hyperslab(filespace, H5S_SELECT_OR, offset, NULL, slab_dims, NULL);
     }
     hid_t plist_id = H5Pcreate(H5P_DATASET_XFER);
     H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

     // This is the slow part:
     status = H5Dwrite(dataset_id, H5T_NATIVE_DOUBLE, dataspace_id, filespace, plist_id, data);

     status = H5Pclose(plist_id);
     status = H5Sclose(dataspace_id);

     H5Sclose(filespace);
     status = H5Dclose(dataset_id);
     status = H5Gclose(group_id);
#####################

Hi Markus,
  Seems like it should be working, but can you upgrade to 1.8.8 (or the 1.8.9 prerelease) and use the H5Pget_mpio_actual_io_mode() routine to see if collective I/O is occurring? Further actions will depend on the results from that routine...
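  For reference, a minimal sketch of that check (HDF5 >= 1.8.8; it assumes the collective transfer property list plist_id from the snippet above is the one passed to H5Dwrite()):

    H5D_mpio_actual_io_mode_t actual_io_mode;
    status = H5Dwrite(dataset_id, H5T_NATIVE_DOUBLE, dataspace_id, filespace, plist_id, data);
    /* Query the same transfer property list after the write; it records what actually happened. */
    H5Pget_mpio_actual_io_mode(plist_id, &actual_io_mode);
    if (actual_io_mode == H5D_MPIO_NO_COLLECTIVE)
        printf("this rank fell back to independent I/O\n");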

  Quincey


Hi Quincey,

Thanks for your help!
I upgraded to HDF5 1.8.8 (src) as you suggested.
Now H5Pget_mpio_actual_io_mode() is called right after H5Dwrite() and
returns 0 (= H5D_MPIO_NO_COLLECTIVE; the result comes back via the second argument).

I do not understand why H5D_MPIO_NO_COLLECTIVE is returned.
It is true that the data written by a process is non-contiguous; can this be the cause?
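
As a quick check of how fragmented the per-process file selection ends up, something like the following could be printed after building the selection (a sketch; filespace is the selection handle from the code above):

    /* Sketch: report how many hyperslab blocks and elements this rank has selected. */
    H5S_sel_type sel_type = H5Sget_select_type(filespace);
    if (sel_type == H5S_SEL_HYPERSLABS) {
        hssize_t nblocks = H5Sget_select_hyper_nblocks(filespace);
        hssize_t npoints = H5Sget_select_npoints(filespace);
        printf("selection: %lld blocks, %lld elements\n",
               (long long)nblocks, (long long)npoints);
    }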

Best Regards,

    Markus


Hi Markus,

> Thanks for your help!
> I upgraded to HDF5 1.8.8 (src) as you suggested.
> Now H5Pget_mpio_actual_io_mode() is called right after H5Dwrite() and
> returns 0 (= H5D_MPIO_NO_COLLECTIVE; the result comes back via the second argument).

  Ah, OK, for some reason whatever you are doing is forcing the library into using independent I/O.

> I do not understand why H5D_MPIO_NO_COLLECTIVE is returned.
> It is true that the data written by a process is non-contiguous; can this be the cause?

  We've got plans to implement another API routine which would answer this question, but for now you'll have to put some printf() calls in the H5D_mpio_opt_possible routine (in src/H5Dmpio.c) to see which of the conditions that could break collective I/O is getting triggered when your application calls H5Dwrite(). Let me know which condition gets triggered and we can talk about the overall reason.

  Quincey


Hi Quincey,

I did what you suggested (cf. the modified H5D_mpio_opt_possible() below).
According to http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2010-July/003321.html,
having overlapping selections for a write is not a good idea, so I now ensure that no two processes
have overlapping selections.

As before, the write call is the slow part.
The output (np = 10; dataspace: 70e3 x 3 doubles) is:
"
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
after selection
  BEFORE WRITE
H5D_mpio_opt_possible CALLED !!!
H5D_mpio_opt_possible: DONE - retvalue = true
"
After about 10 minutes (all 70e3 x 3 doubles have been written) it prints:
"
   AFTER WRITE
"
The load on the cluster (CPU, memory and network) is low (about 33% on average) before I submit my job.
It also just occurred to me that the data in the .h5 file is scrambled after the write (i.e. not in the order it should be) when I use hyperslabs.
When I use point selection this does not happen (of course, no collective write is possible with point selection ...) and the write is faster.
I have therefore started to clean up the code and attach it again (see below).
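
One thing that might be worth checking (a sketch, under the assumption that the scrambling comes from HDF5 pairing the linearly-read memory buffer with the file-space hyperslab union in increasing-offset order rather than in the order the slabs were OR'ed): fill the buffer and build the selection after sorting the owned rows by their target file offset. The names follow the attached write_mesh_nodes() code; the sorting itself is only illustrative, not a confirmed diagnosis.

    // Sketch: order the owned rows by their file offset before copying and selecting.
    // Needs <vector>, <algorithm>, <utility>; names follow write_mesh_nodes() below.
    std::vector<std::pair<hsize_t, int> > order;            // (file row, local index j)
    for (int j = 0; j < num_coord; j++) {
        hsize_t row = this->_coord_map[j];
        if (real_coords.count(row) > 0)
            order.push_back(std::make_pair(row, j));
    }
    std::sort(order.begin(), order.end());                  // increasing file offset

    for (size_t k = 0; k < order.size(); k++) {
        coord[0] = order[k].first;
        for (int i = 0; i < 3; i++)
            data[k*3 + i] = dcoord[order[k].second*3 + i];
        status = H5Sselect_hyperslab(filespace,
                                     (k == 0) ? H5S_SELECT_SET : H5S_SELECT_OR,
                                     coord, NULL, slab_dims, NULL);
    }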

Thanks again for helping me!

I still think that I am not using the HDF5 library correctly...

Best Regards,

    Markus

######### refined code that does the H5 calls ##############

   #define USE_HYPERSLABS_O

   void write_mesh_nodes(int mype, const std::set<int> & real_coords, const std::string group_name,
       const double * dcoord,
       const int num_coord,
       const int num_global_coords) const
   {
     assert(num_coord == _coord_map_size);

     hid_t group_id, dataset_id, dataspace_id;
     herr_t status;
     hsize_t dims[2];
     int count = 0;
     double * data = NULL;

     // specify the whole dataset
     dims[0] = num_global_coords; // N = ~70e3 for testing
     dims[1] = 3; // fixed to 3

     group_id = H5Gopen(file_id, group_name.c_str(), H5P_DEFAULT);

     dataspace_id = H5Screate_simple(2, dims, NULL);
     dataset_id = H5Dcreate(group_id, HDFNamingPolicy::coordinates_name.c_str(), H5T_NATIVE_DOUBLE, dataspace_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
     status = H5Sclose(dataspace_id);

     hid_t filespace = H5Dget_space(dataset_id);

    hsize_t coord[2];
    coord[0] = 0;

    if(real_coords.size() > 0) PetscMalloc( sizeof(double) * 3 * real_coords.size(), &data);

    hsize_t elem_dim[1];
    elem_dim[0] = 3 * real_coords.size() ;

    dataspace_id = H5Screate_simple(1, elem_dim, NULL);

    H5Sselect_none(filespace); // select none - just to be sure

    // each slab is 1x3 selected and or'ed to the previous selection
    hsize_t slab_dims[2];
    slab_dims[0] = 1 ;
    slab_dims[1] = 3 ;

    coord[1] = 0; // also called offset

    std::cout << " before selection " << std::endl;

    for (int j = 0; j < num_coord; j++)
    {
      coord[0] = this->_coord_map[j]; // figure out where to write the slab to
      // skip data, which we shall not write (no overlaps!)
      if ( real_coords.count(coord[0]) == 0 ) { continue; }
      // copy to temporary buffer for H5Dwrite
      for (int i = 0 ; i < 3; i++) { data[count*3+i] = dcoord[j*3+i]; }

#ifdef USE_HYPERSLABS_O
      if (count == 0) status = H5Sselect_hyperslab(filespace, H5S_SELECT_SET, coord, NULL, slab_dims, NULL);
      else status = H5Sselect_hyperslab(filespace, H5S_SELECT_OR, coord, NULL, slab_dims, NULL);
#else
      coord[1] = 0;
      H5Sselect_elements(filespace, H5S_SELECT_APPEND, 1, coord);
      coord[1] = 1;
      H5Sselect_elements(filespace, H5S_SELECT_APPEND, 1, coord);
      coord[1] = 2;
      H5Sselect_elements(filespace, H5S_SELECT_APPEND, 1, coord);
#endif

      count++; // counts the number of 1x3 blocks to write
    }

    std::cout << "after selection" << std::endl;

    H5D_mpio_actual_io_mode_t actual_io_mode;

    hid_t plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

    // This is the slow part:
    std::cout << " BEFORE WRITE " << std::endl;
    status = H5Dwrite(dataset_id, H5T_NATIVE_DOUBLE, dataspace_id, filespace, plist_id, data);
    std::cout << " AFTER WRITE" << std::endl;

    H5Pget_mpio_actual_io_mode(plist_id, &actual_io_mode);

    status = H5Pclose(plist_id);
    if (data != NULL) { PetscFree(data); data = NULL; }

    H5Sclose(filespace);
    status = H5Dclose(dataset_id);
    status = H5Gclose(group_id);
}


#############################################

######### modified H5D_mpio_opt_possible() ###############
htri_t
H5D_mpio_opt_possible(const H5D_io_info_t *io_info, const H5S_t *file_space,
     const H5S_t *mem_space, const H5D_type_info_t *type_info,
     const H5D_chunk_map_t *fm)
{
     int local_opinion = TRUE; /* This process's idea of whether to perform collective I/O or not */
     int consensus; /* Consensus opinion of all processes */
     int mpi_code; /* MPI error code */
     htri_t ret_value = TRUE;

     FUNC_ENTER_NOAPI(H5D_mpio_opt_possible, FAIL)

     /* Check args */
     HDassert(io_info);
     HDassert(mem_space);
     HDassert(file_space);
     HDassert(type_info);

printf("H5D_mpio_opt_possible CALLED !!!\n"); // <================

     /* For independent I/O, get out quickly and don't try to form consensus */
     if(io_info->dxpl_cache->xfer_mode == H5FD_MPIO_INDEPENDENT)
     {
         printf("INDEPENDENT IO !!\n"); // <================
         HGOTO_DONE(FALSE);
     }

     /* Don't allow collective operations if datatype conversions need to happen */
     if(!type_info->is_conv_noop) {
         local_opinion = FALSE;
         printf("CONVERSION \n"); // <================
         goto broadcast;
     } /* end if */

     /* Don't allow collective operations if data transform operations should occur */
     if(!type_info->is_xform_noop) {
         local_opinion = FALSE;
         printf("DATA TRANSFORM \n"); // <================
         goto broadcast;
     } /* end if */

     /* Optimized MPI types flag must be set and it must be collective IO */
     /* (Don't allow parallel I/O for the MPI-posix driver, since it doesn't do real collective I/O) */
     if(!(H5S_mpi_opt_types_g && io_info->dxpl_cache->xfer_mode == H5FD_MPIO_COLLECTIVE
&& !IS_H5FD_MPIPOSIX(io_info->dset->oloc.file))) {
         local_opinion = FALSE;
         printf("MPI-POSIX \n"); // <================
         goto broadcast;
     } /* end if */

     /* Check whether these are both simple or scalar dataspaces */
     if(!((H5S_SIMPLE == H5S_GET_EXTENT_TYPE(mem_space) || H5S_SCALAR == H5S_GET_EXTENT_TYPE(mem_space))
&& (H5S_SIMPLE == H5S_GET_EXTENT_TYPE(file_space) || H5S_SCALAR == H5S_GET_EXTENT_TYPE(file_space)))) {
         local_opinion = FALSE;
         printf("SIMPLE OR SCALAR\n"); // <================
         goto broadcast;
     } /* end if */

     /* Can't currently handle point selections */
     if(H5S_SEL_POINTS == H5S_GET_SELECT_TYPE(mem_space)
             || H5S_SEL_POINTS == H5S_GET_SELECT_TYPE(file_space)) {
         local_opinion = FALSE;
         printf("POINT SELECTION\n"); // <================
         goto broadcast;
     } /* end if */

     /* Dataset storage must be contiguous or chunked */
     if(!(io_info->dset->shared->layout.type == H5D_CONTIGUOUS ||
             io_info->dset->shared->layout.type == H5D_CHUNKED)) {
         local_opinion = FALSE;
         printf("STORAGE: NOT CONTIGUOUS OR CHUNKED\n"); // <================
         goto broadcast;
     } /* end if */

     /* The handling of memory space is different for chunking and contiguous
      * storage. For contiguous storage, mem_space and file_space won't change
      * when it is doing disk IO. For chunking storage, mem_space will
      * change for different chunks. So for chunking storage, whether we can
      * use collective IO will defer until each chunk IO is reached.
      */

     /* Don't allow collective operations if filters need to be applied */
     if(io_info->dset->shared->layout.type == H5D_CHUNKED) {
         if(io_info->dset->shared->dcpl_cache.pline.nused > 0) {
             local_opinion = FALSE;
             printf("FILTERS \n"); // <================
             goto broadcast;
         } /* end if */
     } /* end if */

broadcast:
     /* Form consensus opinion among all processes about whether to perform
      * collective I/O
      */
     if(MPI_SUCCESS != (mpi_code = MPI_Allreduce(&local_opinion, &consensus, 1, MPI_INT, MPI_LAND, io_info->comm)))
         HMPI_GOTO_ERROR(FAIL, "MPI_Allreduce failed", mpi_code)

     ret_value = consensus > 0 ? TRUE : FALSE;

done:
     printf("H5D_mpio_opt_possible: DONE - retvalue = %s\n", ((ret_value == TRUE) ? "true" : "false" ) ); // <================
     FUNC_LEAVE_NOAPI(ret_value)
} /* H5D_mpio_opt_possible() */
#######################################################
