Slow or buggy using H5Sselect_elements

Hi, HDF users,

I am trying to write a 1D array from several processes in an anarchic way.
Each proc has a subset of the array to write, but its elements are neither
contiguous nor sorted. Each proc knows the positions where it should write
each of its elements.

With some help from the thread

http://hdf-forum.184993.n3.nabble.com/HDF5-Parallel-write-selection-using-hyperslabs-slow-write-tp3935966.html
I tried to implement it.

First, the master proc (only that one) creates the file:

        // create file
        hid_t fid = H5Fcreate( name.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        // prepare file space
        hsize_t dims[2] = {1, global_nElements};
        hsize_t max_dims[2] = {H5S_UNLIMITED, global_nElements}; // not really needed but for future use
        hid_t file_space = H5Screate_simple(2, dims, max_dims);
        // prepare dataset
        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(plist, H5D_CHUNKED);
        hsize_t chunk_dims[2] = {1, global_nElements};
        H5Pset_chunk(plist, 2, chunk_dims);
        // create dataset
        hid_t did = H5Dcreate(fid, "Id", H5T_NATIVE_UINT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);

        H5Dclose(did);
        H5Pclose(plist);
        H5Sclose(file_space);
        H5Fclose( fid );

Then, all procs open the file and write their subset:

    // define MPI file access
    hid_t file_access = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio( file_access, MPI_COMM_WORLD, MPI_INFO_NULL );
    // define MPI transfer mode
    hid_t transfer = H5Pcreate(H5P_DATASET_XFER);
    // Open the file
    hid_t fid = H5Fopen( name.c_str(), H5F_ACC_RDWR, file_access);
    // Open the existing dataset
    hid_t did = H5Dopen( fid, dataset.c_str(), H5P_DEFAULT );
    // Get the file space
    hid_t file_space = H5Dget_space(did);
    // Define the memory space for this proc
    hsize_t count[2] = {1, (hsize_t) local_nElements};
    hid_t mem_space = H5Screate_simple(2, count, NULL);
    // Select the elements for this particular proc (the `coords` array has been created before)
    H5Sselect_elements( file_space, H5S_SELECT_SET, local_nElements, coords );
    // Write the previously generated `data` array
    H5Dwrite( did, H5T_NATIVE_UINT, mem_space, file_space, transfer, data );
    // Close stuff
    H5Sclose(file_space);
    H5Dclose(did);
    H5Fclose( fid );

This version works but is VERY SLOW: more than 10 times slower than writing
with 1 proc without H5Sselect_elements.
Is this to be expected? Is there a way to make it faster?

Using H5Pget_mpio_actual_io_mode, I realized that it was not using
collective transfer, so I tried to force it using the following:

    H5Pset_dxpl_mpio( transfer, H5FD_MPIO_COLLECTIVE);
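
For reference, the check with H5Pget_mpio_actual_io_mode looks roughly like this (just a sketch; it assumes the `transfer` property list above and must run after the H5Dwrite call):

    // query which I/O mode the last H5Dwrite on this transfer plist actually used
    H5D_mpio_actual_io_mode_t actual_io_mode;
    H5Pget_mpio_actual_io_mode( transfer, &actual_io_mode );
    // H5D_MPIO_NO_COLLECTIVE means the write was done independently
    if( actual_io_mode == H5D_MPIO_NO_COLLECTIVE )
        cout << "collective transfer was not used" << endl;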

But unfortunately, I get tons of the following error:

HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 0:
  #000: H5Dio.c line 271 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 352 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 788 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 757 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 685 in H5D__chunk_collective_io(): couldn't finish linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 881 in H5D__link_chunk_collective_io(): couldn't finish shared collective MPI-IO
    major: Data storage
    minor: Can't get value
  #006: H5Dmpio.c line 1401 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #007: H5Dmpio.c line 1445 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #008: H5Dmpio.c line 297 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #009: H5Fio.c line 171 in H5F_block_write(): write through metadata accumulator failed
    major: Low-level I/O
    minor: Write failed
  #010: H5Faccum.c line 825 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #011: H5FDint.c line 246 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #012: H5FDmpio.c line 1802 in H5FD_mpio_write(): MPI_File_set_view failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #013: H5FDmpio.c line 1802 in H5FD_mpio_write(): MPI_ERR_ARG: invalid argument of some other kind
    major: Internal error (too specific to document in detail)
    minor: MPI Error String

The same happens with both HDF5 1.8.14 and 1.8.15.

Any ideas how to fix this?

Thank you
Fred

   #013: H5FDmpio.c line 1802 in H5FD_mpio_write(): MPI_ERR_ARG: invalid argument of some other kind
     major: Internal error (too specific to document in detail)
     minor: MPI Error String

I'm surprised and disappointed there's not a better error message here.

Why isn't HDF5 calling MPI_Error_string and displaying what the MPI implementation is trying to tell us is wrong?
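
For reference, a minimal sketch of what that call looks like from application code (assuming `mpi_err` holds the return code of the failing MPI call):

    char msg[MPI_MAX_ERROR_STRING];
    int msg_len = 0;
    // ask the MPI implementation for its human-readable description of the error code
    MPI_Error_string( mpi_err, msg, &msg_len );
    fprintf( stderr, "MPI error: %s\n", msg );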

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Frederic,

Yes, writing in parallel to a dataset with point selections independently is going to be slow. This is expected.
However, doing it collectively should not be slow and should work.

May I bother you to write a program that reproduces the failures that you see with collective I/O? The code that you pasted will not compile since it’s missing some variables (coords, local_nElements, etc.).

Thanks,
Mohamad

Good question! I’ll file a JIRA issue.

Elena

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Sep 3, 2015, at 10:56 AM, Rob Latham <robl@mcs.anl.gov> wrote:

Why isn't HDF5 calling MPI_Error_string and displaying what the MPI implementation is trying to tell us is wrong?

In response to Mohamad Chaarawi, here is a full code example that compiles, but it surprisingly does not raise the errors that I obtained with my full code. I will investigate that. In the meantime, I found that the short version below is no faster when I remove the line that sets H5FD_MPIO_COLLECTIVE.

Cheers
Fred

#include <mpi.h>
#include <hdf5.h>
#include <iostream>
#include <sstream>
#include <vector>
#include <algorithm>
#include <ctime>

using namespace std;

int main (int argc, char* argv[])
{
    int mpi_provided, sz, rk;
    MPI_Init_thread( &argc, &argv, MPI_THREAD_FUNNELED, &mpi_provided );
    MPI_Comm_size( MPI_COMM_WORLD, &sz );
    MPI_Comm_rank( MPI_COMM_WORLD, &rk );

    hsize_t local_nElements = 10000;
    hsize_t global_nElements = sz*local_nElements;

    vector<int> vec;
    vec.resize(1);

    if (rk==0) {
        // create file
        hid_t fid = H5Fcreate( "test.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
H5P_DEFAULT);
        // prepare file space
        hsize_t dims[2] = {1, global_nElements};
        hsize_t max_dims[2] = {H5S_UNLIMITED, global_nElements}; // not really needed but for future use
        hid_t file_space = H5Screate_simple(2, dims, max_dims);
        // prepare dataset
        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(plist, H5D_CHUNKED);
        hsize_t chunk_dims[2] = {1, global_nElements};
        H5Pset_chunk(plist, 2, chunk_dims);
        // create dataset
        hid_t did = H5Dcreate(fid, "Id", H5T_NATIVE_INT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);

        H5Dclose(did);
        H5Pclose(plist);
        H5Sclose(file_space);
        H5Fclose( fid );

        // make a randomized vector
        vec.resize(global_nElements);
        for(int i=0; i<global_nElements; i++) vec[i]=i;
        random_shuffle(vec.begin(), vec.end());
    }

    // Scatter the randomized vector
    int * v = &vec[0];
    int * data = new int[local_nElements];
    MPI_Scatter( v, local_nElements, MPI_INT, data, local_nElements, MPI_INT, 0, MPI_COMM_WORLD);

    hsize_t * coords = new hsize_t[local_nElements*2];
    for(int i=0; i<local_nElements; i++) {
        coords[i*2 ] = 0;
        coords[i*2+1] = data[i];
    }

    clock_t start = clock();

    // define MPI file access
    hid_t file_access = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio( file_access, MPI_COMM_WORLD, MPI_INFO_NULL );
    // define MPI transfer mode
    hid_t transfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio( transfer, H5FD_MPIO_COLLECTIVE);
    // Open the file
    hid_t fid = H5Fopen( "test.h5", H5F_ACC_RDWR, file_access);
    // Open the existing dataset
    hid_t did = H5Dopen( fid, "Id", H5P_DEFAULT );
    // Get the file space
    hid_t file_space = H5Dget_space(did);
    // Define the memory space for this proc
    hsize_t count[2] = {1, (hsize_t) local_nElements};
    hid_t mem_space = H5Screate_simple(2, count, NULL);
    // Select the elements for this particular proc (the `coords` array has been created before)
    H5Sselect_elements( file_space, H5S_SELECT_SET, local_nElements, coords );
    // Write the previously generated `data` array
    H5Dwrite( did, H5T_NATIVE_INT, mem_space, file_space, transfer, data );
    // Close stuff
    H5Sclose(file_space);
    H5Dclose(did);
    H5Fclose( fid );

    double duration = ( clock() - start ) / (double) CLOCKS_PER_SEC;
    cout << rk << " " << duration << endl;

    MPI_Finalize();
}
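
One side note on the timing: clock() measures CPU time rather than wall-clock time, so it can under-report time spent waiting on I/O. A sketch of the same measurement with MPI_Wtime (wall-clock seconds), for anyone comparing numbers:

    double t0 = MPI_Wtime();
    // ... the H5Fopen / H5Dwrite / close sequence above ...
    double elapsed = MPI_Wtime() - t0;
    cout << rk << " " << elapsed << endl;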

I finally found a way to reproduce the errors I sent in the first post
(pasted code below).

The problem is that the collective data transfer does not work after using
H5Dset_extent. This behaviour does not even depend on H5Sselect_elements,
as I initially thought.

Does anybody know if there is something wrong with my use of H5Dset_extent?

Does anybody know why I do not get any performance improvement when using H5FD_MPIO_COLLECTIVE?

Thank you for your help
Fred

#include <mpi.h>
#include <hdf5.h>
#include <iostream>
#include <sstream>
#include <vector>
#include <algorithm>
#include <ctime>

using namespace std;

int main (int argc, char* argv[])
{
    int mpi_provided, sz, rk;
    MPI_Init_thread( &argc, &argv, MPI_THREAD_FUNNELED, &mpi_provided );
    MPI_Comm_size( MPI_COMM_WORLD, &sz );
    MPI_Comm_rank( MPI_COMM_WORLD, &rk );

    hsize_t local_nElements = 10000;
    hsize_t global_nElements = sz*local_nElements;

    vector<int> vec;
    vec.resize(1);

    hsize_t dims[2] = {0, global_nElements};

    if (rk==0) {
        // create file
        hid_t fid = H5Fcreate( "test.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
H5P_DEFAULT);
        // prepare file space
        hsize_t max_dims[2] = {H5S_UNLIMITED, global_nElements}; // not really needed but for future use
        hid_t file_space = H5Screate_simple(2, dims, max_dims);
        // prepare dataset
        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(plist, H5D_CHUNKED);
        hsize_t chunk_dims[2] = {1, global_nElements};
        H5Pset_chunk(plist, 2, chunk_dims);
        // create dataset
        hid_t did = H5Dcreate(fid, "Id", H5T_NATIVE_INT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);

        H5Dclose(did);
        H5Pclose(plist);
        H5Sclose(file_space);
        H5Fclose( fid );

        // make a randomized vector
        vec.resize(global_nElements);
        for(int i=0; i<global_nElements; i++) vec[i]=i;
        random_shuffle(vec.begin(), vec.end());
    }

    // Scatter the randomized vector
    int * v = &vec[0];
    int * data = new int[local_nElements];
    MPI_Scatter( v, local_nElements, MPI_INT, data, local_nElements, MPI_INT, 0, MPI_COMM_WORLD);

    hsize_t * coords = new hsize_t[local_nElements*2];
    for(int i=0; i<local_nElements; i++) {
        coords[i*2 ] = 0;
        coords[i*2+1] = data[i];
    }

    clock_t start = clock();

    // define MPI file access
    hid_t file_access = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio( file_access, MPI_COMM_WORLD, MPI_INFO_NULL );
    // define MPI transfer mode
    hid_t transfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio( transfer, H5FD_MPIO_COLLECTIVE);
    // Open the file
    hid_t fid = H5Fopen( "test.h5", H5F_ACC_RDWR, file_access);
    // Open the existing dataset
    hid_t did = H5Dopen( fid, "Id", H5P_DEFAULT );
    dims[0] ++;
    H5Dset_extent(did, dims);
    // Get the file space
    hid_t file_space = H5Dget_space(did);
    // Define the memory space for this proc
    hsize_t count[2] = {1, (hsize_t) local_nElements};
    hid_t mem_space = H5Screate_simple(2, count, NULL);
    // Select the elements for this particular proc (the `coords` array has been created before)
    H5Sselect_elements( file_space, H5S_SELECT_SET, local_nElements, coords );
    // Write the previously generated `data` array
    H5Dwrite( did, H5T_NATIVE_INT, mem_space, file_space, transfer, data );
    H5D_mpio_actual_io_mode_t actual_io_mode;
    H5Pget_mpio_actual_io_mode(transfer, &actual_io_mode);
    cout << "rank " << rk << " - actual_io_mode " << actual_io_mode << endl;
    // Close stuff
    H5Sclose(file_space);
    H5Dclose(did);
    H5Fclose( fid );

    double duration = ( clock() - start ) / (double) CLOCKS_PER_SEC;
    cout << "rank " << rk << ", duration = " << duration << endl;

    MPI_Finalize();
}

I haven’t checked the performance yet, but the error you are running into is a known issue that you can fix by either:

1) Create the file on process 0 using the MPI-IO VFD with MPI_COMM_SELF:

        hid_t file_access = H5Pcreate(H5P_FILE_ACCESS);

        H5Pset_fapl_mpio( file_access, MPI_COMM_SELF, MPI_INFO_NULL );

2) Or create the dataset on process 0 with space allocation set to H5D_ALLOC_TIME_EARLY:

        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(plist, H5D_CHUNKED);
        H5Pset_alloc_time(plist, H5D_ALLOC_TIME_EARLY);

This should fix the error you are seeing. The problem is that when you create the file on process 0 with the sec2 driver, the space allocation for the dataset is set to LATE by default; however, parallel access requires it to be set to EARLY. We were debating whether this is a bug or just a documentation issue that we should make clearer for dataset creation. I believe we decided on the latter, but I couldn’t find where this was documented, so I’ll follow up on that internally too.
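
For reference, putting option 2 together with the creation code from earlier in the thread, the rank-0 creation block would look roughly like this (a sketch, not tested here):

        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(plist, H5D_CHUNKED);
        hsize_t chunk_dims[2] = {1, global_nElements};
        H5Pset_chunk(plist, 2, chunk_dims);
        // allocate the chunked storage at creation time so later parallel access finds it in place
        H5Pset_alloc_time(plist, H5D_ALLOC_TIME_EARLY);
        hid_t did = H5Dcreate(fid, "Id", H5T_NATIVE_INT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);
        H5Dclose(did);
        H5Pclose(plist);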

I’ll look into performance once I get some time.

Thanks
Mohamad

Thank you for the help. I tried your second solution and it works.

Concerning the performance, I did not do thorough tests, but I actually got better results with higher numbers of procs (up to 32). It looks like it depends a lot on the particular case, however.

Fred
