Slow Reading 721GB File in Parallel

Hi,

I am having trouble reading from a 721 GB file using 4096 nodes.
When I test with a few nodes it works, but when I test with more nodes it
takes significantly more time.
All the test program does is read in the data and then delete it.
Here's the timing information:

Nodes | Time for running entire program
   16 |  4:28
   32 |  6:55
   64 |  8:56
  128 | 11:22
  256 | 13:25
  512 | 15:34

  768 | 28:34
  800 | 29:04

I am running the program on a Cray XK6 system, and the file system is Lustre.

*There is a big jump after 512 nodes, and with 4096 nodes the program
couldn't finish in 6 hours.
Is this normal? Shouldn't it be a lot faster?*

Here is my reading function; it's similar to the sample HDF5 parallel
program:

#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

void readData(const char* filename, int region_index[3],
              int region_count[3], float* flow_field[6])
{
  char attributes[6][50];
  sprintf(attributes[0], "/uvel");
  sprintf(attributes[1], "/vvel");
  sprintf(attributes[2], "/wvel");
  sprintf(attributes[3], "/pressure");
  sprintf(attributes[4], "/temp");
  sprintf(attributes[5], "/OH");

  herr_t status;
  hid_t file_id;
  hid_t dset_id;
  // open the file collectively with the MPI-IO file driver
  hid_t acc_tpl = H5Pcreate(H5P_FILE_ACCESS);
  status = H5Pset_fapl_mpio(acc_tpl, MPI_COMM_WORLD, MPI_INFO_NULL);
  file_id = H5Fopen(filename, H5F_ACC_RDONLY, acc_tpl);
  status = H5Pclose(acc_tpl);
  for (int i = 0; i < 6; ++i)
  {
    // open dataset
    dset_id = H5Dopen(file_id, attributes[i], H5P_DEFAULT);

    // get dataset space and its full extent
    hid_t spac_id = H5Dget_space(dset_id);
    hsize_t htotal_size3[3];
    status = H5Sget_simple_extent_dims(spac_id, htotal_size3, NULL);
    hsize_t region_size3[3] = {htotal_size3[0] / region_count[0],
                               htotal_size3[1] / region_count[1],
                               htotal_size3[2] / region_count[2]};

    // select this process's hyperslab in the file
    hsize_t start[3] = {region_index[0] * region_size3[0],
                        region_index[1] * region_size3[1],
                        region_index[2] * region_size3[2]};
    hsize_t count[3] = {region_size3[0], region_size3[1], region_size3[2]};
    status = H5Sselect_hyperslab(spac_id, H5S_SELECT_SET, start, NULL,
                                 count, NULL);
    hid_t memspace = H5Screate_simple(3, count, NULL);

    // collective read
    hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER);
    status = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);

    flow_field[i] = (float *) malloc(count[0] * count[1] * count[2] *
                                     sizeof(float));
    status = H5Dread(dset_id, H5T_NATIVE_FLOAT, memspace, spac_id,
                     xfer_plist, flow_field[i]);

    // clean up
    H5Dclose(dset_id);
    H5Sclose(memspace);
    H5Sclose(spac_id);
    H5Pclose(xfer_plist);
  }
  H5Fclose(file_id);
}

*Do you see any problem with this function? I am new to parallel HDF5.*

Thanks in advance!

···

Hi Chrisyeshi,

Are the region_index & region_count the same on all processes? I.e., are you just reading the same data on all processes?

Mohamad

···

The region_index changes according to the MPI rank, while the region_count
stays the same at 16,16,16.
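
Roughly, the mapping from MPI rank to region_index looks like this (a
simplified sketch; the variable names and the file name are illustrative,
not my exact code):

#include <mpi.h>

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int region_count[3] = {16, 16, 16};   /* fixed 16 x 16 x 16 decomposition */
int region_index[3];
region_index[0] =  rank / (region_count[1] * region_count[2]);  /* slowest-varying dimension */
region_index[1] = (rank /  region_count[2]) % region_count[1];
region_index[2] =  rank %  region_count[2];                     /* fastest-varying dimension */

float* flow_field[6];
readData("flow.h5", region_index, region_count, flow_field);    /* "flow.h5" is a placeholder */

/* With 4096 ranks every block is read exactly once; with fewer ranks only
   the first N of the 4096 blocks are touched. */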

···

Hi Yucong,

Ok, I just needed to make sure that the selection on each process is set up
in a way that is compatible with the scaling being done (as the number of
processes increases, the selection of each process shrinks accordingly).
The performance numbers you provided are indeed troubling, but that could
be for several reasons, some being:

  * The stripe size & count of your file on Lustre could be too small.
    Although this is a read operation (no file locking is done by the
    OSTs), increasing the number of I/O processes puts too much burden on
    the OSTs. Could you check those two parameters of your file? You can
    do that by running this on the command line:
      o lfs getstripe filename | grep stripe
  * The MPI-I/O implementation is not doing aggregation. If you are
    using ROMIO, two-phase I/O should do this for you, and it sets the
    number of aggregators to the number of nodes (not processes) by
    default. I would also try increasing cb_buffer_size (the default is
    4 MB); see the sketch below for how to pass such hints.
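
For reference, here is a minimal sketch of how readData could open the file
with such MPI-I/O hints instead of passing MPI_INFO_NULL. The hint names
are standard ROMIO hints; the values are only illustrative and should be
tuned for your system:

/* Minimal sketch: forward ROMIO hints to the MPI-IO file driver. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_cb_read", "enable");     /* keep collective buffering on for reads */
MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB aggregation buffer (default is 4 MB) */

hid_t acc_tpl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(acc_tpl, MPI_COMM_WORLD, info);   /* instead of MPI_INFO_NULL */
hid_t file_id = H5Fopen(filename, H5F_ACC_RDONLY, acc_tpl);
H5Pclose(acc_tpl);
MPI_Info_free(&info);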

Thanks,
Mohamad

···

The selection of each process actually stays the same size since the
region_count is not changing.

The result of running "lfs getstripe filename | grep stripe" is:

lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_stripe_offset: 286

Let me confirm with the second question.

···

Ok, let me understand this again:
Your dataset size is constant (no matter what process count you execute
with), and the processes are reading parts of the dataset.
When you execute your program with, say, 16 processes, is the dataset
divided (more or less) equally among the 16 processes? When you increase
the process count to 36, is the dataset divided equally among the 36
processes, meaning that the amount of data each process reads decreases as
you scale, since the file size stays the same?
If not, then you are reading parts of the dataset multiple times as you
scale, which makes the performance degradation expected. This is like
comparing, in the serial case, the performance of 1 read operation to that
of n read operations.
If yes, then move on to the second part: the stripe count (4) is way too
small for a ~1 TB file. Your system administrator should have guidelines on
what the stripe count and size should be for files of a certain size; I
would check that and readjust those parameters accordingly.
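
For example, you could create a directory with a wider stripe and copy the
file into it, along these lines (the stripe count and size here are only
illustrative, the paths are placeholders, and the option letters vary
between Lustre versions, so check the lfs setstripe usage on your system):

lfs setstripe -c 64 -s 4m /scratch/striped_dir    # 64 OSTs, 4 MB stripe size (illustrative)
cp yourfile.h5 /scratch/striped_dir/
lfs getstripe /scratch/striped_dir/yourfile.h5 | grep stripe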

Thanks,
Mohamad

···

Ok, the total data size is constant, and I am dividing it into 4096 parts
no matter how many processes I use, so the dataset is fully read only with
4096 processes. If I am only using 16 processes, only 16 of the 4096 parts
are read.

Does that clarify what I am doing here?

···

Hi Yucong,

Ok, I understand now; thanks for clarifying.
But then, since you are reading more data in total as you scale, you will
probably see slower performance, especially if the selections of the
processes are non-contiguous in the file.
The stripe size & count are also major issues you need to address, as I
mentioned in my previous email.
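
To see where the time is actually going, it is also worth timing the
H5Dread calls separately from the total run time. A sketch of what could be
added around the H5Dread in your readData loop, using only MPI_Wtime and a
reduction (the variable names are the ones from your function):

    double t0 = MPI_Wtime();
    status = H5Dread(dset_id, H5T_NATIVE_FLOAT, memspace, spac_id,
                     xfer_plist, flow_field[i]);
    double t_local = MPI_Wtime() - t0;

    /* report the slowest rank, since a collective read finishes only when
       the slowest participant finishes */
    double t_max;
    int rank;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
      printf("dataset %d: slowest H5Dread took %.2f s\n", i, t_max);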

Mohamad

···

The documentation for the system I am using only describes how to change
the stripe size and stripe count; it doesn't give guidelines on what they
should be. What would be common stripe count and stripe size values for a
~1 TB file?

···

On Wed, May 30, 2012 at 1:32 PM, Mohamad Chaarawi [via hdf-forum] < ml-node+s184993n4023424h8@n3.nabble.com> wrote:

Hi Yucong,

On 5/30/2012 3:00 PM, Yucong Ye wrote:

Ok, the total data size is constant, and I am dividing it to 4096 parts no
matter how many processes I use, so the dataset is fully read only with
4096 processes. If I am only using 16 processes, the dataset will only be
read 16 parts out of 4096 parts.

Does that clarify what I am doing here?

ok I understand now.. thanks for clarifying this..
But again, since you are reading more data as you scale, you will
probably get slower performance, especially if your selections for all
processes are non-contiguous in file.
The stripe size & count are also major issues you need to address as I
mentioned in my previous email.

Mohamad

On May 30, 2012 12:49 PM, "Mohamad Chaarawi" <[hidden email]<http://user/SendEmail.jtp?type=node&node=4023424&i=0>> > wrote:

The selection of each process actually stays the same size since the
region_count is not changing.

Ok, let me understand this again:
Your dataset size is constant (no matter what process count you execute
with), and processes are reading parts of the dataset.
When you are executing your program with say 16 processes, is your
dataset being divided equally (to some extent) among the 16 procs? When you
increase your process count to 36, is the dataset being divided equally
among 36 processes, meaning that the amount of data that a process reads
decreases as you scale, since the file size is the same?
If not, then this means you are reading parts of the dataset multiple
times as you scale, which makes the performance degradation expected. This
is like comparing the performance, in the serial case, of 1 read operation
to n read operations.
If yes, then move on to the second part..

the result of running "lfs getstripe filename | grep stripe" is:

  lmm_stripe_count: 4
  lmm_stripe_size: 1048576
lmm_stripe_offset: 286

The stripe count is way too small for ~1 TB byte.. your system
administrator should have some guidelines on what the stripe count and size
should be for certain file sizes. I would check that, and readjust those
parameters accordingly.

Thanks,
Mohamad

  Let me confirm with the second question.

On Wed, May 30, 2012 at 11:01 AM, Mohamad Chaarawi [via hdf-forum] <[hidden >> email] <http://user/SendEmail.jtp?type=node&node=4023160&i=0>> wrote:

Hi Yucong ,

On 5/30/2012 12:33 PM, Yucong Ye wrote:

The region_index changes according to the mpi rank while the
region_count stays the same, which is 16,16,16.

Ok, I just needed to make sure that the selections for each process
are done such that it is compatible with scaling being done (as the number
of processes increase, the selection of each process decreases
accordingly).. The performance numbers you provided are indeed troubling,
but it could be for several reasons, some being:

   - The stripe size & count of your file on Lustre could be too small.
   Although this is a read operation (no file locking is done by the OSTs),
   increasing the number of io processes puts too much burden on the OSTs.
   Could you check those 2 parameters of your file? you can do that by running
   this on the command line:
    - lfs getstripe filename | grep stripe
       - The MPI-I/O implementation is not doing aggregation. If you
   are using ROMIO, two phase should do this for you which sets the default to
   the number of nodes (not processes). I would also try and increase the
   cb_buffer_size (default is 4MBs).

Thanks,
Mohamad


On 5/30/2012 5:27 PM, chrisyeshi wrote:

The documentation of the system I am using only describes how to change the stripe size and the stripe count; it doesn't give guidelines on what the values should be. What would be reasonable stripe count and stripe size values for a ~1TB file?

I would go with the maximum available stripe count. You can experiment with
the stripe size; maybe 32 MB would be good.
Increasing ROMIO's cb_buffer_size through an MPI Info hint is also worth trying.
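For example, something along these lines would pass the hint when opening the file (a minimal sketch; the function name, the 32 MB value, and the commented-out cb_nodes line are only illustrative, not from the original program):

#include <mpi.h>
#include <hdf5.h>

/* Sketch: open the file with ROMIO collective-buffering hints instead of
   MPI_INFO_NULL.  The hint values are only starting points to experiment with. */
hid_t open_with_hints(const char *filename)
{
  MPI_Info info;
  MPI_Info_create(&info);
  /* collective buffer size per aggregator: 32 MB instead of the 4 MB default */
  MPI_Info_set(info, "cb_buffer_size", "33554432");
  /* number of aggregator nodes; ROMIO picks a default if this is not set */
  /* MPI_Info_set(info, "cb_nodes", "64"); */

  hid_t acc_tpl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(acc_tpl, MPI_COMM_WORLD, info);
  hid_t file_id = H5Fopen(filename, H5F_ACC_RDONLY, acc_tpl);

  H5Pclose(acc_tpl);
  MPI_Info_free(&info);  /* H5Pset_fapl_mpio keeps its own copy of the info object */
  return file_id;
}

Whether these hints actually help depends on the MPI-IO implementation on the Cray, so treat the values as starting points and check your site documentation as well.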

Mohamad

···


Thank you very much!

I changed the stripe count to the maximum and left the stripe size at 1 MB, and I
could finish reading the dataset with 4096 processes in 20 minutes.
That's good enough for me. When I have time, I might test different
stripe sizes too.

Thanks again!
