H5Dread performance issue with non-contiguous hyperslab selections

Hi,

I think I've come across a performance issue with H5Dread when reading
non-contiguous hyperslab selections. The use case in my software is a bit
complicated, so instead I came up with a small example that shows the same
issue. Please let me know if I'm missing something here, it's possible
that a different approach could be much better.

In my example I write a 2D native int chunked dataset to an HDF5 file
(adapted from the h5_extend example, now writes a 229 MB file). I then
construct a hyperslab selection of the dataset and read it back using a
single call to H5Dread. When I use a stride of 1 (so all elements of the
selection are contiguous) the read is very fast. However, when I set the
stride to 2, the read slows down dramatically, on the order of 15 times.

The dataset has a chunk shape of 1000x500, and the 0th dimension is the one
being tested with a stride of 1 and 2. Is this a typical slowdown seen
with a stride of 2? If the chunk size is 1000, then strides of 1 and 2
still need to read the same raw data from disk, so I would expect similar
performance.

I've run the stride of 2 scenario under Valgrind (using the callgrind tool)
for profiling and it shows that 95% of the time is being spent in
H5S_select_iterate (I can share the callgrind output if it helps), which
makes this program CPU bound, not I/O bound.

I'm using an up-to-date version of HDF5 trunk checked out from
Subversion. I looked at the callback H5D__chunk_io_init() used by
H5S_select_iterate(). I noticed that there are two different approaches
taken, one for the case where the shape of the memory space is the same as
the dataspace, and another if the shapes are different. The performance
drop I've noticed appears to be for the latter case.

Any ideas on how to optimize this function or otherwise increase the
performance of this use case?

Thanks,
Chris LeBlanc


--

Here is the example code. I wrote this mail earlier with the code as an
attachment, but it hasn't appeared on the mailing list, so I'm trying
again with the code inline:

/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * Copyright by The HDF Group.                                              *
 * Copyright by the Board of Trustees of the University of Illinois.        *
 * All rights reserved.                                                     *
 *                                                                          *
 * This file is part of HDF5. The full HDF5 copyright notice, including     *
 * terms governing use, modification, and redistribution, is contained in   *
 * the files COPYING and Copyright.html. COPYING can be found at the root   *
 * of the source code distribution tree; Copyright.html can be found at the *
 * root level of an installed copy of the electronic HDF5 document set and  *
 * is linked from the top-level documents page. It can also be found at     *
 * http://hdfgroup.org/HDF5/doc/Copyright.html. If you do not have          *
 * access to either file, you may request a copy from help@hdfgroup.org.    *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

/*
* This example shows how to work with extendible datasets. The dataset
* must be chunked in order to be extendible.
*
* It is used in the HDF5 Tutorial.
*/

// Modified example of h5_extend.c to show the performance difference
// between reading with a stride of 1 vs 2:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "hdf5.h"

#define FILE "extend.h5"
#define DATASETNAME "ExtendibleArray"
#define RANK 2

void write_file() {
    hid_t file; /* handles */
    hid_t dataspace, dataset;
    hid_t filespace, memspace;
    hid_t cparms;

    hsize_t dims[2] = {20000, 3000};   /* dataset dimensions at creation time */
    hsize_t maxdims[2] = {H5S_UNLIMITED, H5S_UNLIMITED};
    herr_t status;
    hsize_t chunk_dims[2] = {1000, 500};
    int *data = calloc(dims[0]*dims[1], sizeof(int));

    /* Create the data space with unlimited dimensions. */
    dataspace = H5Screate_simple (RANK, dims, maxdims);

    /* Create a new file. If file exists its contents will be overwritten. */
    file = H5Fcreate (FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Modify dataset creation properties, i.e. enable chunking */
    cparms = H5Pcreate (H5P_DATASET_CREATE);
    status = H5Pset_chunk (cparms, RANK, chunk_dims);

    /* Create a new dataset within the file using cparms
       creation properties. */
    dataset = H5Dcreate2 (file, DATASETNAME, H5T_NATIVE_INT, dataspace,
                         H5P_DEFAULT, cparms, H5P_DEFAULT);

    status = H5Sclose (dataspace);

    /* Write data to dataset */
    status = H5Dwrite (dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                       H5P_DEFAULT, data);

    /* Close resources */
    status = H5Dclose (dataset);
    status = H5Fclose (file);
    status = H5Pclose (cparms);
    free(data);

}

void read_file(hsize_t dim1_stride, hsize_t dim2_stride) {

    /* Variables used in reading data back */
    hid_t file;
    hid_t dataset;
    hid_t filespace, memspace;
    hsize_t dimsr[2];
    hsize_t memspace_dims[2];
    int *datar;
    hsize_t mem_offsets[2] = {0, 0};
    hsize_t strides[2] = {dim1_stride, dim2_stride};
    hsize_t count[2];
    herr_t status_n;

    file = H5Fopen (FILE, H5F_ACC_RDONLY, H5P_DEFAULT);
    dataset = H5Dopen2 (file, DATASETNAME, H5P_DEFAULT);

    filespace = H5Dget_space (dataset);

    status_n = H5Sget_simple_extent_dims (filespace, dimsr, NULL);

    memspace_dims[0] = dimsr[0] / strides[0];
    memspace_dims[1] = dimsr[1];
    memspace = H5Screate_simple (RANK, memspace_dims, NULL);

    count[0] = dimsr[0] / strides[0];
    count[1] = dimsr[1];

    // core of this test: a hyperslab with varying stride:
    H5Sselect_hyperslab (filespace, H5S_SELECT_SET, mem_offsets, strides,
                         count, NULL);

    datar = calloc(memspace_dims[0]*memspace_dims[1], sizeof(int));

    printf ("reading with stride = %d, memspace_dims: %d %d, count: %d %d\n",
            (int) strides[0], (int) memspace_dims[0], (int) memspace_dims[1],
            (int) count[0], (int) count[1]);

    time_t t1 = time(NULL);
    int status = H5Dread (dataset, H5T_NATIVE_INT, memspace, filespace,
                      H5P_DEFAULT, datar);

    time_t t2 = time(NULL);
    printf ("done reading with stride = %d, time = %d (nearest sec)\n",
            (int) strides[0], (int) (t2 - t1));

    status = H5Dclose (dataset);
    status = H5Sclose (filespace);
    status = H5Sclose (memspace);
    status = H5Fclose (file);
    free(datar);
}

int main (void)
{
    write_file();
    read_file(1, 1);
    read_file(2, 1);
}

Chris,

This is a known problem.

We entered the issue you reported into our database to make sure it is on the radar for HDF5 improvements. Unfortunately, fixing the general case will require substantial effort, but we will probably be able to improve performance for some simple patterns like yours.

We plan to rework the hyperslab selection algorithm (going from O(n^2) to O(1)) in HDF5 1.8.11. This should help, but there will still be cases where performance is bad because we are "touching every pixel" while building a general selection.

Bottom line: if you want good I/O, don't use non-contiguous selections ;-(

Elena


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Nov 6, 2012, at 6:37 PM, Chris LeBlanc wrote:

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hi Elena,

Thanks for the info, and sorry for the double-post. It's good to know it's
being looked at.

I did find that performance is excellent with non-contiguous hyperslab
selections, as long as the two selections are the same. It will be a bit
messy, but I think I can implement my read pattern by reading smaller
subsets of the large dataset instead of the whole thing at once. This
will let me use identical selections and hopefully improve performance.
The downsides are that it will use more RAM than it currently does, it'll
call H5Dread many more times, and I'll have to do more memcpy calls.
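The bookkeeping for that blocked-read approach can be sketched as plain C, using the dimensions from the example above (20000 rows, 1000-row chunks, stride 2). This is only an illustration of the partitioning arithmetic, not the final implementation; in the real reader, each block's offset/stride/count would be passed identically to H5Sselect_hyperslab for both the memory space and the file space, followed by one H5Dread per block.

```c
#include <stddef.h>

/* Partition a strided row selection into chunk-aligned blocks so that the
 * SAME offset/stride/count can be handed to H5Sselect_hyperslab for both
 * the memory space and the file space (identical selections are the fast
 * path). Dimensions match the example above; adapt to the real dataset. */
#define NROWS     20000  /* dataset extent in dim 0 */
#define CHUNKROWS 1000   /* chunk extent in dim 0 */
#define STRIDE    2      /* strided read in dim 0 */

typedef struct {
    size_t file_offset;  /* starting row of this block in the dataset */
    size_t count;        /* number of strided rows selected in this block */
} row_block;

/* Fill blocks[] (caller provides NROWS / CHUNKROWS entries) and return the
 * number of blocks. Each block covers exactly one chunk's worth of rows,
 * so each per-block H5Dread touches a single chunk. */
static size_t make_blocks(row_block *blocks)
{
    size_t n = 0;
    for (size_t start = 0; start < NROWS; start += CHUNKROWS) {
        blocks[n].file_offset = start;
        blocks[n].count = CHUNKROWS / STRIDE; /* strided rows per chunk */
        n++;
    }
    return n;
}
```

Reading one chunk's worth of rows per H5Dread keeps each call's selection identical in both spaces, avoiding the slow shape-mismatch path, at the cost of more calls and an extra copy into the final buffer.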

Thanks,
Chris


On Thu, Nov 8, 2012 at 12:35 PM, Elena Pourmal <epourmal@hdfgroup.org> wrote:


I looked into this problem a bit more and found a reasonable workaround.
Hopefully this will be of help to other people in this situation.

I noticed that using H5Dread() with two identical hyperslab selections was
much quicker than with two different selections (e.g. reading from a
non-contiguous dataset selection into a contiguous memory space). My
datasets can be quite large, much bigger than my available RAM, so I was
not able to allocate a buffer the same size as the dataset. However, I
tried using mmap() to make an anonymous memory-mapped buffer the same size
as the dataset on disk. This approach worked very well for my read pattern
and has the advantage that it works with current versions of HDF5.

The disadvantages are that it is limited to 32-bit memory addresses on a
32-bit OS, and it could potentially increase disk reads/writes quite a bit.
My use case is roughly 4-5 times quicker than before. The upcoming
hyperslab optimizations should make this approach even faster.
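The anonymous mapping itself is plain POSIX and can be sketched as below (the buffer size is up to the caller; the resulting pointer would then be passed to H5Dread as the destination buffer, with identical selections on both spaces). The key property is that pages of an anonymous mapping are only materialized on first touch, which is what makes a dataset-sized buffer feasible even when it exceeds physical RAM.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate a dataset-sized anonymous buffer via mmap(). Untouched pages
 * consume no physical memory, so the mapping can be larger than RAM as
 * long as the access pattern doesn't dirty it all at once.
 * Returns NULL on failure. */
static void *alloc_dataset_buffer(size_t nbytes)
{
    void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

static void free_dataset_buffer(void *p, size_t nbytes)
{
    if (p != NULL)
        munmap(p, nbytes);
}
```

On a 64-bit OS this sidesteps the 32-bit address-space limit mentioned above; the kernel pages parts of the buffer in and out as the read pattern touches them.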

Cheers,
Chris


On Thu, Nov 8, 2012 at 4:47 PM, Chris LeBlanc <crleblanc@gmail.com> wrote:

