Chunk cache size proposal

Hi,

I think the current default chunk cache size behaviour of HDF5 is
inadequate for the type of data it is typically used with.
This is especially a problem when using compression, as most users/readers
will not set any chunk cache and will end up causing endless decompression calls. I think
this hinders the use of compression and other kinds of filters.

I would like to propose that when a dataset is opened, its chunk cache be
set to the larger of the file chunk cache size (the one set with H5Pset_cache)
and the dataset chunk size.

I think this would be beneficial for the vast majority of workloads.

Cheers,
Filipe

I think something like that is a great idea!

David


Hi David and Filipe,

Chunking and compression are powerful features that boost performance and save space, but, if not used correctly (and as you rightfully noted), they lead to performance issues.

We did discuss the solution you proposed and voted against it. While it is reasonable to increase the current default chunk cache size from 1 MB to ???, it would be unwise for the HDF5 library to use a chunk cache size equal to the dataset chunk size. We decided to leave it to applications to determine the appropriate chunk cache size and strategy (for example, use H5Pset_chunk_cache instead of H5Pset_cache, or disable the chunk cache completely!).

Here are several reasons:

1. The chunk size can be pretty big because it worked well when the data was written, but it may not work well for reading applications. An HDF5 application will use a lot of memory when working with such files, especially if many files and datasets are open. We see this scenario very often when users work with collections of HDF5 files (for example, NPP satellite data; the attached paper discusses one of those use cases).

2. Making the chunk cache size the same as the chunk size will only solve the performance problem when the data that is written or read belongs to one chunk. This is not usually the case. Suppose you have a row that spans several chunks. When an application reads one row at a time, it will not only use a lot of memory because the chunk cache is now big, but it will also hit the same performance problem you described in your email: the same chunk will be read and discarded many times.

The way to deal with the performance problem is to adjust the access pattern or to use a chunk cache that contains as many chunks as possible for the I/O operation. The HDF5 library doesn't know this a priori, and that is why we left it to applications. At this point we don't see how we can help except by educating our users.
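For reference, here is a minimal sketch of the per-dataset control mentioned above (H5Pset_chunk_cache on a dataset access property list); the file name, dataset name, and the 64 MB figure are only placeholders:

#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    /* Per-dataset chunk cache: rdcc_nslots is the number of hash slots
     * (ideally a prime, much larger than the number of cached chunks),
     * rdcc_nbytes is the total cache size in bytes, and rdcc_w0 is the
     * preemption policy (0.75 is the library default). */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 10007, 64 * 1024 * 1024, 0.75);

    /* To disable the chunk cache for this dataset, pass rdcc_nbytes = 0. */

    hid_t dset = H5Dopen2(file, "/data", dapl);

    /* ... H5Dread / H5Dwrite as usual; chunks are cached per the DAPL ... */

    H5Dclose(dset);
    H5Pclose(dapl);
    H5Fclose(file);
    return 0;
}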

I am attaching a white paper that will be posted on our Website; see section 4. Comments are highly appreciated.

Thank you!

Elena

TechNote-HDF5-ImprovingIOPerformanceCompressedDatasets.pdf (674 KB)



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks Elena,

After reading the comments at the end, I think I should try writing a bunch of small 1 MB chunks and see what the read performance is. However, suppose this leads to 100 times as many chunks; I had the understanding that too many chunks degrade read performance in other ways, but maybe it will still be a win.

Those are good points about leaving the parameters for optimal performance to the applications, but it would be nice if there were a mechanism that let the writing application be responsible for this, or at least provide hints that the HDF5 library could decide whether it can honour. Then, if I am producing an h5 file that a scientist will use through a high-level h5 interface, the scientist can communicate the reading access pattern, and I can translate it into a chunk layout for writing and dataset chunk cache parameters for reading.
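One way to do something like this today, purely by convention, is sketched below: the writer records a suggested cache size as an attribute and the reader applies it when opening the dataset. The attribute name and the sizes are just illustrations for this sketch, not anything the HDF5 library itself understands.

#include "hdf5.h"

/* Writer side: store a suggested chunk cache size next to the data.
 * "suggested_rdcc_nbytes" is only a naming convention for this sketch. */
static void write_cache_hint(hid_t dset, unsigned long long nbytes)
{
    hsize_t one = 1;
    hid_t aspace = H5Screate_simple(1, &one, NULL);
    hid_t attr = H5Acreate2(dset, "suggested_rdcc_nbytes", H5T_NATIVE_ULLONG,
                            aspace, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_ULLONG, &nbytes);
    H5Aclose(attr);
    H5Sclose(aspace);
}

/* Reader side: look up the hint (falling back to 1 MB) and open the
 * dataset with a matching per-dataset chunk cache. */
static hid_t open_with_cache_hint(hid_t file, const char *name)
{
    unsigned long long nbytes = 1024 * 1024;
    hid_t probe = H5Dopen2(file, name, H5P_DEFAULT);
    if (H5Aexists(probe, "suggested_rdcc_nbytes") > 0) {
        hid_t attr = H5Aopen(probe, "suggested_rdcc_nbytes", H5P_DEFAULT);
        H5Aread(attr, H5T_NATIVE_ULLONG, &nbytes);
        H5Aclose(attr);
    }
    H5Dclose(probe);

    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 10007, (size_t)nbytes, 0.75);
    hid_t dset = H5Dopen2(file, name, dapl);
    H5Pclose(dapl);
    return dset;
}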

best,

David


I fully agree with Elena that in general you cannot and should not set
a predefined chunk cache size.
However, I do believe that HDF5 could guess the chunk cache size based on
the access pattern, provided the user has not already set it. Usually
the access pattern is regular, so based on the hyperslab being accessed,
the library could assume that the next accesses will be for the next similar
hyperslabs. Perhaps a hint parameter could be used to tell it that the next
hyperslabs will be accessed. When the hyperslab shape changes, the user
probably starts another access pattern.
Of course, the system can never cater for fully random access, but I
believe that is not used very often. In such a case the user should
always set the cache size.

One can also think of some higher-level functionality where the user
defines the cursor shape and access pattern, making it possible to size
the cache automatically. Thereafter one can step through the dataset
using a simple next function. Maybe it also makes optimizations in HDF5
possible, since the cursor shape and access pattern are known a priori
(for instance, if the cursor shape is the chunk shape when finding, say,
the peak value in a dataset).

Cheers,
Ger

Random access is common in some use cases where statistics are generated
for a random sample of a dataset.

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

I also agree that the library should not enforce any fixed chunk cache size, but
that was never in question. The issue is finding the best chunk cache
size when the user has not defined one.
It seems we all agree that the current value of 1 MB is outdated.
I also understand that we need to weigh the concern of using too much
memory.

Taking the hyperslab size into account, together with the chunk size, is a
good idea, and it would give us more valuable information for calculating a
better chunk cache value (e.g. the maximum of the chunk size and the hyperslab
size in each dimension).
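To make that heuristic concrete, here is a small sketch of the sizing rule only; the rank, chunk shape, hyperslab shape, and element size are assumed to come from wherever the caller has them:

#include "hdf5.h"

/* Per dimension, take the larger of the chunk extent and the hyperslab
 * extent, count how many chunks that box covers, and size the cache to
 * hold them all.  This is only the heuristic suggested above, not
 * anything the library does today. */
static size_t suggested_cache_bytes(int rank, const hsize_t *chunk,
                                    const hsize_t *slab, size_t elem_size)
{
    hsize_t nchunks = 1, chunk_elems = 1;
    for (int d = 0; d < rank; d++) {
        hsize_t box = chunk[d] > slab[d] ? chunk[d] : slab[d];
        nchunks *= (box + chunk[d] - 1) / chunk[d];   /* chunks covered */
        chunk_elems *= chunk[d];
    }
    return (size_t)(nchunks * chunk_elems * elem_size);
}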

Another possibility would be to give compression filters some more
information about the dataspace and allow them to set the chunk cache, but
that is a discussion for another thread.

The scenario of multiple user reads per chunk is not uncommon. For example,
my datasets contain many images, and to compress them efficiently I
need to chunk them with multiple images per chunk (as the images share common
features). The user usually looks at one image at a time, resulting in
multiple reads per chunk. I don't think such situations are atypical.

Cheers,
Filipe


If HDF5 reads in a big chunk and uncompresses it, it seems like a waste to free that memory before someone else needs it, but I guess if you hold onto it and the user needs a lot of memory you'll block them. I wonder if there is a way to hook into requests for memory and free the uncompressed chunk when someone else needs it. I googled and found __malloc_hook in the GNU C library, but I doubt that is robust or that it hooks into all system requests for memory; it looks like it is just for debugging your program.

An alternative to developing smart automatic defaults to replace the current 1 MB chunk cache might be to set the value through a configure option when installing HDF5, or through an environment variable.
We apply a patch to our HDF5 installation to increase it to 32 MB; in case it is useful to anyone, I put the patch below.

For us, the data a scientist looks at will be a stack of images. Most of the time they go through the stack one image at a time, loading the whole image, but I've also heard of analysis that is effectively random access, and analysis that works with a region of interest.

Here's our patch:

--- src/H5Pfapl.c.orig 2015-05-28 09:01:47.000000000 -0700
+++ src/H5Pfapl.c 2015-06-05 17:16:43.075228397 -0700
@@ -64,7 +64,7 @@
  #define H5F_ACS_DATA_CACHE_NUM_SLOTS_DEF 521
  /* Definition for size of raw data chunk cache(bytes) */
  #define H5F_ACS_DATA_CACHE_BYTE_SIZE_SIZE sizeof(size_t)
-#define H5F_ACS_DATA_CACHE_BYTE_SIZE_DEF (1024*1024)
+#define H5F_ACS_DATA_CACHE_BYTE_SIZE_DEF (32*1024*1024)
  /* Definition for preemption read chunks first */
  #define H5F_ACS_PREEMPT_READ_CHUNKS_SIZE sizeof(double)
  #define H5F_ACS_PREEMPT_READ_CHUNKS_DEF 0.75f
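For anyone who would rather not patch the library, roughly the same 32 MB default can be set at runtime through a file access property list. A sketch (the file name is a placeholder):

#include "hdf5.h"

int main(void)
{
    /* Raise the raw data chunk cache to 32 MB for every dataset opened
     * through this file access property list.  521 slots and 0.75 are the
     * library defaults; the first argument is ignored in HDF5 1.8+. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_cache(fapl, 0, 521, 32 * 1024 * 1024, 0.75);

    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);

    /* ... open and read datasets; each inherits the 32 MB chunk cache
     * unless it has its own H5Pset_chunk_cache setting ... */

    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}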

best,

David
