Global cache size

Hello HDF5 developers!

Currently, the HDF5 library provides two ways to control the cache size when accessing datasets:

* H5Pset_cache / H5Pget_cache
* H5Pset_chunk_cache / H5Pget_chunk_cache

The former controls the default cache buffer size for all datasets, while the latter allows fine-tuning the cache buffer size on a per-dataset basis.
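
To illustrate how the two levels combine today, here is a minimal sketch (the file name "data.h5" and dataset name "/dset" are placeholders; error checking omitted):

    #include "hdf5.h"

    int main(void)
    {
        /* File-level default: a 50 MB raw-data chunk cache for every dataset
           opened through this file access property list.  The second argument
           (mdc_nelmts) is ignored by recent HDF5 versions. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_cache(fapl, 0, 521, 50 * 1024 * 1024, 0.75);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);

        /* Per-dataset override: give this one dataset a 200 MB chunk cache;
           the DEFAULT macros keep the file-level hash table size and w0. */
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT,
                           200 * 1024 * 1024, H5D_CHUNK_CACHE_W0_DEFAULT);

        hid_t dset = H5Dopen2(file, "/dset", dapl);

        /* ... read/write ... */

        H5Dclose(dset);
        H5Pclose(dapl);
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }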

This scheme works nicely in many cases. However, working with bigger, multi-dataset HDF5 files reveals a considerable flaw. Cache is a way to trade memory for speed. How much memory one would trade naturally depends on the total memory available, i.e. memory is a (scarce) global resource. Thus, more often than not it is desirable to set a *global* cache size for *all* HDF5 datasets, regardless of the number of datasets (and even files) open.

E.g., I'd like to be able to say "Use no more than 1 GB of memory for caching" instead of "Use no more than 50 MB of memory for caching each dataset". The latter is not as useful as the former, as the number of open datasets may vary greatly.
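
The closest one can get today is to divide a total budget by hand across the datasets about to be opened. A sketch only (the helper name and its parameters are made up for illustration; error checking omitted):

    #include "hdf5.h"

    /* Illustration only: spread a fixed total chunk-cache budget evenly
       over the datasets we are about to open. */
    static void open_with_cache_budget(hid_t file, const char **names,
                                       size_t n, size_t total_bytes,
                                       hid_t *out)
    {
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        /* Each dataset gets an equal share of the budget; hash table
           size and w0 fall back to the file-level settings. */
        H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT,
                           total_bytes / n, H5D_CHUNK_CACHE_W0_DEFAULT);
        for (size_t i = 0; i < n; ++i)
            out[i] = H5Dopen2(file, names[i], dapl);
        H5Pclose(dapl);
    }

This only works if the number of datasets is known up front, and it does not adapt when more datasets or files are opened later.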

Currently there seems to be no way to impose a global cache size limit. Would it be hard to implement such a feature in one of the future versions?

Thank you for your work,
Andrey Paramonov


Hi Andrey,

just to mention, there are more buffers and caches involved with HDF5 datasets, for instance the sieve buffer:

https://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetSieveBufSize

It was this one that gave me memory headaches at some point, though it seems to be solved in the current HDF5 version.
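
For reference, it is also controlled through the file access property list. A minimal sketch ("data.h5" is just a placeholder; error checking omitted):

    #include "hdf5.h"

    int main(void)
    {
        /* Cap the sieve buffer, which the library uses internally to batch
           small raw-data I/O requests, at 1 MB for files opened with this
           file access property list. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_sieve_buf_size(fapl, 1024 * 1024);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);
        /* ... dataset I/O ... */
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }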

A global cache value would make sense and be convenient, possibly combined with a setting for how much to prioritize each of the individual cache sizes.

        Werner


--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Hi Andrey and Werner,

Thank you for your input. A global cache is one of the improvements we are considering for the library to address this problem. We have more and more applications that open and process multiple files and datasets, and you are absolutely correct that memory becomes an issue in this case.

There is one thing to remember: the chunk cache is important when a chunk is compressed and is accessed multiple times. If this is not the case (for example, the application always reads a subset that contains whole chunks), one can disable the chunk cache completely to reduce the application's memory footprint.
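
For example, a minimal sketch of opening a dataset with the chunk cache disabled (file and dataset names are placeholders; error checking omitted):

    #include "hdf5.h"

    int main(void)
    {
        /* A zero-sized chunk cache: chunks are transferred directly and not
           retained in memory, which keeps the footprint small when every
           chunk is touched only once. */
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 0, 0, H5D_CHUNK_CACHE_W0_DEFAULT);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/dset", dapl);
        /* ... whole-chunk reads ... */
        H5Dclose(dset);
        H5Fclose(file);
        H5Pclose(dapl);
        return 0;
    }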

Elena


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


On 17.05.2015 19:47, Elena Pourmal wrote:

Thank you for your input. A global cache is one of the improvements we are considering for the library to address this problem. We have more and more applications that open and process multiple files and datasets, and you are absolutely correct that memory becomes an issue in this case.

That's good news!

Interface-wise, it seems that a pair of additional routines, H5Pset_global_chunk_cache / H5Pget_global_chunk_cache, with arguments as in H5Pset_chunk_cache / H5Pget_chunk_cache, could be introduced. If H5Pset_global_chunk_cache were called, both limits (per dataset and on the total cache size) would take effect.
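
For concreteness, the prototypes might look something like this (purely hypothetical, of course; nothing of the sort exists in HDF5 today):

    #include "hdf5.h"   /* for herr_t */

    /* Hypothetical prototypes only, mirroring H5Pset_chunk_cache /
       H5Pget_chunk_cache; these functions do NOT exist in HDF5. */
    herr_t H5Pset_global_chunk_cache(size_t rdcc_nslots, size_t rdcc_nbytes,
                                     double rdcc_w0);
    herr_t H5Pget_global_chunk_cache(size_t *rdcc_nslots, size_t *rdcc_nbytes,
                                     double *rdcc_w0);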

The intricate question is how to handle the rdcc_w0 parameter (different datasets may have different values).

Another implementation strategy might be to introduce "dataset groups" which share the same cache. This mechanism could be made mutually exclusive with H5Pset_chunk_cache / H5Pget_chunk_cache, so that the dataset-specific rdcc_nslots, rdcc_nbytes and rdcc_w0 lose their effect if a cache group is enabled (at dataset open time). It might bring more flexibility, but might be harder to document and use.
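
Sketched as prototypes, it might look like this (again invented names; nothing like this exists in HDF5):

    #include "hdf5.h"   /* for hid_t, herr_t */

    /* Hypothetical: create a shared cache with a total budget, then let
       datasets join it through their dataset access property list,
       overriding the per-dataset rdcc_* settings.  Not an HDF5 API. */
    hid_t  H5CGcreate(size_t rdcc_nslots, size_t rdcc_nbytes, double rdcc_w0);
    herr_t H5Pset_cache_group(hid_t dapl_id, hid_t cache_group_id);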

There is one thing to remember: the chunk cache is important when a chunk is compressed and is accessed multiple times. If this is not the case (for example, the application always reads a subset that contains whole chunks), one can disable the chunk cache completely to reduce the application's memory footprint.

This is clear. My typical workflow involves multiple accesses to the same chunks.

Thank you for your work on the HDF5 library,
and best wishes,
Andrey Paramonov
