Chunk cache size auto-adjustment

Hi,

in our organization the data that we need to store in HDF5
varies widely in size, from small objects of 10-20 bytes to
very large objects of several MB each (typically images). The
chunks that we create for the large objects tend to be large
as well, and they exceed the default HDF5 chunk cache size
(1 MB). That of course means that, with the default settings,
large chunks are never cached in memory.

We cannot reduce our chunk size, as that would lead to far too
many chunks, which causes other sorts of problems. The standard
solution, of course, is to set the chunk cache size to a larger
value when reading data. This does not work well for us because
we have a multitude of tools for HDF5 access - C++, Matlab,
h5py, IDL, etc. - and too many users who would need to be
taught how to change the cache size settings in each of those
tools (which is not always trivial). The only reasonable
solution that I have found so far is to patch the HDF5 sources
to increase the default cache size from 1 MB to 32 MB. That has
its own troubles, of course, because not everyone uses our
patched HDF5 library.
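
For reference, this is roughly what the per-dataset workaround
looks like in the C API; the file name "data.h5" and dataset
path "/images" below are just placeholders:

#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    /* Dataset access property list with a 32 MB chunk cache:
     * 12421 hash slots (just a largish prime; the default is 521),
     * 32 MB of raw data cache, and the default preemption
     * policy of 0.75. */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 12421, 32 * 1024 * 1024, 0.75);

    hid_t dset = H5Dopen2(file, "/images", dapl);

    /* ... read with H5Dread as usual; multi-MB chunks can now
     * stay in the cache between reads ... */

    H5Dclose(dset);
    H5Pclose(dapl);
    H5Fclose(file);
    return 0;
}

Every tool and every user has to repeat some equivalent of this,
which is exactly the education problem described above.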

I think it would be beneficial in cases like ours for HDF5 to
have, by default, an adaptive algorithm that can fit larger
chunks in the cache. Would it be possible to add something like
this to a future HDF5 version? I don't think it has to be
complex; the simplest rule would probably be "make sure that at
least one chunk fits in the cache unless the user provides an
explicit cache size for a dataset". If help is needed I could
try to produce a patch that does this (it will take me some
time to understand the code, of course).
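
To illustrate the rule, here is only a rough sketch of how it
could be approximated today in application code (not a patch to
the library); the helper name is made up:

#include "hdf5.h"

/* Open a dataset, growing the chunk cache so that at least one
 * chunk fits, if the default 1 MB cache is too small for that. */
static hid_t open_with_adaptive_cache(hid_t file, const char *name)
{
    /* Peek at the creation properties to learn the chunk size. */
    hid_t dset  = H5Dopen2(file, name, H5P_DEFAULT);
    hid_t dcpl  = H5Dget_create_plist(dset);
    hid_t dtype = H5Dget_type(dset);

    size_t chunk_bytes = 0;
    if (H5Pget_layout(dcpl) == H5D_CHUNKED) {
        hsize_t dims[H5S_MAX_RANK];
        int rank = H5Pget_chunk(dcpl, H5S_MAX_RANK, dims);
        chunk_bytes = H5Tget_size(dtype);
        for (int i = 0; i < rank; i++)
            chunk_bytes *= (size_t)dims[i];
    }

    H5Tclose(dtype);
    H5Pclose(dcpl);

    if (chunk_bytes > 1024 * 1024) {
        /* The default cache cannot hold even one chunk: reopen
         * with a cache sized to one chunk (grow, never shrink). */
        H5Dclose(dset);
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT,
                           chunk_bytes, H5D_CHUNK_CACHE_W0_DEFAULT);
        dset = H5Dopen2(file, name, dapl);
        H5Pclose(dapl);
    }
    return dset;
}

Doing the equivalent inside the library when a dataset is opened
would make the default behaviour sensible without any change to
user code.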

Thanks,
Andy

Andrei,

I created an issue in our JIRA database.

Thank you!

Elena

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

