I fully agree with Elena that in general you cannot and should not set
a predefined chunk cache size.
However, I do believe that HDF5 could guess the chunk cache size based on
the access pattern, provided the user has not already set it. Usually
the access pattern is regular, so based on the hyperslab being accessed,
the library can assume that the next accesses will be for the next,
similar hyperslabs. A hint parameter could perhaps be used to tell the
library that such hyperslabs will be accessed next. When the hyperslab
shape changes, the user has probably started another access pattern.
Of course, the system can never cater for fully random access, but I
believe that is not used very often. In such a case the user should
always set the cache size explicitly.
One can also think of some higher-level functionality where the user
defines the cursor shape and access pattern, making it possible to size
the cache automatically. Thereafter one can step through the dataset
using a simple next function. This might also enable optimizations
inside HDF5, since the cursor shape and access pattern are known a
priori (for instance, using the chunk shape as the cursor shape when
finding, say, the peak value in a dataset).
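Ger's guessing idea can be made concrete with a little arithmetic: from the chunk shape and the hyperslab just accessed, one could compute how many chunks the next, similar hyperslab will touch and derive a cache-size guess from that. A minimal sketch in plain Python (the function names and the auto-guessing mechanism itself are hypothetical, not an existing HDF5 API):

```python
from math import prod

def chunks_touched(start, count, chunk_shape):
    """Number of chunks a regular hyperslab selection overlaps."""
    n = 1
    for s, c, ch in zip(start, count, chunk_shape):
        first = s // ch            # index of first chunk hit in this dim
        last = (s + c - 1) // ch   # index of last chunk hit in this dim
        n *= last - first + 1
    return n

def guessed_cache_bytes(start, count, chunk_shape, itemsize):
    """Cache just big enough to hold every chunk the hyperslab touches."""
    return chunks_touched(start, count, chunk_shape) * prod(chunk_shape) * itemsize

# A 100 x 1000 float64 hyperslab at (0, 0), dataset chunked as 100 x 100:
# 1 chunk along dim 0 and 10 along dim 1 -> 10 chunks of 80,000 bytes each,
# so the guessed cache size would be 800,000 bytes.
```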
Cheers,
Ger
"David A. Schneider" <davidsch@slac.stanford.edu> 2/16/2016 9:15 PM
Thanks Elena,
After reading the comments at the end, I think I should try writing a
bunch of small 1 MB chunks and see what the read performance is.
However, suppose this leads to 100 times as many chunks: I had the
understanding that too many chunks degrades read performance in other
ways, but maybe it will still be a win.
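As a back-of-the-envelope check (plain Python with made-up sizes, not measurements): shrinking the chunk size by 100x multiplies the chunk count, and therefore the per-chunk index and metadata entries the library must manage, by roughly 100x, even though each individual chunk becomes cheaper to read.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def chunk_count(dataset_bytes, chunk_bytes):
    # Ceiling division: a partially filled edge chunk still counts.
    return -(-dataset_bytes // chunk_bytes)

dataset = 10 * GB
big_chunks = chunk_count(dataset, 100 * MB)  # 103 chunks of 100 MB
small_chunks = chunk_count(dataset, 1 * MB)  # 10240 chunks, ~100x more
```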
Those are good points about leaving the parameters for optimal
performance to the applications, but it would be nice if there were a
mechanism that made the writing application responsible for this, or at
least let it provide hints that the HDF5 library could decide whether
to honor. Then, if I am producing an h5 file that a scientist will use
through a high-level h5 interface, the scientist can communicate the
reading access pattern, and I can translate it into a chunk layout for
writing and dataset chunk cache parameters for reading.
best,
David
Hi David and Filipe,
Chunking and compression are powerful features that boost performance
and save space, but, as you rightfully noted, they lead to performance
issues when not used correctly.
We did discuss the solution you proposed and voted against it. While it
may be reasonable to increase the current default chunk cache size from
1 MB to ???, it would be unwise for the HDF5 library to use a chunk
cache size equal to a dataset's chunk size. We decided to leave it to
applications to determine the appropriate chunk cache size and
strategies (for example, use H5Pset_chunk_cache instead of H5Pset_cache,
or disable the chunk cache completely!).
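For readers following along, H5Pset_chunk_cache takes three per-dataset parameters: rdcc_nslots (hash table slots; the HDF5 documentation suggests a prime number roughly 10 to 100 times the number of chunks that fit in the cache), rdcc_nbytes (total cache size in bytes), and rdcc_w0 (chunk preemption policy; values near 1.0 favor evicting fully read chunks). A sketch of computing candidate values in plain Python; the helper names are mine, and only the three parameter names come from the HDF5 API:

```python
def next_prime(n):
    """Smallest prime >= n (trial division is fine at these sizes)."""
    def is_prime(k):
        if k < 2:
            return False
        d = 2
        while d * d <= k:
            if k % d == 0:
                return False
            d += 1
        return True
    while not is_prime(n):
        n += 1
    return n

def chunk_cache_params(chunk_bytes, chunks_to_hold, chunks_revisited=False):
    rdcc_nbytes = chunk_bytes * chunks_to_hold
    rdcc_nslots = next_prime(100 * chunks_to_hold)  # prime, ~100x cached chunks
    rdcc_w0 = 0.75 if chunks_revisited else 1.0     # 1.0: evict fully read chunks first
    return rdcc_nslots, rdcc_nbytes, rdcc_w0

# Hold ten 1 MB chunks: nslots = 1009 (first prime >= 1000),
# nbytes = 10 * 1048576, w0 = 1.0.
```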
Here are several reasons:
1. A chunk size can be pretty big because it worked well when the data
was written, but it may not work well for reading applications. An HDF5
application will use a lot of memory when working with such files,
especially if many files and datasets are open. We see this scenario
very often when users work with collections of HDF5 files (for example,
NPP satellite data; the attached paper discusses one of those use
cases).
2. Making the chunk cache size the same as the chunk size will only
solve the performance problem when the data being written or read
belongs to one chunk. This is not usually the case. Suppose you have a
row that spans several chunks. When an application reads one row at a
time, it will not only use a lot of memory because the chunk cache is
now big, but it will also hit the same performance problem you
described in your email: the same chunk will be read and discarded many
times.
The way to deal with the performance problem is to adjust the access
pattern or to use a chunk cache that holds as many of the chunks
involved in the I/O operation as possible. The HDF5 library doesn't
know this a priori, which is why we left it to applications. At this
point we don't see how we can help except by educating our users.
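The row example can be quantified with simple arithmetic (plain Python; the shapes are illustrative assumptions): if each row spans N chunks and the cache holds fewer than N of them, a row-by-row scan re-reads every chunk once per row instead of once per chunk.

```python
def chunk_reads(n_rows, chunk_rows, chunks_per_row, cache_chunks):
    """Chunk reads for a row-by-row scan of a 2-D chunked dataset."""
    if cache_chunks >= chunks_per_row:
        # A whole band of chunks stays cached: each chunk is read once.
        return -(-n_rows // chunk_rows) * chunks_per_row
    # Cache thrashes: every chunk in the band is re-read for every row.
    return n_rows * chunks_per_row

# 10000 x 10000 dataset, 100 x 100 chunks: each row spans 100 chunks.
# Cache holding >= 100 chunks: 100 bands * 100 chunks = 10,000 reads.
# Cache holding only 1 chunk: 10000 rows * 100 = 1,000,000 reads.
```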
I am attaching a white paper that will be posted on our Website; see
section 4. Comments are highly appreciated.
Thank you!
Elena
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
On 02/14/16 16:55, Elena Pourmal wrote:
Twitter: https://twitter.com/hdf5