RFC: new functions for setting the raw dataset chunk cache

A new Request for Comments (RFC) on new functions for setting individual chunk cache parameters for each dataset in HDF5 has just been published at http://www.hdfgroup.org/pubs/rfcs/RFC_chunk_cache_functions.pdf.

The HDF Group is currently soliciting feedback on this RFC. Community comments will be one of the factors considered by The HDF Group in making the final design and implementation decisions.

Comments may be sent to nfortne2@hdfgroup.org.

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

It's very good that it will become possible to change the chunk cache size without having to reopen the entire file, and that a separate cache can be kept per dataset.
I have a few questions/remarks though:

1. Why is it necessary to reopen the dataset to apply a new chunk cache size? I would think that the chunk cache size is a property of the dataset, so a change should take effect immediately.
The problem I see is that to determine a good cache size you first have to know the size and chunk size of the dataset, which requires it to be opened. Thereafter it must be closed and opened again to set the cache size.
It's not a major problem, but it does not seem logical to me.

2. I think that traversal using some cursor size and access order is common practice. It would be very nice if HDF5 itself determined the required cache size given a cursor size and access order. So a function like:
    H5Dset_chunk_cache (hid_t dataset_id, size_t *cursor_size, size_t *axes, size_t naxes)
The last argument is not needed if you require that cursor_size and axes have a length equal to the dimensionality of the dataset. E.g. to traverse a 3-dim dataset by vector in the Y direction (in pseudo code):
   H5Dset_chunk_cache (did, [1,ny,1], [1,0,2], 3);
A maximum cache size argument should probably be added to avoid enormous caches.

3. If the cache cannot be made big enough for optimal traversal, a preemption policy that keeps chunks in the cache would be nice. E.g. a dataset of [100,100,100] with chunks of [10,10,10] and a cursor of [1,100,1], traversed in the order Y,X,Z, requires 100 chunks to be kept in the cache: each Y vector crosses 10 chunks, and those chunks are revisited for every X and Z position in the same chunk slab, so the whole slab of 10x10 chunks must stay resident. If a cache of only 75 chunks can be made, it is much better to keep 75 of those chunks in the cache for as long as they are needed (and always read the other 25 chunks) than to evict in a round-robin way. I don't know if this can be accommodated.

4. Why does a hash of the chunk addresses need to be made? I assume that chunks are indexed 0..N-1, so a flat block of N entries can do the job without hashing. On today's machines such a block of maybe a few MBytes shouldn't be a problem.

5. It would be nice if HDF5 kept cache statistics per dataset, so I could get the number of cache misses and hits and the number of actual reads and writes to see if the cache size is adequate.

Cheers,
Ger

Ger,

Thank you for your feedback. The goal of these new features is to add new ways of setting the existing chunk caching parameters, not to change the parameters themselves or the underlying caching algorithm. That said, I will keep your suggestions in mind in case we rework the chunk caching algorithm in the future. See below for more specific comments.
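
For concreteness, here is a minimal sketch of the kind of usage the RFC enables, using the property-list based H5Pset_chunk_cache function it proposes. The parameter values are placeholders, not recommendations, and error checking is omitted:

    #include "hdf5.h"

    /* Sketch: open one dataset with its own chunk cache of 521 hash
     * slots and 16 MiB, with preemption weight w0 = 0.75. */
    hid_t open_with_cache(hid_t file_id, const char *name)
    {
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 521, 16 * 1024 * 1024, 0.75);
        hid_t dset = H5Dopen2(file_id, name, dapl);
        H5Pclose(dapl);
        return dset;
    }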

Quoting Ger van Diepen <diepen@astron.nl>:

> It's very good that it will become possible to change the chunk cache size without having to reopen the entire file, and that a separate cache can be kept per dataset.
> I have a few questions/remarks though:
>
> 1. Why is it necessary to reopen the dataset to apply a new chunk cache size? I would think that the chunk cache size is a property of the dataset, so a change should take effect immediately.
> The problem I see is that to determine a good cache size you first have to know the size and chunk size of the dataset, which requires it to be opened. Thereafter it must be closed and opened again to set the cache size.
> It's not a major problem, but it does not seem logical to me.

Internally it is not possible to change the cache parameters of an open dataset, so making this change would require reworking that code, which is beyond the scope of the changes proposed.
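
As an illustration of the resulting workflow, an untested sketch under the same assumptions as above (a 3-D dataset of doubles, error checking omitted):

    #include "hdf5.h"

    /* Open to inspect the chunk layout, then close and reopen with a
     * cache sized for 100 chunks; the settings only apply on open. */
    hid_t reopen_with_tuned_cache(hid_t file_id, const char *name)
    {
        hid_t dset = H5Dopen2(file_id, name, H5P_DEFAULT);
        hid_t dcpl = H5Dget_create_plist(dset);
        hsize_t cdims[3];
        H5Pget_chunk(dcpl, 3, cdims);
        H5Pclose(dcpl);
        H5Dclose(dset);                 /* must close before re-opening */

        size_t nbytes = (size_t)cdims[0] * cdims[1] * cdims[2]
                        * sizeof(double) * 100;
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 1009, nbytes, 0.75);
        dset = H5Dopen2(file_id, name, dapl);
        H5Pclose(dapl);
        return dset;
    }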

> 2. I think that traversal using some cursor size and access order is common practice. It would be very nice if HDF5 itself determined the required cache size given a cursor size and access order. So a function like:
>     H5Dset_chunk_cache (hid_t dataset_id, size_t *cursor_size, size_t *axes, size_t naxes)
> The last argument is not needed if you require that cursor_size and axes have a length equal to the dimensionality of the dataset. E.g. to traverse a 3-dim dataset by vector in the Y direction (in pseudo code):
>    H5Dset_chunk_cache (did, [1,ny,1], [1,0,2], 3);
> A maximum cache size argument should probably be added to avoid enormous caches.

This is an interesting idea, but the optimal cache parameters also depend on the order in which cursor_size blocks are accessed, as well as the specific hardware/file driver/etc. being run. A truly adaptive chunk cache would again require changes to that algorithm.
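
To show what the user-side calculation involves, here is a hypothetical helper (not part of HDF5): it computes how many chunks a single chunk-aligned cursor placement intersects, which is only a lower bound on the useful cache size; the access order can multiply it further, as your remark 3 illustrates:

    #include <stddef.h>

    /* Lower bound on the chunks one chunk-aligned cursor placement
     * touches: for cursor [1,100,1] and chunks [10,10,10] this gives
     * 1*10*1 = 10; the Y,X,Z access order raises the real need to 100. */
    static size_t min_cache_chunks(const size_t *cursor_size,
                                   const size_t *chunk_size, size_t ndims)
    {
        size_t nchunks = 1;
        for (size_t i = 0; i < ndims; i++)
            nchunks *= (cursor_size[i] + chunk_size[i] - 1) / chunk_size[i];
        return nchunks;
    }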

> 3. If the cache cannot be made big enough for optimal traversal, a preemption policy that keeps chunks in the cache would be nice. E.g. a dataset of [100,100,100] with chunks of [10,10,10] and a cursor of [1,100,1], traversed in the order Y,X,Z, requires 100 chunks to be kept in the cache: each Y vector crosses 10 chunks, and those chunks are revisited for every X and Z position in the same chunk slab, so the whole slab of 10x10 chunks must stay resident. If a cache of only 75 chunks can be made, it is much better to keep 75 of those chunks in the cache for as long as they are needed (and always read the other 25 chunks) than to evict in a round-robin way. I don't know if this can be accommodated.

Again this is a good idea but beyond the scope of this work.

> 4. Why does a hash of the chunk addresses need to be made? I assume that chunks are indexed 0..N-1, so a flat block of N entries can do the job without hashing. On today's machines such a block of maybe a few MBytes shouldn't be a problem.

Again this would be beyond the scope of this work. Such a scheme could also reduce the flexibility of the library for very large datasets, where the user may want to set a small chunk size and a flat index block would grow very large.
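
To make the trade-off concrete, a hypothetical sketch of the flat lookup you suggest (not HDF5 code): indexing is trivial, but the table needs one slot per chunk, so e.g. a 1 TiB dataset with 64 KiB chunks has 2^24 chunks, i.e. 128 MiB of pointers on a 64-bit machine:

    #include <stddef.h>

    /* Hypothetical flat chunk index: one slot per chunk, addressed by
     * the row-major linearization of the chunk coordinates. */
    static size_t chunk_index(const size_t *coord,   /* chunk coords    */
                              const size_t *nchunks, /* chunks per axis */
                              size_t ndims)
    {
        size_t idx = 0;
        for (size_t i = 0; i < ndims; i++)
            idx = idx * nchunks[i] + coord[i];
        return idx;    /* slot in a flat table with one entry per chunk */
    }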

> 5. It would be nice if HDF5 kept cache statistics per dataset, so I could get the number of cache misses and hits and the number of actual reads and writes to see if the cache size is adequate.

This looks like an interesting idea. I will look into the possibility of implementing something like this. Thanks for your suggestions!
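
Purely as a sketch of what such an interface might look like (everything here is hypothetical, not an HDF5 API):

    /* Hypothetical per-dataset cache statistics. */
    typedef struct {
        size_t hits;    /* accesses served from the chunk cache */
        size_t misses;  /* accesses that had to fetch a chunk   */
        size_t reads;   /* chunks actually read from the file   */
        size_t writes;  /* chunks actually written to the file  */
    } H5D_cache_stats_t;

    /* herr_t H5Dget_chunk_cache_stats(hid_t dset_id,
     *                                 H5D_cache_stats_t *stats); */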

-Neil
