Reset chunk cache

Hi, I am working on a benchmark repository to compare the performance of independent HDF5 reading libraries to the performance of the native C library (repository, web page).

I noticed suspiciously high performance from the C library and suspect that this is related to the chunk cache, which skews the benchmark because it iterates many times over the same dataset.

I want the chunk cache to be active during a single read operation, since it is a performance-relevant feature, but I want it to be reset before the next loop iteration begins. In other words, I want to simulate reading many different datasets even though I have created only one, to keep the file size small.

Is it enough to simply close the dataset and reopen it like this (the code below uses the thin C# wrapper HDF.PInvoke.1.10)?

_datasetId = H5D.open(_fileId, "chunked_btree2", _daplId);

var result = H5D.read(_datasetId, H5T.NATIVE_INT32, H5S.ALL, H5S.ALL, H5P.DEFAULT, _buffer);

if (H5I.is_valid(_datasetId) > 0)
    _ = H5D.close(_datasetId);

Or does this method not help to reset the chunk cache?

Thank you

Yes, closing the dataset ID (as long as it’s the last reference to the ID) will empty and close the associated chunk cache. I’m not sure if you’re considering disabling the cache for the benchmark, but if you are, keep in mind that doing so might affect performance, even compared to starting with an initially empty cache. The size of the cache can also affect performance.
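For reference, the chunk cache can also be sized per dataset through a dataset access property list. Here is a minimal, untested sketch with HDF.PInvoke; the slot count, byte size, and preemption value are illustrative, not recommendations:

```csharp
// Size the chunk cache on a dataset access property list.
// H5P.set_chunk_cache takes: number of hash slots, total cache
// size in bytes, and the preemption policy w0 (0.0 .. 1.0).
var daplId = H5P.create(H5P.DATASET_ACCESS);
_ = H5P.set_chunk_cache(daplId,
        new IntPtr(521),               // hash slots (a prime works well)
        new IntPtr(4 * 1024 * 1024),   // 4 MiB cache
        0.75);                         // preemption policy

var datasetId = H5D.open(_fileId, "chunked_btree2", daplId);
// ... read as before ...
_ = H5D.close(datasetId);
_ = H5P.close(daplId);
```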

It’s also worth mentioning that cached pieces of metadata will not be evicted from the file’s metadata cache, so chunk-index lookups will likely be faster on operations after the first, even if the dataset is closed between operations. To empty the metadata cache you’ll need to close the file.
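So to guarantee that both the metadata cache and the chunk cache start empty on every iteration, the file itself can be reopened per iteration. A sketch along those lines with HDF.PInvoke (the file name, dataset name, and loop structure are placeholders):

```csharp
for (int i = 0; i < iterations; i++)
{
    // Reopening the file discards the metadata cache; reopening the
    // dataset discards its chunk cache.
    var fileId = H5F.open("benchmark.h5", H5F.ACC_RDONLY);
    var datasetId = H5D.open(fileId, "chunked_btree2");

    _ = H5D.read(datasetId, H5T.NATIVE_INT32, H5S.ALL, H5S.ALL, H5P.DEFAULT, _buffer);

    _ = H5D.close(datasetId);
    _ = H5F.close(fileId);
}
```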


I suggest you don’t do this. Instead, dynamically generate a realistic, large test file with many datasets and random data, then delete it when the benchmark is finished. This way you avoid caching effects and other uncertain internal library behavior.
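A rough, untested sketch of generating such a file with HDF.PInvoke (the file name, dataset count, and sizes are made up for illustration):

```csharp
using System;
using System.Runtime.InteropServices;
using HDF.PInvoke;

var fileId = H5F.create("generated.h5", H5F.ACC_TRUNC);
var dims = new ulong[] { 1_000_000 };
var spaceId = H5S.create_simple(1, dims, null);
var random = new Random();

for (int i = 0; i < 100; i++)
{
    // Fill each dataset with random data so compression behaves realistically.
    var data = new int[(int)dims[0]];
    for (int j = 0; j < data.Length; j++)
        data[j] = random.Next();

    var datasetId = H5D.create(fileId, $"dataset_{i:D3}", H5T.NATIVE_INT32, spaceId);

    var handle = GCHandle.Alloc(data, GCHandleType.Pinned);
    _ = H5D.write(datasetId, H5T.NATIVE_INT32, H5S.ALL, H5S.ALL, H5P.DEFAULT,
                  handle.AddrOfPinnedObject());
    handle.Free();

    _ = H5D.close(datasetId);
}

_ = H5S.close(spaceId);
_ = H5F.close(fileId);

// Delete the file once the benchmark run is finished:
// System.IO.File.Delete("generated.h5");
```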

You will still be subject to external filesystem caching, but at least that effect would be equal across test scenarios if you are doing comparisons.

Thanks! I do not want to disable the cache, only start with an empty one for each iteration. I found a way to do this with my benchmark suite, which allows me to do setup and cleanup work before and after each iteration. So for now I open and close the file for each iteration.

Thanks to you as well! For now it works fine for me if I open the file before each iteration and close it afterwards (as written above). The prerequisite for this is that the CPU running the benchmark has an invariant TSC available, which is the case for the machines I use. Without that precise timer it would be problematic to do setup and cleanup work before and after each iteration because of unreliable time measurement.

Yes, caching affects all benchmarks, but I mainly want to benchmark the efficient handling of internal HDF5 structures plus the compression algorithms, not the raw read operation itself, so I think the current setup is OK for my purposes.