Environment variable for chunk cache size?


#1

I’ve come across performance problems in a case where chunks are much bigger than the default chunk cache size. The default of 1 MB cache per dataset seems extremely small now that even laptops have multiple GB of RAM, and HPC cluster nodes can have hundreds of GB.

I found a thread about this from a couple of years ago. It looks like having the library try to guess a good cache size is not an option, and I’m OK with that: I’d rather have simple, predictable behaviour even if it is wrong in some cases.

However, I’d like to be able to experiment with different cache sizes without having to recompile the software in question. So I’d propose adding an environment variable to be used like this:

HDF5_CHUNK_CACHE_SIZE=128M

This would override the default, but it could be overridden if the application called H5Pset_cache or H5Pset_chunk_cache. Suffixes K, M or G would multiply the number by the relevant power of 1024 to get a size in bytes.
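
To make the proposed semantics concrete, here is a rough sketch of the parsing and precedence rules. It is only an illustration: HDF5_CHUNK_CACHE_SIZE is the hypothetical variable proposed here (HDF5 does not read it), and the function names parse_size and chunk_cache_size are made up for this example.

```python
import os

# Hypothetical variable name from the proposal; HDF5 does not read it today.
ENV_VAR = "HDF5_CHUNK_CACHE_SIZE"
# The 1 MB per-dataset default mentioned in the post.
DEFAULT_CACHE_BYTES = 1024 * 1024

# Suffixes are powers of 1024, as proposed.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a string such as '128M' into a number of bytes."""
    text = text.strip().upper()
    multiplier = 1
    if text and text[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1]]
        text = text[:-1]
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"invalid size: {text!r}")
    return int(text) * multiplier


def chunk_cache_size(explicit=None, environ=os.environ):
    """Resolve the cache size: application setting, then env var, then default."""
    if explicit is not None:
        # Stands in for the application calling H5Pset_cache / H5Pset_chunk_cache.
        return explicit
    value = environ.get(ENV_VAR)
    if value is not None:
        return parse_size(value)
    return DEFAULT_CACHE_BYTES
```

With this sketch, parse_size("128M") gives 134217728 bytes, and an explicit size passed by the application takes precedence over whatever the environment says.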


#2

Hi Thomas!

03.07.2018 18:49, Thomas Kluyver wrote:

I’ve come across performance problems in a case where chunks are much
bigger than the default chunk cache size. The default of 1 MB cache per
dataset seems extremely small now that even laptops have multiple GB of
RAM, and HPC cluster nodes can have hundreds of GB.

I found a thread about this from a couple of years ago
(https://forum.hdfgroup.org/t/chuck-cache-size-proposal/3684). It looks
like having the library try to guess a good cache size is not an option,
and I’m OK with that: I’d rather have simple, predictable behaviour even
if it is wrong in some cases.

Please see also:
http://hdf-forum.184993.n3.nabble.com/Global-cache-size-td4027913.html

However, I’d like to be able to experiment with different cache sizes
without having to recompile the software in question. So I’d propose
adding an environment variable to be used like this:

HDF5_CHUNK_CACHE_SIZE=128M

This would override the default, but it could be overridden if the
application called H5Pset_cache or H5Pset_chunk_cache. Suffixes K, M
or G would multiply the number by the relevant power of 1024 to get a
size in bytes.

That’s one neat idea! However, I’m pretty sure it will break some
existing applications relying on the defaults…

Best wishes,
Andrey Paramonov


#3

Thanks Andrey! I did see your thread as well, and I think a global cache limit makes sense. For the use case I’m working with, however, it’s easiest to work with a per-dataset cache size: we are reading a slab at a time from a number of different datasets, and we want to make sure that one chunk of each dataset can always be cached.
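
As a rough sketch of the sizing involved, the smallest per-dataset cache that can hold one chunk is just the chunk's element count times the element size. The dataset names, chunk shapes and element sizes below are invented for illustration, and the helper names are hypothetical.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size of one chunk in bytes."""
    return prod(chunk_shape) * itemsize


def min_cache_bytes(chunk_shape, itemsize, chunks_to_hold=1):
    """Smallest cache size that can hold the requested number of chunks."""
    return chunk_nbytes(chunk_shape, itemsize) * chunks_to_hold


# Hypothetical datasets: name -> (chunk shape, bytes per element).
datasets = {
    "image": ((16, 512, 128), 2),  # e.g. 16-bit integers
    "gain": ((512, 1024), 8),      # e.g. 64-bit floats
}

for name, (shape, itemsize) in datasets.items():
    needed = min_cache_bytes(shape, itemsize)
    print(f"{name}: at least {needed} bytes of chunk cache")
```

Both made-up chunks here are larger than the 1 MB default, which is the situation described above. The resulting number is what one would pass as the cache size in bytes (the rdcc_nbytes argument) to H5Pset_chunk_cache for each dataset.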