chunk cache

Hi Francesc,

I realize we have very different needs and are in a way talking about different things.

You apply the chunk cache to pytables which is effectively a 1-dim array of records. The size of each record can be different. Usually access to it is linear or random. You don't use the chunking mechanism of the datasets, so you do not need to assemble the data array from various chunks and can use a zero-copy strategy (provided the endianness, etc. match).
The size of the chunk cache is hard to predict for random access, so in a way any size may do and LRU is usually the best mechanism.

On the other hand I want to store an N-dim array of fixed sized data types (integer, real, complex) in a chunked way which requires that when reading a data array it is assembled from the various chunks (thus a lot of copying). Given an access pattern it is possible to predict the optimal size of the cache. If the cache is too small, MRU is the best way to deal with the cache because it will keep chunks in the cache, while LRU will have removed the ones you are needing.
E.g. an array of 100x100 chunked in 10x10 chunks. You need 10 chunks when accessing the array by vector (in x or y). If the cache can contain only 9 chunks, you can see that MRU can keep 8 or 9 chunks in the cache while LRU will thrash.
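To make this concrete, a small Python simulation of this access pattern (just a sketch, nothing HDF5-specific) shows the difference between LRU and MRU eviction:

# Toy cache simulator for the example above: a 100x100 array in 10x10
# chunks is read vector by vector, so 10 successive vector reads touch
# the same band of 10 chunks, while the cache can hold only 9 of them.

def hit_ratio(policy, capacity=9, chunks_per_vector=10, n_vectors=10):
    cache = []                  # chunk ids, most recently used at the end
    hits = total = 0
    for _ in range(n_vectors):
        for chunk in range(chunks_per_vector):   # same 10 chunks each time
            total += 1
            if chunk in cache:
                hits += 1
                cache.remove(chunk)              # refresh recency
            elif len(cache) >= capacity:
                if policy == "LRU":
                    cache.pop(0)                 # evict least recently used
                else:                            # "MRU"
                    cache.pop()                  # evict most recently used
            cache.append(chunk)                  # now the most recent entry
    return float(hits) / total

for policy in ("LRU", "MRU"):
    print("%s hit ratio: %.2f" % (policy, hit_ratio(policy)))

This prints a hit ratio of 0.00 for LRU (it thrashes) and about 0.81 for MRU (8 or 9 of the 10 chunks stay in the cache).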

I think that the HDF5 chunking mechanism is primarily meant for the latter case, but it seems that the chunk cache came in as some kind of an afterthought (but I might be wrong).
I think a single chunk cache mechanism could serve both cases well, but it might be somewhat hard to implement.
Something to think about for Quincey.
I'm afraid it'll take quite a while before a more flexible caching mechanism is added to HDF5. Until then HDF5 might be of limited use to us.

As an aside.
There is an image processing package called Karma (written by Richard Gooch) that takes a different approach: it uses mmap to access a chunked data array (so the OS takes care of caching). Getting array element [k,j,i] is done as array[ xoffset[i] + yoffset[j] + zoffset[k] ].
For any chunk size you can precalculate the 3 offset vectors. You can even have chunks within chunks (outer chunks map to disk pages, inner chunks to processor cache). This scheme is very useful when doing array access in all dimensions around a pixel (e.g. to find the shape of an object), but also works well for other access patterns.
But mmap has its limitations in file size, although that is not a problem anymore on 64-bit systems.
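In Python terms the offset-vector idea can be sketched like this (my own reconstruction of the scheme, not Karma code; the shapes are just an example):

import numpy as np

# A 3-D array of shape (NZ, NY, NX) stored chunk by chunk (row-major
# chunk order, row-major inside each chunk).  The flat position of
# element [k, j, i] splits into three terms that each depend on a single
# index, so they can be precomputed once per chunk shape.
NZ, NY, NX = 40, 60, 80          # array shape (divisible by the chunks)
CZ, CY, CX = 10, 10, 10          # chunk shape
ncy, ncx = NY // CY, NX // CX    # number of chunks along y and x
chunk_vol = CZ * CY * CX

zoffset = np.array([(k // CZ) * ncy * ncx * chunk_vol + (k % CZ) * CY * CX
                    for k in range(NZ)])
yoffset = np.array([(j // CY) * ncx * chunk_vol + (j % CY) * CX
                    for j in range(NY)])
xoffset = np.array([(i // CX) * chunk_vol + (i % CX)
                    for i in range(NX)])

# Build the chunked (flat) layout from a reference array and check that
# flat[xoffset[i] + yoffset[j] + zoffset[k]] == ref[k, j, i].
ref = np.arange(NZ * NY * NX).reshape(NZ, NY, NX)
flat = (ref.reshape(NZ // CZ, CZ, ncy, CY, ncx, CX)
           .transpose(0, 2, 4, 1, 3, 5)       # chunk indices first
           .ravel())
k, j, i = 17, 42, 63
assert flat[xoffset[i] + yoffset[j] + zoffset[k]] == ref[k, j, i]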

Cheers,
Ger

···


On Saturday 22 March 2008, Francesc Altet wrote:

Here are my figures:
                                     No compression   Zlib compression
No LRU cache (reads from OS cache):  17.0 Krec/s      6.5 Krec/s
LRU cache speedup (0.999 eff.):      +500%            +1300%
LRU cache speedup (0.975 eff.):      +340%            +1000%
LRU cache speedup (0.479 eff.):      +14%             +47%
LRU cache speedup (0.000 eff.):      -47%             -14%

A quick update. After this discussion I realized that when a cache is
very inefficient (< 0.2), its contents can be happily dropped, which
should give better performance for this case (less time spent on useless
lookups) and, besides, leads to better memory usage (the cache is
emptied). I've implemented this, and here are my results for the
inefficient case (the other cases should not be affected):

                                     No compression   Zlib compression
No LRU cache (reads from OS cache):  17.0 Krec/s      6.5 Krec/s
LRU cache speedup (0.000 eff.):      -19%             -4%

As you can see, these percentages are much better than before. So, it
seems like the adaptive size cache is a good idea indeed.
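The idea, in pseudo-Python (just a sketch of the concept, not the actual PyTables Pro code), is to keep track of the observed hit ratio and, once it falls below the threshold, empty the cache and bypass further lookups:

class DroppableLRU(object):
    """LRU chunk cache that disables itself when it proves inefficient."""

    def __init__(self, capacity, min_efficiency=0.2, probe=1000):
        self.capacity = capacity
        self.min_efficiency = min_efficiency
        self.probe = probe            # re-check efficiency every N accesses
        self.cache = {}               # key -> chunk
        self.order = []               # keys, least recently used first
        self.hits = self.accesses = 0
        self.disabled = False

    def get(self, key, read_chunk):
        """Return the chunk for `key`, calling `read_chunk(key)` on a miss."""
        if self.disabled:
            return read_chunk(key)
        self.accesses += 1
        if key in self.cache:
            self.hits += 1
            self.order.remove(key)
            self.order.append(key)            # refresh recency
            value = self.cache[key]
        else:
            value = read_chunk(key)
            if len(self.cache) >= self.capacity:
                self.cache.pop(self.order.pop(0))   # evict the LRU entry
            self.cache[key] = value
            self.order.append(key)
        if self.accesses % self.probe == 0:
            if float(self.hits) / self.accesses < self.min_efficiency:
                self.cache.clear()     # free the memory ...
                self.order = []
                self.disabled = True   # ... and stop doing useless lookups
        return value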

Regards,

···

--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"


BTW. Compression is no option for us. The data are floating point and
noisy, but may contain signal. Each bit is important.

? You must be thinking of lossy compression like JPEG files, where some
of the information is lost. HDF5 uses zlib or szip which are both
lossless compression techniques -- what you get back is exactly what you
put in, it's just that it stores it in a format that takes up less
space, at the cost of more computing work to compress/decompress it.

···


Hi Francesc,

Hi Quincey,

On Tuesday 25 March 2008, you wrote:

After some experiments, I've noticed some things that I'd like to bring
to your attention:

1. The chunk cache in HDF5 seems to keep the chunks compressed. Why is
this? One of the nice things about a cache in HDF5 itself should be to
keep chunks uncompressed so that the retrieval is much faster (around 3x
when using zlib). Besides, the OS already caches compressed chunks; so
having this info replicated, and not obtaining a measurable speed-up,
seems a waste of space.

  No, the HDF5 library caches uncompressed chunks.

There is something that I don't understand in my experiments, then.
I've repeated them today, and I can confirm them. My setup is reading
entire chunks (8184 bytes) of a chunked table (with 100000 24-byte
records, for a total of 2.4 MB, or 341 chunks) out of an HDF5 file.

Here are my results:

Using no compression
~~~~~~~~~~~~~~~~~~~~
- Disabling the internal caches (just using the OS filesystem cache):
21.0 Kchunks/sec

- Using an LRU cache for chunks
Efficiency   HDF5 1.8.0   PyTables Pro   PyTables Pro (read-only)
0.000        18.0         16.0           16.0
0.479        18.5         23.5           28.0
0.975        20.5         57.0           93.0
0.999        21.0         82.0           117.

Using compression (zlib level 1 + shuffle)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Disabling the internal caches (just using the OS filesystem cache):
7.60 Kchunks/sec

- Using an LRU cache for chunks
Efficiency   HDF5 1.8.0   PyTables Pro   PyTables Pro (read-only)
0.000        7.60         6.70           6.90
0.479        7.40         11.3           12.2
0.975        6.90         50.3           77.0
0.999        7.00         79.0           112.

All the figures in the above tables are in Kchunks per second. PyTables
Pro means the LRU cache used in it (I've disabled the HDF5 cache in this
case). The read-only qualifier means that the data delivered is meant
for read-only purposes (i.e. it is not memcpy-ed into user buffers). The
sizes of the caches in both HDF5 and PyTables have been set to 1 MB and
128 slots (except for the 0.000-efficiency entry, where the number of
slots has been reduced to 12), and I've chosen the chunks to be selected
following a normal distribution with different sigmas in order to
achieve different efficiencies.
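Roughly speaking, the chunk selection is drawn like in the following sketch (just an illustration of the idea, not the actual benchmark code):

import numpy as np

# Chunk numbers are drawn from a normal distribution centred on one
# chunk.  A small sigma keeps the reads clustered, so the cache
# efficiency is high; a large sigma approaches uniform random access
# and the efficiency drops towards zero.
def chunk_sequence(nchunks=341, nreads=100000, sigma=5.0, seed=1):
    rng = np.random.RandomState(seed)
    picks = rng.normal(loc=nchunks // 2, scale=sigma, size=nreads)
    return np.clip(np.rint(picks), 0, nchunks - 1).astype(int)

for sigma in (1.0, 10.0, 1000.0):
    seq = chunk_sequence(sigma=sigma)
    print("sigma=%6.1f -> %3d distinct chunks touched" % (sigma, len(set(seq))))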

As you can see, the performance of the chunk cache in HDF5 (1.8.0, but
1.6.7 seems to behave similarly) when using compression is much worse
than when not using it (contrary to the PyTables Pro cache, which is
affected only to a small extent, as expected). That's why I thought that
the HDF5 chunk cache was keeping the data compressed :-/

  Interesting... I don't have a good reason for this behavior though. :-(  I'll keep it in mind when I'm working on the chunk cache and see if I can find out what's going on.

2. When retrieving entire chunks from the cache, the HDF5 chunk cache
does not seem very efficient and, in fact, it is slower than if the
cache is not active, even for access patterns with high spatial
locality. My guess is that the chunk cache in HDF5 is more oriented
towards accelerating the reading of small data buckets than of complete
chunks. I don't know, but perhaps adding a parameter to H5Pset_cache to
indicate the typical size of the data to be retrieved could be useful
for optimizing the size of the cache and its internal structures.

  If the chunk cache is not large enough to hold at least one chunk,
this sometimes happens. It's one of the effects that I'm going to
try to mitigate with my forthcoming chunk caching improvements.

Well, what I meant is that when retrieving complete chunks (in the
benchmark above, the chunksize was around 8 KB and the cache size 1 MB,
so it has capacity for up to 128 chunks) the performance of the HDF5
chunk cache is quite poor (and effectively slower than if the cache is
disabled). You can see this effect in the tables above.

I'm rather mystified about this, but I'm quite sure that I'm
enabling/disabling the HDF5 chunk cache correctly (I can see better
speed-ups with retrievals smaller than a chunk). I was using the
following code to enable/disable the cache in HDF5:

hid_t access_plist, file_id;

access_plist = H5Pcreate(H5P_FILE_ACCESS);
/* H5Pset_cache(access_plist, 0, 0, 0, 0.0); */  /* disable the chunk cache */
H5Pset_cache(access_plist, 0, 128, 128*8*1024, 0.0);  /* 1 MB, 128 slots */
file_id = H5Fopen(name, H5F_ACC_RDONLY, access_plist);

So, I'm wondering if perhaps something is not working as it should in
the chunk cache when asking for relatively large data buckets (but,
still, much smaller than cache size).

  I don't know, but I'll keep this in mind also.

3. When retrieving large data from caches (one or several chunks), much
time is spent copying the data to user buffers, but in many cases the
data doesn't need a fresh data container (for example, when it will only
be used for calculations, which is a very common situation). So, maybe
adding the concept of read-only data when delivering it to the user
could be useful in order to accelerate the access to data in caches.
I've tested this with our implementation, and the access can be up to
40% faster than if you had to provide a copy.

  This is an interesting idea, but it would require the library to
have a supply of "read only" buffers to loan to the application, which
would then be responsible for checking them back in with the HDF5
library. I'm not certain most users would like this model...

Yeah, I thought of this after sending the message. However, it is
fortunate that PyTables users can benefit from NumPy containers, which
support the concept of read-only buffers right out of the box. Given
the rather large improvement in speed that one can expect, I'll look at
integrating the read-only caches in the next version of PyTables Pro.

  I didn't mean to imply that I was going to implement this! :-)  It's just something to add to the HDF5 "wish list"...

  Quincey

···

On Mar 26, 2008, at 1:17 PM, Francesc Altet wrote:

Cheers,

--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"


On Wednesday 26 March 2008, Ger van Diepen wrote:

Doesn't the nature of chunking force you to make a copy? I.e. the
user will expect an array (section) in natural order, while the data
are chunked in the HDF5 dataset, so you have to copy the various bits
and pieces to rearrange the data.

Well, that depends on your application and the flexibility of your tools to deal with complex data structures.

For example, an important structure in PyTables (an HDF5 library for the Python language) is the table (indeed ;), which is implemented on top of HDF5 as a unidimensional chunked dataset of a compound type (which can be made of multidimensional subtypes, but that's not interesting now). When doing selections with the so-called in-kernel iterator, the chunks of a table are read into an internal buffer, which is passed through the expression evaluator that computes the boolean results for the query on this chunk (the chunks are actually grouped in buffers, but this is not interesting for this discussion). Based on these results, the iterator decides which rows have to be passed to the user.

Incidentally, the rows can actually be containers offering *views* of the internal buffers, so data copies would not be required at all in this step. The user can then decide what to do with the rows that the iterator is handing to him. If all he wants is to use the data for new computations, the rows handed to the user can perfectly well be read-only containers. On the contrary, if the user wants to overwrite the row containers, the only thing he has to do is copy them into other containers (because the former were read-only).

In this example, you can see how data that is hosted in an internal cache can make its way to user space without any copies at all.
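In NumPy terms, the read-only view idea looks roughly like the following sketch (the actual PyTables buffers are more involved; the dtype and numbers here are made up):

import numpy as np

# Pretend this is an internal buffer holding one chunk's worth of rows.
row_t = np.dtype([('lati', np.float64), ('long', np.float64),
                  ('pressure', np.float64)])
internal_buffer = np.zeros(341, dtype=row_t)

# Hand a zero-copy *view* of some rows to the user, marked read-only so
# that the cached buffer cannot be corrupted behind the cache's back.
view = internal_buffer[10:20]
view.flags.writeable = False
assert view.base is internal_buffer        # no copy was made

try:
    view['pressure'] = 1.0                 # the user tries to modify it
except ValueError:
    print("read-only view: the internal buffer is protected")

# If the user really wants to modify the data, an explicit copy is the way:
private = view.copy()
private['pressure'] = 1.0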

BTW. Compression is no option for us. The data are floating point and
noisy, but may contain signal. Each bit is important.

I was speaking about *lossless* compression. Or do you mean that a compressor like zlib is not able to compress your data at all?
If this is the case, have you tried applying the shuffle filter first? It generally helps compressors shrink their inputs further.
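For example, in PyTables enabling zlib plus shuffle is just a matter of passing a Filters instance when creating the table (a minimal sketch using the PyTables 2 API; the file name, table and description below are made up):

import numpy as np
import tables

class Sample(tables.IsDescription):
    lati = tables.Float64Col()
    long = tables.Float64Col()
    pressure = tables.Float64Col()

# zlib level 1 plus the shuffle filter, as suggested above.
filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)

h5file = tables.openFile('noisy.h5', mode='w')
table = h5file.createTable('/', 'readings', Sample, filters=filters)
row = table.row
for _ in range(1000):
    row['lati'] = np.random.uniform(-90, 90)
    row['long'] = np.random.uniform(-180, 180)
    row['pressure'] = np.random.normal(200, 50)   # noisy floating point
    row.append()
table.flush()
h5file.close()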

Cheers,

···

--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"

On Wednesday 26 March 2008, Quincey Koziol wrote:

>> This is an interesting idea, but it would require the library to
>> have a supply of "read only" buffers to loan to the application,
>> which
>> would then be responsible for checking them back in with the HDF5
>> library. I'm not certain most users would like this model...
>
> Yeah. I thought this after sending the message. However, it is
> fortunate that PyTables users can benefit from NumPy containers
> that support the concept of read-only buffers right-out-of-the-box.
> Provided the rather large improvement in speed that one can expect,
> I'll look at integrating the read-only caches in next version of
> PyTables Pro.

  I didn't mean to imply that I was going to implement this! :-) It's
just something to add to the HDF5 "wish list"...

Don't be worried, Quincey ;-) What I was saying is that I'm considering
replacing the HDF5 chunk cache with *our own* chunk cache with the
read-only feature, at least in sensible places (like index datasets,
where performance is critical).

Cheers!

···

--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"


Hi Ger,

On Friday 28 March 2008, Ger van Diepen wrote:

Hi Francesc,

I realize we have very different needs and are in a way talking about
different things.

Indeed. There are many, many different scenarios when using a library
as flexible as HDF5.

You apply the chunk cache to pytables which is effectively a 1-dim
array of records. The size of each record can be different.

No, PyTables uses the same record size for all the entries in a table
(this is in order to achieve maximum efficiency in terms of data access
and computations).

Usually access to it is linear or random.

That's correct.

You don't use the chunking mechanism of the datasets,

Not true. PyTables does use chunking extensively, especially in table
datasets.

so you do not need to assemble the data
array from various chunks and can use a zero-copy strategy (provided
the endianness, etc. match).

Well, I realised that the example I gave yesterday was a bit simplistic,
and in fact I do need to assemble data in the data processing pipeline
(mainly for getting contiguous datasets and hence accelerating
computations). But inside PyTables itself there are situations where
this zero-copy strategy is actually used. My point is that a read-only
cache can accelerate things in *some* situations (but not always, of
course). And in these cases, this acceleration can be very beneficial
for an application.

The size of the chunk cache is hard to
predict for random access, so in a way any size may do and LRU is
usually the best mechanism.

Well, I don't quite grasp what you mean here, but the main reason why
PyTables uses LRU caches internally is that it is normally used
interactively for data mining purposes, and a frequent usage pattern is
to refine queries. For example, a typical sequence of queries in
PyTables could be:

1. [(row['lati'], row['long']) for row in table.where('pressure<200')]

2. [(row['lati'], row['long']) for row in table.where('(pressure>100) & (pressure<200)')]

3. [(row['lati'], row['long']) for row in table.where('((pressure<200) & (pressure>100)) & ((temperature>10) & (temperature<15))')]

So, there is no doubt that an LRU cache for internal structures is our
best bet here.

On the other hand I want to store an N-dim array of fixed sized data
types (integer, real, complex) in a chunked way which requires that
when reading a data array it is assembled from the various chunks
(thus a lot of copying). Given an access pattern it is possible to
predict the optimal size of the cache. If the cache is too small, MRU
is the best way to deal with the cache because it will keep chunks in
the cache, while LRU will have removed the ones you are needing. E.g.
an array of 100x100 chunked in 10X10 chunks. You need 10 chunks when
accessing the array by vector (in x or y). If the cache can contain 9
chunks only, you can see that MRU can keep 8 or 9 chunks in the cache
while LRU will thrash.

I entirely agree that what is good for one application may not be good
for another, and you have given a good example where an LRU scheme
performs really badly (but again, you always have the OS cache backing
you up, so the situation is not that bad ;-).

I think that the HDF5 chunking mechanism is primarily meant for the
latter case, but it seems that the chunk cache came in as some kind
of an afterthought (but I might be wrong). I think a single chunk
cache mechanism can serve both cases well, but might be somewhat
hard. Something to think about for Quincey.
I'm afraid it'll take quite a while before a more flexible caching
mechanism is added to HDF5. Until then HDF5 might be of limited use
to us.

Well, having flexible caching in HDF5, supporting different methods for
different access patterns, would be absolutely great. However, as you
already know, caching is a very tricky thing, and providing a method
suited to everyone's needs in a general library like HDF5 would
represent a fair amount of work (this is more a matter for the HDF5
crew and the amount of resources they can afford to put into it, but
frankly, I'm not convinced that this would be a critical thing to do).

IMHO, if a user has a desperate need to optimize some particular access
pattern that is not efficiently implemented in HDF5, he should
carefully analyze it and come up with a sensible solution (whether
implemented by himself or borrowed from elsewhere).

As an aside.
There is an image processing package called Karma (written by Richard
Gooch) using a different approach. It uses mmap to access a chunked
data array (so the OS takes care of caching). Getting array element
[k,j,i] is done as array[ xoffset[i] + yoffset[j] + zoffset[k] ]. For
any chunk size you can precalculate the 3 offset vectors. You can
even have chunks within chunks (outer chunks map to disk pages, inner
chunks to processor cache). This scheme is very useful when doing
array access in all dimensions around a pixel (e.g. to find the shape
of an object), but also works well for other access patterns. But
mmap has its limitations in file size, although that is not a problem
anymore on 64-bit systems.

I've also done some experiments with mmap, and frankly, I've found HDF5
to be far superior in terms of flexibility and performance (if used
correctly).

OTOH, letting the OS take care of caching in an HDF5 app is as simple
as disabling the internal chunk caching in HDF5. I don't know Karma, but
if you are using HDF5, you can still play with chunkshapes in order to
allow the OS filesystem cache to get the most from your access pattern.
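For instance, with PyTables something like this quick sketch sets an explicit chunkshape when creating a chunked array (assuming the PyTables 2 createCArray API; the file name and shapes are just an example):

import tables

h5file = tables.openFile('cube.h5', mode='w')
atom = tables.Float32Atom()
shape = (400, 400, 400)

# Chunks elongated along the last axis favour reading x-vectors; a more
# cubic chunkshape favours access in all directions around a point.
carray = h5file.createCArray(h5file.root, 'cube', atom, shape,
                             chunkshape=(1, 1, 400))
print(carray.chunkshape)
h5file.close()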

However, as Quincey already said (and it is also my experience), you
should keep in mind that you can always expect more performance from an
application-specific cache than from the OS cache. See, for instance,
the benchmarks that I sent to this list yesterday, where the PyTables
Pro internal cache can be more than 5x faster than the OS cache (15x if
using compression).

Finally, let me say that, in my experience, HDF5 is a *very* good piece
of software and, perhaps more importantly, very well maintained, which
is critical for data persistence apps. I've been very happy using it
for more than 5 years now and have never been disappointed with it.
Of course, it is not perfect (nothing is), but do not underrate it
because, say, it doesn't have provision for your specific access
pattern. Complain, implement a new cache system, contribute it. But
be sure to ponder the complete set of features that HDF5 offers
before deciding whether it is for you or not.

Regards,

···

--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"


Hi Jason,

True, but it can compress only a bit. I've gzipped such a file and saved less than 5%. Not worth it. bzip2 saves even less.
What is worse is that (AFAIK) HDF5 needs new file space if a compressed chunk grows after being changed. Thereafter the old space is lost, certainly after closing the file. So your file can actually grow in such cases. I guess compression should preferably be used with write-once data.

Cheers,
Ger

"Jason Sachs" <jsachs@dekaresearch.com> 03/26/08 3:13 PM >>>

BTW. Compression is no option for us. The data are floating point and
noisy, but may contain signal. Each bit is important.

? You must be thinking of lossy compression like JPEG files, where some
of the information is lost. HDF5 uses zlib or szip which are both
lossless compression techniques -- what you get back is exactly what you
put in, it's just that it stores it in a format that takes up less
space, at the cost of more computing work to compress/decompress it.

···


If your applications are writing these data as part of runs that take a
relatively long time, you might consider using the h5repack tool, which
will reclaim lost space. You can also set compression / shuffling /
checksum filters on your data with the tool.

Dave McCloskey

···

-----Original Message-----
From: Ger van Diepen [mailto:diepen@astron.nl]
Sent: Wednesday, March 26, 2008 10:39 AM
To: Jason Sachs
Cc: hdf-forum@hdfgroup.org
Subject: RE: chunk cache

Hi Jason,

True, but it can compress only a bit. I've gzipped such a file and saved
less than 5%. Not worth it. bzip2 saves even less.
What is worse is that (AFAIK) HDF5 needs new file space if a compressed
chunk grows after being changed. Thereafter the old space is lost,
certainly after closing the file. So your file can actually grow in such
cases. I guess compression should preferably be used with write-once
data.

Cheers,
Ger

"Jason Sachs" <jsachs@dekaresearch.com> 03/26/08 3:13 PM >>>

BTW. Compression is no option for us. The data are floating point and
noisy, but may contain signal. Each bit is important.

? You must be thinking of lossy compression like JPEG files, where some
of the information is lost. HDF5 uses zlib or szip which are both
lossless compression techniques -- what you get back is exactly what you
put in, it's just that it stores it in a format that takes up less
space, at the cost of more computing work to compress/decompress it.


Hi Ger,

AFAIK the file will only grow significantly if you close the file before
rewriting. Otherwise, since for compression you must also use chunking, the
old chunks should be overwritten. Also, you can check around for lossless
algorithms for your type of data. There exist some particular compression
schemes suitable for particular data structures, like meshes for example.

BTW very interesting thread

BR

-- dimitris

···

2008/3/26, Ger van Diepen <diepen@astron.nl>:

Hi Jason,

True, but it can compress only a bit. I've gzipped such a file and saved
less than 5%. Not worth it. bzip2 saves even less.
What is worse is that (AFAIK) HDF5 needs new file space if a compressed
chunk grows after being changed. Thereafter the old space is lost, certainly
after closing the file. So your file can actually grow in such cases. I
guess compression should preferably be used with write-once data.

Cheers,
Ger

>>> "Jason Sachs" <jsachs@dekaresearch.com> 03/26/08 3:13 PM >>>

>BTW. Compression is no option for us. The data are floating point and
>noisy, but may contain signal. Each bit is important.

? You must be thinking of lossy compression like JPEG files, where some
of the information is lost. HDF5 uses zlib or szip which are both
lossless compression techniques -- what you get back is exactly what you
put in, it's just that it stores it in a format that takes up less
space, at the cost of more computing work to compress/decompress it.


--
- What is the difference between mechanical engineers and civil engineers?
Mechanical engineers build weapons civil engineers build targets.
- Good health is merely the slowest possible rate at which one can die.
- Life is like a jar of jalapeño peppers. What you do today, might Burn Your
Butt Tomorrow