Chunk cache size and performance

Hi,

I'm doing some small benchmarks to present at a forthcoming workshop, and I'd
be very grateful if someone could shed some light on the performance
figures that I'm getting.

What I want to stress during the workshop is the dependency of I/O throughput
on the chunksize for a certain dataset. To make the plots that I've got
(attached), I have chosen a 2 GB dataset (2-dim, shape (512, 65536),
double-precision datatype) so that it easily fits into my OS cache
memory (my machine has 8 GB), which makes the effects clearer. The X axis
shows the chunksize for each dataset (from 1 KB up to 8 MB); the Y axis
shows the throughput for reading the dataset sequentially.
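For reference, a minimal sketch of this kind of measurement (using h5py for
brevity instead of the actual PyTables script; the file and dataset names
are assumptions):

```python
import time
import h5py

def sequential_read_gbps(path, cache_bytes):
    # rdcc_nbytes sets the HDF5 chunk cache size for this file handle.
    with h5py.File(path, "r", rdcc_nbytes=cache_bytes) as f:
        dset = f["data"]                      # shape (512, 65536), float64
        total = dset.size * dset.dtype.itemsize
        start = time.perf_counter()
        for i in range(dset.shape[0]):        # read row by row, sequentially
            _ = dset[i, :]
        elapsed = time.perf_counter() - start
    return total / elapsed / 2**30

for cache in (128 * 2**10, 2**20, 8 * 2**20):  # 128 KB, 1 MB, 8 MB
    print(cache, sequential_read_gbps("bench.h5", cache))
```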

Now, for a chunk cache size of 1 MB (figure 'sequential-1MB.pdf'), it can
be seen that HDF5 can read at up to 1.6 GB/s (which is pretty good :-).
However, if I raise the chunk cache size to 8 MB, the peak performance drops
to a mere 1.0 GB/s, that is, almost 40% less. The Blosc compressor's
performance is also strongly affected by this (slower compressors like LZO or
Zlib do not show the effect as much because they are the obvious
bottleneck).

I've tried other cache sizes, and the smaller, the better (reaching a
throughput of almost 1.9 GB/s on my machine with a cache size of 128 KB).
Varying the number of slots in the cache does not seem to affect performance
much here.

My guess is that the culprit behind this significant performance penalty is
the chunk cache subsystem of HDF5. Could anyone confirm this? If this is
true, do you think that some optimization in that regard could be carried out
in the future?

Thanks,

sequential-8MB.pdf (18.3 KB)

sequential-1MB.pdf (18.3 KB)


--
Francesc Alted

I'd appreciate a bit more explanation of your methodology. You want
to test *I/O throughput* but at the same time you want to make sure
the data fits in memory cache. Are you not then just testing memory
bandwidth?

If I were running this benchmark I would be purging the memory cache
between every run: the chunk cache is designed to improve disk
performance, right?
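On Linux, purging the OS page cache between runs can be done with something
like this (a sketch assuming root access; not an HDF5 call):

```python
# Purge the Linux page cache between benchmark runs (needs root); this
# is an assumption about the test environment, not part of HDF5.
import subprocess

subprocess.run(["sync"], check=True)        # flush dirty pages first
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3")                            # drop page cache, dentries, inodes
```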

==rob


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

[Trying again to post a stripped-down version of a message that was previously
rejected by the list as too large]

Just for archival purposes,

I've been profiling the case below, and I have come up with more clues. I've
got profiles for different HDF5 cache sizes, namely 256 KB, 1 MB and 8 MB.
The dataset size is 2 GB and the chunksize for these tests was 512 KB.

When the cache size is 256 KB, the cache cannot hold anything (so it never
kicks in), the most time-consuming code is Python, and maximum throughput is
achieved (~1.8 GB/s). When the cache size is 1 MB, the HDF5 cache kicks in,
lots of memcpy calls happen, and throughput is a bit lower (~1.7 GB/s).
Finally, for 8 MB, the number of memcpy calls grows considerably (2x the
1 MB case), making the throughput fall to 1.0 GB/s.

Now, I don't completely understand what makes the 8 MB scenario call memcpy
2x more than the 1 MB case. Of course, this additional 2x should be
responsible for the evident degradation in performance.

Ideas?


--
Francesc Alted

Hi Rob,

On Friday 08 January 2010 16:10:01, Rob Latham wrote:


> I'd appreciate a bit more explanation of your methodology. You want
> to test *I/O throughput* but at the same time you want to make sure
> the data fits in memory cache. Are you not then just testing memory
> bandwidth?

Nope. I'd like to characterize a situation where I can maximize both
sequential and 'semi-random' access to a file. By 'semi-random' I mean an
access mode that performs random accesses within a certain row and then
repeats the operation with other rows. As the shape of my 2 GB dataset is
(512, 524288) --the shape stated in my previous message was wrong, sorry--,
I need a chunk cache of at least 4 MB so as to minimize the access time in
such a 'semi-random' mode. I'm attaching a couple of plots showing how the
8 MB cache works much better than the default cache size of 1 MB.
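To make the pattern concrete, a rough sketch (with assumed names; the real
code is in the attached optimal-chunksize.py):

```python
import numpy as np

def semi_random_read(dset, reads_per_row=1000):
    # dset: an open 2-D HDF5 dataset, e.g. shape (512, 524288), float64.
    rng = np.random.default_rng()
    nrows, ncols = dset.shape
    for row in range(nrows):
        # Many random accesses within one row, then move to the next row.
        for col in rng.integers(0, ncols, size=reads_per_row):
            _ = dset[row, col]
```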

Unfortunately, choosing 8 MB does have an important impact on the sequential
access mode, as explained in the OP. The thing is that I don't completely
understand why this is so (i.e. I don't understand well how the HDF5 chunk
cache works ;-).

I'm attaching the script that I'm using for this case, in case it helps
clarify things. It is written in Python/PyTables, but I think it is simple
enough to be understood, at least at a high level.

> If I were running this benchmark I would be purging the memory cache
> between every run: the chunk cache is designed to improve disk
> performance, right?

That's an interesting idea. How can the HDF5 chunk cache be purged? However,
my latest profiles don't suggest this would help. I've sent these profiles to
this list, but as they are screenshots of the KCachegrind graphical tool, the
total size of the message exceeds what is allowed on this forum. I'll try to
reduce the message size and send it again.

Thanks,

random-1MB.pdf (16.9 KB)

random-8MB.pdf (16.9 KB)

optimal-chunksize.py (3.45 KB)


--
Francesc Alted

Hi Francesc,

Why does HDF5 have to reserve space for each I/O operation? Space is only
needed for the cache and for the buffer supplied by the caller. The cache
space can be allocated only once, because all chunks have the same size.
I would expect the I/O and the cache to work like this:

The cache contains uncompressed chunks (otherwise the system's cache could
be used just as well). Reading (or writing) is done like:
- Iterate over the data to be read and determine which chunks to read.
- For each chunk, check whether it is in the cache. If so, copy the required
part of the data to the supplied buffer.
- If it is not in the cache, read that chunk into a (pre-allocated) buffer,
uncompress it into a free slot in the cache, and copy to the supplied
buffer. If there is no free slot, a chunk has to be removed from the cache
using e.g. a least-recently-used algorithm.
- If the data are not compressed (as in my case), the chunk can be read
directly into the cache slot.

If it is not working that way, I probably have a very incorrect view of
the purpose of the HDF5 cache. Quincey may be able to shed some light.
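As a toy, runnable model of the read path I have in mind (plain Python; the
disk-read and decompress steps are passed in as callables; this is my mental
model, not HDF5 internals):

```python
from collections import OrderedDict

class ChunkCache:
    """Toy model of the chunk cache described above."""
    def __init__(self, nslots):
        self.nslots = nslots
        self.slots = OrderedDict()   # chunk_id -> uncompressed chunk

    def get_chunk(self, chunk_id, read_from_disk, uncompress):
        chunk = self.slots.get(chunk_id)          # hash lookup
        if chunk is None:
            if len(self.slots) >= self.nslots:
                self.slots.popitem(last=False)    # evict least recently used
            chunk = uncompress(read_from_disk(chunk_id))
            self.slots[chunk_id] = chunk
        else:
            self.slots.move_to_end(chunk_id)      # mark as recently used
        return chunk
```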

I noticed in your pictures that, apart from H5FL_reg_malloc and _free, a lot
of calls to H5D_btree_cmp3 are made as well. I assume these can be expensive
calls, so I was wondering if they are the culprit. However, I don't
understand why it is using a B-tree comparison at all, because the
documentation says that the chunk cache uses a hash algorithm, not a B-tree,
to find chunks in the cache.

Note that casacore (which we are using as well) also supports chunking and
caching. It is much faster for the smaller I/O operations, so this looks
like an implementation issue.

Cheers,
Ger


Hi Ger,

2010/1/11 Ger van Diepen <diepen@astron.nl>:

> Why does HDF5 have to reserve space for each IO operation? Space is only
> needed for the cache and the buffer supplied by the caller.

You are right, unless the chunk cache is not being used. For example, in my
previous message, the tHDF5-x-4KB.png capture was taken when the HDF5 cache
size was set to just 4 KB while the chunksize was 8 KB, so the chunk cache
cannot be used (at least I guess so). At any rate, maybe I was wrong in
supposing that the H5FL_reg_free and H5FL_reg_malloc calls were made just to
free/malloc the space needed to read chunks.

But I agree that a more informed opinion about this would be more than
welcome :wink:


--
Francesc Alted

Hi Francesc,

This might be related to a problem I reported last June.
I did tests using a 3-dim array with various chunk shapes and access
patterns. It got very slow when iterating through the data by vector in
the Z direction. I believe it was filed as a bug by the HDF5 group. I
sent Quincey a test program that shows the behaviour. I'll forward that
mail and the test program to you, so you can try it out yourself if you
like.

I suspect the cache lookup algorithm to be the culprit: the larger the
cache and the more often it has to do lookups, the slower things get.
BTW, did you adapt the cache's hash size to the number of slots in the cache?

In your tests you only mention the chunk size, but not the chunk shape.
Isn't that important? It gives me the impression that in your tests the
data are stored and accessed fully sequentially, which makes the cache
useless.

Cheers,
Ger


Hi Ger,

On Friday 08 January 2010 08:25:09, you wrote:

> This might be related to a problem I reported last June.
> I did tests using a 3-dim array with various chunk shapes and access
> patterns. It got very slow when iterating through the data by vector in
> the Z direction. I believe it was filed as a bug by the HDF5 group. I
> sent Quincey a test program that shows the behaviour. I'll forward that
> mail and the test program to you, so you can try it out yourself if you
> like.
>
> I suspect the cache lookup algorithm to be the culprit: the larger the
> cache and the more often it has to do lookups, the slower things get.
> BTW, did you adapt the cache's hash size to the number of slots in the
> cache?

Thanks for your suggestion. I've been looking at your problem, and my
profiles seem to say that it is not a cache issue.

Have a look at the attached screenshots showing profiles for your test bed
reading along the x axis with a cache size of 4 KB (the HDF5 cache subsystem
does not kick in at all) and 256 KB (your size). I've also added a profile
for the tiled case for comparison purposes. For all the profiles (except
tiled), the bottleneck is clearly in the `H5FL_reg_free` and
`H5FL_reg_malloc` calls, no matter how large the cache size is (even if it
does not kick in).

I think this is expected, because HDF5 has to reserve space for each I/O
operation. When you walk the dataset along the x or y direction, you have to
do (32*2)x more I/O operations than in the tiled case, and HDF5 needs to
allocate (and free again!) (32*2)x more memory areas. Also, when you read
along the z axis, the number of extra allocate/release cycles is (32*32)x.
All of this is consistent both with the profiles and with running the
benchmark manually:

faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 t
real 0m0.057s
user 0m0.048s
sys 0m0.004s

faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 x
setting cache to 32 chunks (4096 bytes) with 3203 slots // forcing no cache
real 0m1.055s
user 0m0.860s
sys 0m0.168s

faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 y
setting cache to 32 chunks (262144 bytes) with 3203 slots
real 0m1.211s
user 0m1.176s
sys 0m0.028s

faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 z
setting cache to 5 chunks (40960 bytes) with 503 slots
real 0m14.813s
user 0m14.777s
sys 0m0.024s

So, in my opinion, there is little that HDF5 can do here. You had better
adapt the chunk shape to your most frequently used access pattern (if you
have just one, but I know that this is typically not the case).

> In your tests you only mention the chunk size, but not the chunk shape.
> Isn't that important? It gives me the impression that in your tests the
> data are stored and accessed fully sequentially, which makes the cache
> useless.

Yes, the chunk shape is important; sorry, I forgot this important detail.
As I mentioned in a previous message to Rob Latham, I want to optimize the
'semi-random' access mode within a certain row of the dataset, so I normally
choose the chunk shape as (1, X), where X is the value needed to obtain a
chunksize between 1 KB and 8 MB --if X is larger than the number of columns,
I expand the number of rows in the chunk shape accordingly.
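In code, that rule looks roughly like this (a sketch; the helper name and
the 8-byte itemsize for double precision are assumptions):

```python
def chunk_shape(target_bytes, ncols, itemsize=8):
    # Aim for a (1, X) chunk of roughly target_bytes; widen the chunk
    # over several rows if a single row is not long enough.
    x = max(1, target_bytes // itemsize)
    if x <= ncols:
        return (1, x)
    rows = -(-x // ncols)              # ceiling division
    return (rows, ncols)

# e.g. chunk_shape(8 * 2**20, 524288) -> (2, 524288), an 8 MB chunk
```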

Thanks,


--
Francesc Alted

Hi Francesc,


  Interesting results. I'll try to comment on a number of fronts:

  - The H5FL_* routines are an internal "free list" that the HDF5 library
uses in order to avoid calling malloc() as frequently. You can disable them
for your benchmarking by configuring the library with
"--enable-using-memchecker".

  - It looks like you are using the hyperslab routines quite a bit; are you
creating complicated hyperslabs?

  - The B-tree comparison routine is being used to locate the correct chunk
to perform I/O on, which may or may not be in the cache. Do your datasets
have fixed or unlimited dimensions? If they are fixed (or have only one
unlimited dimension), I can point you at an experimental branch with the new
chunk indexing features and we can see if that would help your performance.

  Quincey


Hi Quincey,

It uses fixed dimensions.
It iterates over vectors in the x, y or z direction. In the test the z
axis was quite short, so it results in many small hyperslab accesses.
Of course, I could access a plane at a time and extract the vectors myself,
but I would think that is just what the cache should do for me. Besides, I
do not always know if a plane fits in memory.

I can imagine that doing so many B-tree lookups is quite expensive. Do you
think that is the main performance bottleneck? But why so many B-tree
lookups? Because the chunk and array sizes are fixed, HDF5 can derive the
chunks to use from the hyperslab coordinates. Then it can first check
whether they are in the cache, which is (I assume) much faster than
searching the chunk B-tree.
Does the experimental branch improve things in this way?

Note that Francesc's original question is still open: why does performance
degrade when using a larger chunk cache?

Cheers,
Ger


Hi all,

I've been following this for a while, and while I still don't have a complete picture of what the benchmark is doing, I have an idea that might shed some light on this.

My understanding is that the chunk cache is set large enough to hold one vector of chunks but not one plane of chunks, correct? If the vectors are iterated over in the obvious way (i.e. one plane at a time) then each vector of chunks in the chunk cache will have to be thrown away once for every plane that passes through the chunks, obviously hurting performance. In this case, it would be better to disable the chunk cache entirely (set nbytes or nslots to 0), hence why Francesc sees better performance with the smaller chunk cache. Ideally, you would either use a chunk cache large enough for a plane of chunks, or iterate over the entire vector of chunks before moving on to the next vector of chunks in the plane.

Does this make sense?

Here are some (2-D) diagrams to help illustrate what I'm talking about:

 In cache
    |
    V
+----+----+----+
|X   |    |    |
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

+----+----+----+
|.X  |    |    |
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

+----+----+----+
|..X |    |    |
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

+----+----+----+
|...X|    |    |
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

 Evicted
    |    In cache
    V    V
+----+----+----+
|....|X   |    |
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

+----+----+----+
|....|.X  |    |
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

...
              In cache
                  V
+----+----+----+
|....|....|...X|
|    |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

 Read back into cache from disk for the second time
    |    Evicted
    V     V
+----+----+----+
|....|....|....|
|X   |    |    |
+----+----+----+
|    |    |    |
|    |    |    |
+----+----+----+

...
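In loop form, the two iteration orders differ only in their nesting; a
minimal sketch (assumed names; read_vector stands for whatever reads one
vector's worth of data from a given chunk column):

```python
def plane_order(nrows, nchunk_cols, read_vector):
    # One plane (row) at a time: every row crosses all chunk columns, so
    # with a cache that holds only one chunk, each chunk is evicted and
    # read back once per row.
    for row in range(nrows):
        for chunk_col in range(nchunk_cols):
            read_vector(row, chunk_col)

def chunk_order(nrows, nchunk_cols, read_vector):
    # Finish every row of one chunk column before moving on: each chunk
    # is read from disk only once, even with a one-chunk cache.
    for chunk_col in range(nchunk_cols):
        for row in range(nrows):
            read_vector(row, chunk_col)
```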

-Neil


Hi Neil,

Thanks for your answer.
I'm afraid I do not fully understand what you mean by a vector of chunks and
a plane of chunks.
Is a vector of chunks the collection of chunks you need to access an entire
vector in some direction? E.g., in a 3-dim case, to get i,*,k (a vector in
y)? To process all such vectors (e.g. to average them) you need to iterate
over i and k.
And is a plane of chunks all the chunks you need to get a plane, say, *,*,k?

I imagine the cache to be a collection of chunks, say, a std::vector<chunk>
in C++ terms. Ideally the cache should be so large that all the chunks you
need to get a hyperslab fit in it. It means that when getting the next
hyperslab, all the data needed are usually still in the cache. So at the
end, after iterating through the entire cube, all chunks have been read only
once, and a chunk should never be read back as you describe in your
pictures.

My idea would be that when reading, say, 0,*,0, HDF5 knows it needs, say,
chunks 0, 4, 8, and 12. So it can first test whether they are in the cache
(using the chunk cache hash) before searching the chunk B-tree. If the cache
is large enough, these tests almost always succeed, so a B-tree lookup needs
to be done only seldom (in fact, only when a chunk actually needs to be
read). I got the impression that HDF5 is doing a B-tree lookup for each
hyperslab access instead of using the approach I sketched, and I assume that
a B-tree lookup is much more expensive than a hash lookup.
But maybe my idea of what the chunk cache looks like and how HDF5 uses it is
totally wrong. Quincey or you can correct me.
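In code, the lookup order I have in mind would be something like this (a
sketch with hypothetical names, not HDF5 internals; the cache is a plain
dict keyed by chunk coordinates):

```python
def locate_chunk(coords, chunk_shape, cache, btree_find, load_chunk):
    # Derive the chunk coordinates arithmetically from the element
    # coordinates; try the cheap hash lookup first and fall back to the
    # expensive B-tree search only on a cache miss.
    chunk_coords = tuple(c // s for c, s in zip(coords, chunk_shape))
    chunk = cache.get(chunk_coords)            # chunk cache hash lookup
    if chunk is None:
        addr = btree_find(chunk_coords)        # B-tree search, only on a miss
        chunk = load_chunk(addr)
        cache[chunk_coords] = chunk
    return chunk
```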

In my test program the cache is sized that way. The test program uses 3-dim
arrays and can iterate over vectors in any direction (say, get i,*,k for all
i and k) or iterate over chunks. I'll attach the test program, so you can
see it. It knows the access pattern, so it sizes the cache beforehand.
Note that in the vector case the naive iteration is over all i and then over
all k, which requires a large cache. You can do better than that by first
iterating over the chunks in i and k and then iterating over i and k inside
each chunk. That requires a much smaller cache.

We also use the casacore package, which contains an RDBMS-like table system
that stores arrays in a chunked way and uses a cache when accessing them. It
performs way better when iterating over vectors in the way described above,
so it should be possible to improve HDF5 considerably.
BTW, casacore is moving towards using mmap when possible, so the OS takes
care of the caching.

Cheers,
Ger

tHDF5.cc (13.3 KB)

Hi Ger,

On Jan 14, 2010, at 1:59 AM, Ger van Diepen wrote:

> I'm afraid I do not fully understand what you mean by a vector of chunks
> and a plane of chunks.
> Is a vector of chunks the collection of chunks you need to access an
> entire vector in some direction? E.g., in a 3-dim case, to get i,*,k (a
> vector in y)? To process all such vectors (e.g. to average them) you need
> to iterate over i and k.
> And is a plane of chunks all the chunks you need to get a plane, say,
> *,*,k?

  Yes, I believe that's what Neil meant.

> I imagine the cache to be a collection of chunks, say, a
> std::vector<chunk> in C++ terms. Ideally the cache should be so large
> that all the chunks you need to get a hyperslab fit in it. It means that
> when getting the next hyperslab, all the data needed are usually still in
> the cache. So at the end, after iterating through the entire cube, all
> chunks are read only once. So a chunk should never be read back as you
> describe in your pictures.

  Yes, that's the ideal situation, but you've got to set the cache size large enough to make this occur.

> My idea would be that when reading, say, 0,*,0, HDF5 knows it needs, say,
> chunks 0, 4, 8, and 12. So it can first test if they are in the cache
> (using the chunk cache hash) before searching the chunk B-tree. If the
> cache is large enough, these tests are almost always true. So a B-tree
> lookup needs to be done only seldom (in fact, only when a chunk needs to
> be read). I got the impression that HDF5 is doing a B-tree lookup for
> each hyperslab access instead of using the approach I sketched. I assume
> that a B-tree lookup is much more expensive than a hash lookup.
> But maybe my idea of what the chunk cache looks like and how HDF5 uses it
> is totally wrong. Quincey or you can correct me.

  The B-tree lookup is currently what maps the dataspace coordinates to the address of a chunk, and then the chunk in the cache is looked up by its address. That being said, I see no reason why the chunks in the cache couldn't have the dataspace coordinates attached to them and have the cache searched before the B-tree is queried. I'll file a bug for this in our issue tracker.

> In my test program the cache is sized that way. The test program uses
> 3-dim arrays and can iterate over vectors in any direction (say, get
> i,*,k for all i and k) or iterate over chunks. I'll attach the test
> program, so you can see it. It knows the access pattern, so it sizes the
> cache beforehand.
> Note that in the vector case the naive iteration is over all i and then
> over all k, which requires a large cache. You can do better than that by
> first iterating over the chunks in i and k and then iterating over i and
> k inside these chunks. It requires a much smaller cache.
>
> We also use the casacore package containing an RDBMS-like table system
> that stores arrays in a chunked way and uses a cache when accessing them.
> It performs way better when iterating over vectors in the way described
> above. So it should be possible to improve HDF5 considerably.
> BTW, casacore is moving towards using mmap when possible, so the OS takes
> care of caching.

  Yes, I think if we improve our cache lookup strategy, we should be able to improve our I/O performance also. Thanks for the idea!

  Quincey
