Hi Francesc,
Why does HDF5 have to reserve space for each I/O operation? Space is only
needed for the cache and for the buffer supplied by the caller, and the
cache space can be allocated just once because all chunks have the same size.
I would expect the I/O and the cache to work roughly like this (see the
sketch below):
The cache contains uncompressed chunks (otherwise the operating system's
file cache could be used just as well).
Reading (or writing) is then done like:
- Iterate over the data to be read and determine which chunks have to be read.
- For each chunk, check whether it is in the cache. If so, copy the required
part to the supplied buffer.
- If it is not in the cache, read that chunk into a (pre-allocated) buffer,
uncompress it into a free cache slot, and copy the required part to the
supplied buffer. If there is no free slot, a chunk has to be evicted from
the cache first, e.g. using a least-recently-used algorithm.
- If the data are not compressed (as in my case), the chunk can be read
directly into the cache slot.
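Something along these lines, as a toy C sketch only; Slot, ChunkCache and
the two extern helpers are made-up names, and this is certainly not how
HDF5 is actually implemented:

/* Toy sketch of the scheme described above -- NOT HDF5's actual code.
 * All chunks have the same uncompressed size, so the slots can be
 * allocated once, up front. */

#define NSLOTS 32

typedef struct {
    long  chunk_index;   /* which chunk of the dataset; -1 = empty slot */
    long  last_used;     /* tick for the least-recently-used policy     */
    char *data;          /* uncompressed chunk bytes                    */
} Slot;

typedef struct {
    Slot slots[NSLOTS];
    long tick;
} ChunkCache;

/* Hypothetical I/O and codec helpers, assumed to exist elsewhere. */
extern void read_chunk_from_disk(long chunk_index, char *raw);
extern void uncompress_chunk(const char *raw, char *uncompressed);

/* Return a pointer to the uncompressed chunk, reading it if necessary. */
static const char *cache_get(ChunkCache *c, long chunk_index, char *scratch)
{
    Slot *victim = &c->slots[0];
    for (int i = 0; i < NSLOTS; i++) {
        Slot *s = &c->slots[i];
        if (s->chunk_index == chunk_index) {      /* hit: just bump LRU    */
            s->last_used = ++c->tick;
            return s->data;
        }
        if (s->last_used < victim->last_used)     /* remember the LRU slot */
            victim = s;
    }
    /* Miss: evict the LRU slot and refill it.  If the data were not
     * compressed, one could read straight into victim->data instead. */
    read_chunk_from_disk(chunk_index, scratch);
    uncompress_chunk(scratch, victim->data);
    victim->chunk_index = chunk_index;
    victim->last_used = ++c->tick;
    return victim->data;
}

The caller would then just copy the relevant part of the returned chunk
into the supplied buffer.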
If it is not working that way, I probably have a very incorrect view of
the purpose of the HDF5 cache. Quincey may be able to shed some light.
I noticed in your pictures that, apart from H5FL_reg_malloc and _free, a
lot of calls to H5D_btree_cmp3 are made as well. I assume these can be
expensive calls; many of them are to be expected, so I was wondering
whether they are the culprit. However, I don't understand why a btree
comparison is used at all, because the documentation says that the chunk
cache uses a hash algorithm, not a btree, to find chunks in the cache.
Note that casacore (which we use as well) also supports chunking and
caching. It is much faster for the smaller I/O operations, so this looks
like an implementation issue.
Cheers,
Ger
Francesc Alted 01/11/10 7:07 AM >>>
Hi Ger,
On Friday 08 January 2010 08:25:09, you wrote:
Hi Francesc,
This might be related to a problem I reported last June.
I did tests using a 3-dim array with various chunk shapes and access
patterns. It got very slow when iterating through the data by vector in
the Z-direction. I believe it was filed as a bug by the HDF5 group. I
sent a test program to Quincey that shows the behaviour. I'll forward
that mail and the test program to you, so you can try it out yourself
if you like.
I suspect the cache lookup algorithm to be the culprit. The larger the
cache and the more often it has to do a lookup, the slower things get.
BTW, did you adapt the cache's hash size to the number of slots in the
cache?
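For reference (this is not from the original mails): with HDF5 1.8 both the
cache size and the number of hash slots can be tuned per dataset with
H5Pset_chunk_cache(); the dataset name and the sizes in this sketch are
invented, and the docs suggest making rdcc_nslots a prime roughly 10-100
times the number of chunks that fit in rdcc_nbytes, so that hash collisions
stay rare:

#include "hdf5.h"

hid_t open_with_tuned_cache(hid_t file_id)
{
    hid_t  dapl        = H5Pcreate(H5P_DATASET_ACCESS);
    size_t rdcc_nbytes = 32 * 8192;   /* room for 32 chunks of 8 KB     */
    size_t rdcc_nslots = 3203;        /* prime, ~100x the 32 chunks     */
    double rdcc_w0     = 0.75;        /* eviction preference (default)  */

    H5Pset_chunk_cache(dapl, rdcc_nslots, rdcc_nbytes, rdcc_w0);
    hid_t dset = H5Dopen2(file_id, "/data", dapl);
    H5Pclose(dapl);
    return dset;
}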
Thanks for your suggestion. I've been looking at your problem, and my
profiles seem to say that it is not a cache issue.
Have a look at the attached screenshots showing profiles for your test bed
reading along the x axis with a cache size of 4 KB (the HDF5 cache
subsystem does not kick in at all) and with one of 256 KB (your size).
I've also added a profile for the tiled case for comparison purposes.
For all the profiles (except tiled), the bottleneck is clearly in the
`H5FL_reg_free` and `H5FL_reg_malloc` calls, no matter how large the cache
size is (even when the cache does not kick in).
I think this is expected, because HDF5 has to reserve space for each I/O
operation. When you walk the dataset along the x or y direction, you have
to do (32*2)x more I/O operations than for the tiled case, and HDF5 needs
to allocate (and free again!) (32*2)x more memory areas: each 32x32x2 chunk
is crossed by 32*2 x- or y-vectors, so it is visited that many times
instead of just once. Likewise, when you read along the z axis, the number
of extra allocate/free cycles is (32*32)x, because each chunk is crossed by
32*32 z-vectors (see the sketch after the timings below). All of this is
consistent with both the profiles and with running the benchmark manually:
faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 t
real 0m0.057s
user 0m0.048s
sys 0m0.004s
faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 x
setting cache to 32 chunks (4096 bytes) with 3203 slots // forcing no
cache
real 0m1.055s
user 0m0.860s
sys 0m0.168s
faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 y
setting cache to 32 chunks (262144 bytes) with 3203 slots
real 0m1.211s
user 0m1.176s
sys 0m0.028s
faltet@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 z
sys 0m0.024s
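Here is the sketch referred to above. It is not Ger's actual tHDF5 code;
the element type (float), the dimension order and the dataset handle are
assumptions. It only illustrates why walking the array one x-vector at a
time issues one H5Dread() per vector, and therefore one internal
allocate/free pair per call:

#include "hdf5.h"

/* Assumed dataset: dims {1024, 1024, 10} (x first, as on the command
 * line), chunks of 32x32x2.  Reading vector by vector along x issues
 * 1024*10 = 10240 H5Dread() calls, each crossing 32 chunks, versus a
 * single read per chunk in the tiled case. */
static void read_x_vectors(hid_t dset)
{
    hsize_t nx = 1024, ny = 1024, nz = 10;
    float   vec[1024];
    hsize_t mdims[1] = { nx };
    hid_t   mspace = H5Screate_simple(1, mdims, NULL);
    hid_t   fspace = H5Dget_space(dset);

    for (hsize_t z = 0; z < nz; z++) {
        for (hsize_t y = 0; y < ny; y++) {
            hsize_t start[3] = { 0, y, z };       /* one whole x-vector */
            hsize_t count[3] = { nx, 1, 1 };
            H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL,
                                count, NULL);
            H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace,
                    H5P_DEFAULT, vec);   /* per-call alloc/free inside HDF5 */
        }
    }
    H5Sclose(fspace);
    H5Sclose(mspace);
}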
So, in my opinion, there is little that HDF5 can do here. Your best bet is
to adapt the chunk shape to your most frequent access pattern (if you have
just one, but I know that this is typically not the case).
In your tests you only mention the chunk size, but not the chunk
shape.
Isn't that important? It gives me the impression that in your tests
the
data are stored and accessed fully sequentially which makes the cache
useless.
Yes, the chunk shape is important; sorry, I forgot this detail. As I
mentioned in a previous message to Rob Latham, I want to optimize a 'semi-
random' access mode within a certain row of the dataset, so I normally
choose the chunk shape as (1, X), where X is the value needed to obtain a
chunksize between 1 KB and 8 MB --if X is larger than the maximum number of
columns, I expand the number of rows in the chunk shape accordingly (see
the sketch below).
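A simplified sketch of that heuristic (the function name, the target size
argument and the rounding below are illustrative only, not the actual code
used):

#include <stddef.h>

static void guess_chunk_shape(size_t ncols, size_t itemsize,
                              size_t target_bytes,       /* e.g. 64*1024 */
                              size_t *chunk_rows, size_t *chunk_cols)
{
    size_t x = target_bytes / itemsize;   /* elements that fit the target */
    if (x == 0)
        x = 1;
    if (x <= ncols) {
        *chunk_rows = 1;                  /* the usual (1, X) case        */
        *chunk_cols = x;
    } else {
        *chunk_rows = x / ncols;          /* X exceeds the row length:    */
        *chunk_cols = ncols;              /* grow the rows instead        */
    }
}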
Thanks,
--
Francesc Alted