Hi Dimitris,
I do not want to change the chunk size; that is fixed once you've
created a dataset.
What I do want to be able to define dynamically is the chunk CACHE
size.
Suppose you have a dataset with axes z,y,x, shape [1000,1000,1000], and
chunk shape [10,10,100].
- When stepping through the array with a cursor shape of [10,10,100],
the cache needs to contain one chunk only, as the cursor matches the chunk shape.
- When stepping through the array with cursor [1,1,1000], the cache
should contain 10 chunks. This is because you need 10 chunks to get the
full vector and you want those chunks to be kept in memory. However,
this is only true if you step through the array in optimal order (thus
10 steps in y, thereafter 10 in z, etc.). When stepping naively like:
for (iz=0; iz<1000; ++iz)
  for (iy=0; iy<1000; ++iy)
    read hyperslab of shape [1,1,1000] at position [iz,iy,0]
the cache should be 100 times bigger, or you have to accept that the
same chunk gets read 10 times.
I wonder how current HDF5 users do this kind of thing. Are their
arrays so small that they can always be held in memory? Or do they
accept that the cache is too small and is thrashing?
Probably this kind of iteration functionality should be put in a layer
on top of HDF5. But HDF5 should provide the means to set the chunk cache
size dynamically.
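For reference, the only knob the library exposes today is H5Pset_cache on the file access property list, and it takes effect only when the file is opened, which is exactly the limitation being discussed. A minimal sketch (the slot count and function name openWithCache are illustrative):

```cpp
#include <hdf5.h>

// Sketch: size the raw-data chunk cache before opening the file.
// It applies to the whole file and cannot be changed afterwards.
hid_t openWithCache(const char *name, size_t cacheBytes) {
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    // args: metadata cache elements, hash-table slots (a prime larger
    // than the number of cached chunks reduces collisions), cache
    // size in bytes, preemption policy w0 (0.75 is the documented
    // default).
    H5Pset_cache(fapl, 0, 521, cacheBytes, 0.75);
    hid_t file = H5Fopen(name, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}
```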
I don't know whether sec2 uses mmap; I thought it used unbuffered I/O only.
Cheers,
Ger
>>> "Dimitris Servis" <servisster@gmail.com> 03/20/08 12:10 PM >>>
Hi Ger,
this is a very interesting issue indeed and I have similar datasets as
well.
However, there are some things I am not sure I understand completely:
1) As far as I can tell, chunk size relates more to an optimal
read/write strategy for the dataset. This means that if the dataset is
resizable, I vary my chunk size according to use cases (or access
patterns) like: (a) write once a fixed dataset and read frequently,
(b) write once a variable dataset and read frequently, (c) update/resize
a lot and read seldom or rarely; and further refinements, depending on
whether each action takes place on a local or remote machine, whether
resizing takes place and whether the new size can be foreseen, whether
there is a different leading dimension when reading and writing, and so
on. But chunking relates to the allocated blocks on disk AFAIK. For a
resizable dataset it is clear that this is set at creation and cannot be
changed, as chunks may be scattered within the file. If you change the
access pattern at that point and want to change the chunk size, wouldn't
you have to rewrite the whole dataset?
2) H5Pset_buffer can be used with the iterator you described, in order
to accommodate the slab selection (or any multiple of it). Of course
this is an application-dependent strategy, and IMHO it is better that it
is not predefined by the library. Therefore I write my selection
iterators at a higher level, and can adjust my strategy depending on the
urgency of the task.
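One caveat worth noting: H5Pset_buffer sizes the datatype conversion (and background) buffer on a dataset transfer property list, so it helps when a read involves type conversion, but it does not enlarge the chunk cache itself. A sketch (the size and function name makeXferPlist are illustrative):

```cpp
#include <hdf5.h>

// Sketch: give the transfer a conversion buffer large enough for a
// whole [1,1,1000] hyperslab of doubles, so a converted read is not
// strip-mined into multiple passes.
hid_t makeXferPlist() {
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_buffer(dxpl, 1000 * sizeof(double), NULL, NULL);
    // pass dxpl as the transfer property list to H5Dread/H5Dwrite
    return dxpl;
}
```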
3) If I recall correctly, the sec2 driver does use memory mapping
anyway. Am I missing something?
BR
-- dimitris
Hi Quincey,
It should be clear that this discussion is about very large arrays not
fitting in memory. Otherwise you may as well read the entire array and
do operations in memory.
IMHO HDF5 cannot prescribe access patterns; it can advise them though.
Furthermore it could define, or let the user set, a maximum cache size
(which could be really large, potentially several GB on machines with a
lot of memory).
The best cursor shape is the chunk shape. So if an application does not
care about order (e.g. to determine the min/max of an array), it should
use that one.
We have applications where we need to determine the median for each
vector or plane in a cube. This can be a vector or plane in any
direction, so my cursor shape can be (nx,1,1) or (1,ny,1), etc. It would
be nice if I could write such a loop as (in C++ terms):
DataSetIterator iter(cursorshape);
while (iter.next()) {
    // iter.data() gives a pointer to the data
    // iter.position() gives the blc of the current cursor position
}
In this way HDF5 can execute the iteration in the optimal order (and set
the cache size for me) without the user having to worry about it.
It would also be nice if the chunk cache kept statistics which I can
display on demand (showing the number of reads, writes, and cache hits)
to see if the cache behaves as expected.
Another option would be to mmap the dataset, so the OS will do the
caching for you. Of course, this works only on systems where it is
possible. It would probably make life much easier for you.
Cheers,
Ger
>>> Quincey Koziol <koziol@hdfgroup.org> 03/18/08 4:05 PM >>>
Hi Ger,
> I'm using HDF5 to hold 3- or 4-dim data arrays (which can be several
> GBytes). The access patterns to the data can vary (even within one
> application), so the data are stored in a chunked way. Unfortunately,
> control over the chunk cache size seems to be very limited.
> I would like to be able to set the cache size each time the access
> pattern changes. However, as far as I know I can only set the cache
> size before opening (or creating) the file. It is not even possible
> to set it per dataset.
Yes, that's currently true. We're in the process of revising the
chunk cache itself and I'm guessing that we'll want to revise the API
that controls it also.
> For the time being I define the cache size as 16 MBytes, but it is
> bound to mismatch some access patterns, resulting in a great
> performance loss.
>
> What I would like is the ability to set the cache size per dataset
> in a dynamic way: thus not statically before opening the file or
> dataset, but at any time.
> Is it possible to add that to HDF5? Haven't other people felt that
> need? To me this seems quite fundamental; otherwise you cannot take
> full advantage of chunking.
> What I would like most is that I can tell HDF5 the access pattern
> (i.e. cursor size and in which order it iterates over the axes) and
> let it sort out the optimal cache size.
Hmm, I hadn't thought about that sort of replacement algorithm for
evicting the chunks in the cache - I was planning on implementing an
LRU algorithm with a fixed-size cache. I would be worried about
allowing an application to potentially pick a "bad" order over the
axes and end up with a very large number of chunks cached.
I'm very interested in hearing any ideas you (or others) might have
for how you think the eviction algorithm could work and the API needed
to control it.

Quincey

2008/3/20, Ger van Diepen <diepen@astron.nl>:
On Mar 18, 2008, at 3:08 AM, Ger van Diepen wrote:
----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to
hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to
hdf-forum-unsubscribe@hdfgroup.org.