Francesc et al.:
Just to elaborate a little bit on Quincey's
"slicing the wrong way" explanation. (I hope I'm not just confusing
matters.)
If possible you want to design the shape of the
chunk so that you get the most useful data with
the fewest accesses. If accesses are
mostly contiguous elements along a certain
dimension, you shape the chunk to contain the
most elements along that dimension. If accesses
are random shapes and sizes, then it gets a
little tricky -- we just generally recommend a
square (cube, etc.), but that may not be as good
as, say, a shape that has the same proportions as your dataset.
So, for instance, if your dataset is 3,000x6,000
(3,000 rows, 6,000 columns) and you always access
a single column, then each chunk should contain
as much of a column as possible, given your best
chunk size. If we assume a good chunk size is
600 elements, then your chunks would all be
600x1, and accessing any column in its entirety
would take 5 accesses (3,000 rows / 600 rows per
chunk). Having each chunk be a part of a row
(1x600) would give you the worst performance in
this case, since you'd need to access 3,000
chunks to read a single column.
If accesses are unpredictable, perhaps a chunk
shape of 30x60 would be best, as the worst case
(reading a single column or row) would take 100
accesses. (By worst case, I'm thinking of the
case where you have to do the most accesses per
useful data element.)
Other cases, such as when you don't care about
performance slicing one way but really do slicing
another, would call for a chunk shaped
accordingly.
Mike
At 11:01 AM 12/4/2007, Quincey Koziol wrote:
>Hi Francesc,
>
>On Dec 3, 2007, at 11:21 AM, Francesc Altet wrote:
>>On Monday 03 December 2007, Francesc Altet wrote:
>>>Oops, I ended up with a similar program and sent it to the
>>>hdf-forum@hdfgroup.org list last Saturday. I'm attaching my own
>>>version (which is pretty similar to yours). Sorry for not sending
>>>you a copy of my previous message; it could have saved you some
>>>work :-/
>>
>>Well, as Ivan pointed out, a couple of glitches slipped into my
>> program. I'm attaching the correct version, but the result is the
>> same: when N=600, I get a segfault under both HDF5
>> 1.6.5 and 1.8.0 beta5.
>
>        I was able to duplicate the segfault with your program, but
>it was a stack overflow; if you move the "data" array out of main() and
>make it a global variable, things run to completion without error.
>It's _really_ slow and chews _lots_ of memory still (because you are
>slicing the dataset the "wrong" way), but everything seems to be
>working correctly.
>
>        It's somewhat hard to fix the "slicing the wrong way"
> problem, because the library builds a list of all the chunks
> that will be affected by each I/O operation (so that we can do all
> the I/O on each chunk at once), and right now that has some memory
> issues when an I/O operation affects so many chunks. Building a
> list of all the affected chunks is good for
> the parallel I/O case, but could be avoided in the serial I/O case,
> I think. However, that would probably make the code difficult to
> maintain... :-/
>
> You could try adjusting the chunk cache size larger, which
> would probably help, if you make it large enough to hold all the
> chunks for the dataset.
>
> Quincey
>
>
>----------------------------------------------------------------------
>This mailing list is for HDF software users discussion.
>To subscribe to this list, send a message to
> hdf-forum-subscribe@hdfgroup.org. To unsubscribe, send a message to
> hdf-forum-unsubscribe@hdfgroup.org.
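As a back-of-the-envelope check on Quincey's chunk-cache suggestion above, the sketch below (plain Python, not the HDF5 API; the 8-byte doubles and shapes are assumptions for illustration) computes how large the cache would have to be to hold every chunk of the dataset. In the C API, the resulting byte count would go into the rdcc_nbytes argument of H5Pset_cache() on the file access property list.

```python
from math import ceil, prod

def chunk_cache_bytes(dataset_shape, chunk_shape, elem_size):
    """Bytes needed to cache every chunk of a dataset at once."""
    chunks_per_dim = [ceil(d / c) for d, c in zip(dataset_shape, chunk_shape)]
    n_chunks = prod(chunks_per_dim)
    return n_chunks * prod(chunk_shape) * elem_size

# 3,000 x 6,000 doubles, chunked 1x600 (the "wrong way" for column reads):
print(chunk_cache_bytes((3000, 6000), (1, 600), 8))  # 144000000 (~144 MB)
```

Since the chunks tile the whole dataset, this comes out to roughly the full dataset size, which matches Quincey's point that the cache only helps here if it is large enough to hold all the chunks.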
--
Mike Folk The HDF Group http://hdfgroup.org 217.244.0647
1901 So. First St., Suite C-2, Champaign IL 61820