What you are seeing is "expected" (just maybe not to you! :-).
Because you haven't defined a fill value and haven't changed the fill time
from H5D_FILL_TIME_IFSET to H5D_FILL_TIME_ALLOC, the HDF5 library won't
touch your buffer when a chunk doesn't have any data written to it - you
haven't told it to.
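(For completeness, here's a minimal sketch of how the dataset creator
would "tell it to", assuming the 1.8 API; "file_id" and "space_id" are
placeholders for handles that already exist, and error checking is
omitted:)

    hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk_dims[1] = {4};
    int     fill = 0;                        /* explicit fill value */

    H5Pset_chunk(dcpl, 1, chunk_dims);

    /* Either define a fill value, so the default H5D_FILL_TIME_IFSET
     * has something to put into the buffer for unwritten chunks... */
    H5Pset_fill_value(dcpl, H5T_NATIVE_INT, &fill);

    /* ...or keep the default fill value and change the fill time: */
    /* H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC); */

    hid_t dset = H5Dcreate2(file_id, "data", H5T_NATIVE_INT, space_id,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);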
Thanks for the clarification! So that I'm sure I understand this...
the expected behavior for chunked datasets with (1) no fill value
explicitly set, and (2) with the fill time set to H5D_FILL_TIME_IFSET
(the default), is that portions of the destination buffer
corresponding to uninitialized chunks are not touched?
There are still some behaviors which confuse me... I've attached an
(even simpler) C program. Apologies for the long email.
1. For both contiguous and chunked datasets, reading from an
uninitialized dataset (i.e. just created) returns 0 for every element.
We should probably make this consistent for the chunked dataset case - an uninitialized dataset should not change the buffer. (I'm not arguing about whether this is correct, just that it should be consistent.)
2. For chunked datasets, after writing anything at all to any portion
of the dataset, the behavior you described kicks in; portions of the
buffer corresponding to uninitialized chunks are not touched (a
minimal sketch of this case appears below, after point 3).
This is "expected" behavior.
3. For contiguous datasets, a default fill value of 0 continues to be
provided for uninitialized sections, no matter what I do.
The HDF5 library doesn't write those 0's to the uninitialized sections of the dataset; they are set to 0 by the operating system.
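(Not the attached program, but a minimal sketch of the chunked case
from point 2, assuming a 1-D, 16-element H5T_NATIVE_INT dataset with
4-element chunks; this goes inside main() with hdf5.h included, and
error checking / H5*close() calls are omitted:)

    hsize_t dims[1]  = {16}, chunk[1] = {4};
    hsize_t start[1] = {0},  count[1] = {4};
    int     buf[16];
    int     i;

    hid_t file  = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);      /* no fill value, default fill time */
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Write only the first chunk (elements 0-3). */
    hid_t fspace = H5Dget_space(dset);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    for (i = 0; i < 4; i++) buf[i] = 42;
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);

    /* Load the read buffer with a sentinel, then read the full extent.
     * Elements 4-15 come back still holding the sentinel, i.e. untouched. */
    for (i = 0; i < 16; i++) buf[i] = -1;
    H5Dread(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);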
I don't understand why the behavior for regions of the dataset which
haven't been explicitly initialized is so radically different for the
chunked and contiguous cases. If I'm reading from a chunked dataset
to a destination selection, the parts of the selection corresponding
to uninitialized chunks in the file will be silently skipped. For
contiguous datasets, these sections always have the user-defined fill
value, or 0.
Again - the HDF5 library isn't being inconsistent for contiguous datasets; the file system is filling in those 0 values. You will see the same behavior for partially written chunks (without fill values).
I can't really think of a case in which the current skipping behavior
is beneficial, considering it applies to an arbitrary (how do I tell
what chunks are "real"?) subset of the destination buffer. This
becomes a problem in the case of complex selections, where it isn't
feasible to memset the destination selection before reading. I don't
understand why when I ask for a selection from a dataset, HDF5 would
ever skip any of it. If I wanted to leave part of my buffer
unmodified, I could simply not select it!
I understand your frustration with this behavior, but what is the HDF5 library supposed to do? You haven't given a fill value and you've left the fill time at "ifset", so the library doesn't have any values for that chunk - there's literally nothing to give you. You just can't detect the problem for contiguous datasets...
For a concrete example, let's say I have an existing 16-element array
in memory containing some data:
XXXX XXXX XXXX XXXX
Now I want to update the first 8 elements (of 16) by reading from a
dataset. Coincidentally, the person who created the dataset only
wrote to the first 4 elements ("." is an unwritten element) and did
not explicitly set a fill value:
YYYY .... .... ....
When the dataset has contiguous storage, this is the result:
YYYY 0000 XXXX XXXX
When the dataset has chunked storage (and the default options), this
is the result:
YYYY XXXX XXXX XXXX
However, if I update the first 8 elements from a *completely
uninitialized dataset* (both contiguous and chunked), this is the
result:
0000 0000 XXXX XXXX
From the perspective of someone reading a dataset which has already
been created, how do I tell HDF5 that I always want the fill value (or
0) applied? Is there some way to set the "read-time" fill strategy?
How can I force the "contiguous-style" behavior when reading from a
chunked dataset created with the default options?
I certainly understand your desire for addressing this issue, particularly since, as you say, there's no way for applications to determine which chunks are allocated.
Here are the most obvious options that occur to me:
1 - Leave things alone - the application queries for the fill-value and the fill-time, and if the combination indicates that there could be an issue with missing chunks, the application pre-fills the buffer with the fill-value of its choice (a rough sketch of this query-and-prefill approach appears below). I don't like this choice very much, since an application would have to be conservative and may end up doing a lot of work for no benefit (if it pre-fills and then all the chunks do exist).
2 - Make H5Dread() return an error when attempting to read a non-existent chunk if there's no fill value available. This would break existing programs, so I'm just including it for completeness; I don't think we should do this.
3 - Make a new dataset transfer property for filling in the values of missing chunks (which is similar to, but not the same as, the role played by fill values). This could be taken a step further and stored with the dataset's metadata, so it persists from application to application. I think this might be too subtle of a distinction for most application developers/users.
4 - Change how fill values operate, so that the contiguous dataset's behavior is mimicked (with zeroes used on a read of missing chunks when there's no fill value and the fill-time is "ifset"). I'm a little reluctant about this solution for two reasons: the zeroes are just an arbitrary choice for operating systems to use for unwritten bytes in a file, and we'd be [partially] modifying existing behavior (although maybe the existing behavior is a bug?).
I think I'm leaning toward option #4, but are there any opinions from others on the forum?
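(To make option #1 concrete, here's a rough sketch of the
query-and-prefill dance from the reading application's side; "dset",
"buf", "mem_space" and "file_space" are assumed to already exist, the
data is taken to be H5T_NATIVE_INT, and the pre-fill test below only
covers the combination discussed in this thread, not every possible
property combination:)

    hid_t            dcpl = H5Dget_create_plist(dset);
    H5D_fill_value_t fv_status;
    H5D_fill_time_t  fill_time;

    H5Pfill_value_defined(dcpl, &fv_status);
    H5Pget_fill_time(dcpl, &fill_time);

    /* "ifset" fill time with no user-defined fill value (or fill time
     * "never") means unallocated chunks won't be filled for us. */
    if (fill_time == H5D_FILL_TIME_NEVER ||
        (fill_time == H5D_FILL_TIME_IFSET &&
         fv_status != H5D_FILL_VALUE_USER_DEFINED)) {
        int my_fill = 0;   /* whatever value the application wants */

        /* H5Dfill() writes the value into just the selected elements of
         * the memory buffer, so complex selections are handled too. */
        H5Dfill(&my_fill, H5T_NATIVE_INT, buf, H5T_NATIVE_INT, mem_space);
    }

    H5Dread(dset, H5T_NATIVE_INT, mem_space, file_space, H5P_DEFAULT, buf);
    H5Pclose(dcpl);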
On Jul 20, 2009, at 4:18 PM, Andrew Collette wrote: