Resizing a dataset breaks fill value

Hi,

I just reproduced in C an issue I noticed with my Python project. If
you create a chunked dataset and resize it (H5Dset_extent)
small->big->small->big, the correct fill value is not applied for
subsequent reads. For example, if I create a 4x4 dataset (2x2 chunks)
and expand it to 16x16, the new space has a fill value of 0. But if
I resize it back to 4x4 and then back up to 16x16 again, only the 4x4
patch is still correct; the rest of the buffer is not even touched.

A workaround is to set the fill time to H5D_FILL_TIME_ALLOC; however,
I'm not sure what side effects this has. The documentation for
H5Dset_extent specifically says that H5D_FILL_TIME_IFSET (the default)
should work.
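
In case it helps, here is roughly what I'm doing, condensed into a
sketch (illustrative, not the attached file; error checking omitted):

    #include "hdf5.h"

    int main(void)
    {
        hsize_t small[2]   = {4, 4}, big[2] = {16, 16};
        hsize_t maxdims[2] = {H5S_UNLIMITED, H5S_UNLIMITED};
        hsize_t chunks[2]  = {2, 2};

        hid_t file  = H5Fcreate("resize.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
                                H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, small, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

        H5Pset_chunk(dcpl, 2, chunks);
        /* The workaround: write fills at allocation time instead of the
         * default H5D_FILL_TIME_IFSET. */
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC);

        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* The small->big->small->big cycle that loses the fill: */
        H5Dset_extent(dset, big);
        H5Dset_extent(dset, small);
        H5Dset_extent(dset, big);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }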

I'm ccing hdf-forum in case other people have this problem.

Thanks,
Andrew Collette

chunk_glitch.c (2.25 KB)

Hi Andrew,

  What you are seeing is "expected" (just maybe not to you! :-). Because you haven't defined a fill value and haven't changed the fill time from H5D_FILL_TIME_IFSET to H5D_FILL_TIME_ALLOC, the HDF5 library won't touch your buffer when a chunk doesn't have any data written to it - you haven't told it to.

  I agree with you about the documentation for H5Dset_extent(); we should change it from:

         "Fill values will be written to the dataset if the dataset’s fill time is set to H5D_FILL_TIME_IFSET or H5D_FILL_TIME_ALLOC."

  to something like:

         "Fill values will be written to the dataset if the dataset’s fill time is set to H5D_FILL_TIME_IFSET (and a fill value is defined) or H5D_FILL_TIME_ALLOC."

  I'll file a documentation bug for this, but H5Dset_extent() -> H5Dread() are following the correct "rules" for chunks with no data and no fill values written.

  BTW, the fill values aren't getting written to the chunks when you expand them in either case; you just happened to fill your array with 0's initially, so it looked like they were being written. I've attached an updated version of your example program that shows the issue more clearly, along with output both when a fill value is defined and when it is not.
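
  If you do want fills applied under the default fill time, define a fill value on the creation property list. A minimal sketch (illustrative; error checking omitted):

    #include "hdf5.h"

    /* Illustrative: a chunked dataset creation property list with an
     * explicit fill value, so the default H5D_FILL_TIME_IFSET has
     * something to apply to unwritten chunks. */
    static hid_t make_dcpl_with_fill(void)
    {
        hsize_t chunks[2] = {2, 2};
        int     fill      = -1;  /* distinct from OS-supplied zeroes */
        hid_t   dcpl      = H5Pcreate(H5P_DATASET_CREATE);

        H5Pset_chunk(dcpl, 2, chunks);
        H5Pset_fill_value(dcpl, H5T_NATIVE_INT, &fill);
        /* Fill time is left at the default H5D_FILL_TIME_IFSET. */
        return dcpl;
    }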

  I hope this clarifies the issue,
    Quincey

chunk_glitch.c (2.41 KB)

no_fill.txt (7.24 KB)

with_fill.txt (7.24 KB)

···

On Jul 17, 2009, at 2:24 PM, Andrew Collette wrote:

   What you are seeing is "expected" \(just maybe not to you\! :\-\)\.

Because you haven't defined a fill value and haven't changed the fill time
from H5D_FILL_TIME_IFSET to H5D_FILL_TIME_ALLOC, the HDF5 library won't
touch your buffer when a chunk doesn't have any data written to it - you
haven't told it to.

Thanks for the clarification! So that I'm sure I understand this...
the expected behavior for chunked datasets with (1) no fill value
explicitly set, and (2) the fill time set to H5D_FILL_TIME_IFSET
(the default), is that portions of the destination buffer
corresponding to uninitialized chunks are not touched?

There are still some behaviors which confuse me... I've attached an
(even simpler) C program. Apologies for the long email.

1. For both contiguous and chunked datasets, reading from an
uninitialized dataset (i.e. just created) returns 0 for every element.
2. For chunked datasets, after writing anything at all to any portion
of the dataset, the behavior you described kicks in; portions of the
buffer corresponding to uninitialized chunks are not touched.
3. For contiguous datasets, a default fill value of 0 continues to be
provided for uninitialized sections, no matter what I do.
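
In code, observations 1 and 2 look roughly like this (an illustrative
sketch, not the attached program; error checking omitted):

    #include "hdf5.h"
    #include <string.h>

    int main(void)
    {
        hsize_t dims[2] = {4, 4}, chunks[2] = {2, 2};
        hsize_t one[2]  = {1, 1}, start[2]  = {0, 0};
        int     buf[16], x = 7;

        hid_t file  = H5Fcreate("obs.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
                                H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunks);
        hid_t dset  = H5Dcreate2(file, "d", H5T_NATIVE_INT, space,
                                 H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Observation 1: reading a just-created dataset gives all zeroes. */
        memset(buf, 0xFF, sizeof(buf));
        H5Dread(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        /* Write a single element, touching only one 2x2 chunk. */
        hid_t fspace = H5Dget_space(dset);
        hid_t mspace = H5Screate_simple(2, one, NULL);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, one, NULL);
        H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, &x);

        /* Observation 2: the regions of buf that map to the three
         * never-written chunks keep their 0xFF bytes after this read. */
        memset(buf, 0xFF, sizeof(buf));
        H5Dread(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
        H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }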

I don't understand why the behavior for regions of the dataset which
haven't been explicitly initialized is so radically different for the
chunked and contiguous cases. If I'm reading from a chunked dataset
to a destination selection, the parts of the selection corresponding
to uninitialized chunks in the file will be silently skipped. For
contiguous datasets, these sections always have the user-defined fill
value, or 0.

I can't really think of a case in which the current skipping behavior
is beneficial, considering it applies to an arbitrary (how do I tell
what chunks are "real"?) subset of the destination buffer. This
becomes a problem in the case of complex selections, where it isn't
feasible to memset the destination selection before reading. I don't
understand why when I ask for a selection from a dataset, HDF5 would
ever skip any of it. If I wanted to leave part of my buffer
unmodified, I can simply not select it!

For a concrete example, let's say I have an existing 16-element array
in memory containing some data:

XXXX XXXX XXXX XXXX

Now I want to update the first 8 elements (of 16) by reading from a
dataset. Coincidentally, the person who created the dataset only
wrote to the first 4 elements ("." is an unwritten element) and did
not explicitly set a fill value:

YYYY .... .... ....

When the dataset has contiguous storage, this is the result:

YYYY 0000 XXXX XXXX

When the dataset has chunked storage (and the default options), this
is the result:

YYYY XXXX XXXX XXXX

However, if I update the first 8 elements from a *completely
uninitialized dataset* (both contiguous and chunked) this is the
result:

0000 0000 XXXX XXXX

From the perspective of someone reading a dataset which has already
been created, how do I tell HDF5 that I always want the fill value (or
0) applied? Is there some way to set the "read-time" fill strategy?
How can I force the "contiguous-style" behavior when reading from a
chunked dataset created with the default options?

Thanks,
Andrew

chunk_glitch.c (1.78 KB)

Hi Andrew,

       What you are seeing is "expected" (just maybe not to you! :-).
Because you haven't defined a fill value and haven't changed the fill time
from H5D_FILL_TIME_IFSET to H5D_FILL_TIME_ALLOC, the HDF5 library won't
touch your buffer when a chunk doesn't have any data written to it - you
haven't told it to.

Thanks for the clarification! So that I'm sure I understand this...
the expected behavior for chunked datasets with (1) no fill value
explicitly set, and (2) the fill time set to H5D_FILL_TIME_IFSET
(the default), is that portions of the destination buffer
corresponding to uninitialized chunks are not touched?

There are still some behaviors which confuse me... I've attached an
(even simpler) C program. Apologies for the long email.

1. For both contiguous and chunked datasets, reading from an
uninitialized dataset (i.e. just created) returns 0 for every element.

  We should probably make this consistent for the chunked dataset case - an uninitialized dataset should not change the buffer. (I'm not arguing whether this is correct, just that the two cases should be consistent.)

2. For chunked datasets, after writing anything at all to any portion
of the dataset, the behavior you described kicks in; portions of the
buffer corresponding to uninitialized chunks are not touched.

  This is "expected" behavior.

3. For contiguous datasets, a default fill value of 0 continues to be
provided for uninitialized sections, no matter what I do.

  The HDF5 library doesn't write those 0's to the uninitialized sections of the dataset; they are set to 0 by the operating system.

I don't understand why the behavior for regions of the dataset which
haven't been explicitly initialized is so radically different for the
chunked and contiguous cases. If I'm reading from a chunked dataset
to a destination selection, the parts of the selection corresponding
to uninitialized chunks in the file will be silently skipped. For
contiguous datasets, these sections always have the user-defined fill
value, or 0.

  Again - the HDF5 library isn't being inconsistent for the contiguous datasets; the file system is filling in those 0 values. You will see the same behavior for partially written chunks (without fill values).

I can't really think of a case in which the current skipping behavior
is beneficial, considering it applies to an arbitrary (how do I tell
what chunks are "real"?) subset of the destination buffer. This
becomes a problem in the case of complex selections, where it isn't
feasible to memset the destination selection before reading. I don't
understand why when I ask for a selection from a dataset, HDF5 would
ever skip any of it. If I wanted to leave part of my buffer
unmodified, I can simply not select it!

  I understand your frustration with this behavior, but what is the HDF5 library supposed to do? You haven't given a fill value and you've left the "ifset" fill-time behavior, so the library doesn't have any values for that chunk - there's literally nothing to give you. You just can't detect the problem for contiguous datasets...

For a concrete example, let's say I have an existing 16-element array
in memory containing some data:

XXXX XXXX XXXX XXXX

Now I want to update the first 8 elements (of 16) by reading from a
dataset. Coincidentally, the person who created the dataset only
wrote to the first 4 elements ("." is an unwritten element) and did
not explicitly set a fill value:

YYYY .... .... ....

When the dataset has contiguous storage, this is the result:

YYYY 0000 XXXX XXXX

When the dataset has chunked storage (and the default options), this
is the result:

YYYY XXXX XXXX XXXX

However, if I update the first 8 elements from a *completely
uninitialized dataset* (both contiguous and chunked) this is the
result:

0000 0000 XXXX XXXX

From the perspective of someone reading a dataset which has already
been created, how do I tell HDF5 that I always want the fill value (or
0) applied? Is there some way to set the "read-time" fill strategy?
How can I force the "contiguous-style" behavior when reading from a
chunked dataset created with the default options?

  I certainly understand your desire to address this issue, particularly since, as you say, there's no way for applications to determine which chunks are allocated.

  Here are the most obvious options that occur to me:

    1 - Leave things alone - the application queries for the fill-value and the fill-time, and if the combination indicates that there could be an issue with missing chunks, the application pre-fills the buffer with the fill-value of its choice (see the sketch after this list). I don't like this choice very much, since an application would have to be conservative and may end up doing a lot of work for no benefit (if it pre-fills and then all the chunks do exist).

    2 - Make H5Dread() return an error when attempting to read a non-existent chunk if there's no fill value available. This would break existing programs, so I'm just including it for completeness; I don't think we should do this.

    3 - Make a new dataset transfer property for filling in the values of missing chunks (which is similar to, but not the same as the role played by fill values). This could be taken a step further and stored with the dataset's metadata, so it persists from application to application. I think this might be too subtle of a distinction for most application developers/users.

    4 - Change how fill values operate, so that the contiguous dataset's behavior is mimicked (with zeroes used on a read of missing chunks when there's no fill value and the fill-time is "ifset"). I'm a little reluctant about this solution for two reasons: the zeroes are just an arbitrary choice for operating systems to use for unwritten bytes in a file, and we'd be [partially] modifying existing behavior (although maybe the existing behavior is a bug?).
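
  For option #1, note that the pre-fill doesn't have to touch the whole buffer: H5Dfill() can apply a fill value to just the selected elements of a memory buffer. A rough sketch (names illustrative; error checking omitted):

    #include "hdf5.h"

    /* Sketch of option #1 from the application side; "dset", "memspace",
     * "filespace" and "buf" are illustrative. */
    static void prefill_then_read(hid_t dset, hid_t memspace,
                                  hid_t filespace, int *buf)
    {
        hid_t            dcpl = H5Dget_create_plist(dset);
        H5D_fill_value_t status;
        H5D_fill_time_t  fill_time;
        int              fill = 0;   /* library default is all zeroes */

        H5Pfill_value_defined(dcpl, &status);
        H5Pget_fill_time(dcpl, &fill_time);
        if (status == H5D_FILL_VALUE_USER_DEFINED)
            H5Pget_fill_value(dcpl, H5T_NATIVE_INT, &fill);

        /* The buffer can be left untouched when the fill time is "never",
         * or when it's "ifset" with no fill value defined; pre-fill just
         * the selected elements in those cases. */
        if (fill_time == H5D_FILL_TIME_NEVER ||
            (fill_time == H5D_FILL_TIME_IFSET &&
             status != H5D_FILL_VALUE_USER_DEFINED))
            H5Dfill(&fill, H5T_NATIVE_INT, buf, H5T_NATIVE_INT, memspace);

        H5Dread(dset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, buf);
        H5Pclose(dcpl);
    }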

  I think I'm leaning toward option #4, but are there any opinions from others on the forum?

    Quincey

···

On Jul 20, 2009, at 4:18 PM, Andrew Collette wrote:

Replying to myself after talking to Elena... :-) One more option to add to yesterday's list:

    5 - Track the elements in the dataset that have been written to by an application and provide that information to the application (as a selection). Possibly also use this information during I/O? This could use a lot of extra space in the file...

···

On Jul 20, 2009, at 5:08 PM, Quincey Koziol wrote:


Hi Quincey,

Thanks for such a detailed response! I think it is useful to
investigate how to address this issue, as the current behavior is
rather surprising. In fact, I first became aware of it after one of
my users started getting unexpected data when reading from a chunked
dataset.

Again - the HDF5 library isn't being inconsistent for the contiguous
datasets, the file system is filling in those 0 values. You will see the
same behavior for partially written chunks (without fill values).

Wow, I didn't realize this was the origin of the current behavior.
This seems especially undesirable; in the case of an absent fill
value, not only are parts of the buffer left uninitialized, but other
parts are overwritten with zeros from the OS. This makes it both
necessary to initialize the destination selection and impossible to
use any value other than 0.

    1 - Leave things alone - the application queries for the fill-value and the fill-time, and if the combination indicates that there could be an issue with missing chunks, the application pre-fills the buffer with the fill-value of its choice. I don't like this choice very much, since an application would have to be conservative and may end up doing a lot of work for no benefit (if it pre-fills and then all the chunks do exist).

Yes, this would require using H5Diterate or similar for arbitrary
selections, which seems like a lot of work, and slow for large
selections.

    2 - Make H5Dread() return an error when attempting to read a non-existent chunk if there's no fill value available. This would break existing programs, so I'm just including it for completeness; I don't think we should do this.

Agreed.

    4 - Change how fill values operate, so that the contiguous dataset's behavior is mimicked (with zeroes used on a read of missing chunks when there's no fill value and the fill-time is "ifset"). I'm a little reluctant about this solution for two reasons: the zeroes are just an arbitrary choice for operating systems to use for unwritten bytes in a file, and we'd be [partially] modifying existing behavior (although maybe the existing behavior is a bug?).

I think the fundamental issue is that HDF5 needs to define a default
fill value, and all-zeros seems like the most straightforward default.
Are there any supported operating systems which do not return zero
for reads like this? I also think there's exactly one case in which
elements in the destination buffer should be left unmodified during a
read, and that's when they're not selected.

It's possible that application behavior may change, but if that's the
case I can't imagine the current behavior is beneficial (or even
reproducible); it's like relying on uninitialized memory. Currently
there is literally no way to reliably predict what parts get data,
what parts get 0 as a side effect of partially-initialized chunks, and
which are left alone.

It would also have the interesting side effect that you could throw
away chunks which have a value of all zero.

    5 - Track the elements in the dataset that have been written to by an application and provide that information to the application (as a selection). Possibly also use this information during I/O? This could use a lot of extra space in the file...

This also seems to have the same drawbacks as (1) if exposed to the user.

Thanks again,
Andrew

Hi Andrew,

Hi Quincey,

Thanks for such a detailed response! I think it is useful to
investigate how to address this issue, as the current behavior is
rather surprising. In fact, I first became aware of it after one of
my users started getting unexpected data when reading from a chunked
dataset.

  Yes, we've received a few bug reports about this issue over the years and it would be nice to resolve it in a less-surprising way. :-)

               4 - Change how fill values operate, so that the contiguous
dataset's behavior is mimicked (with zeroes used on a read of missing chunks
when there's no fill value and the fill-time is "ifset"). I'm a little
reluctant about this solution for two reasons: the zeroes are just an
arbitrary choice for operating systems to use for unwritten bytes in a file,
and we'd be [partially] modifying existing behavior (although maybe the
existing behavior is a bug?).

I think the fundamental issue is that HDF5 needs to define a default
fill value, and all-zeros seems like the most straightforward default.
Are there any supported operating systems which do not return zero
for reads like this? I also think there's exactly one case in which
elements in the destination buffer should be left unmodified during a
read, and that's when they're not selected.

  Well, there is a default fill-value (all zeroes), but we just chose not to store it by default (i.e. the default fill-time is "ifset" rather than "alloc") for performance reasons. POSIX file systems require that unwritten bytes in a file act as if zeroes were written there, and pretty much all we support now are POSIX systems. It used to be that Windows would return whatever bytes were stored on disk instead of zeroes, but all the current versions now return zeroes.
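
  To illustrate the POSIX guarantee, here's a small stand-alone sketch (plain POSIX, no HDF5; error checking omitted) - the unwritten "hole" reads back as zeroes:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int  fd = open("hole_demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
        char buf[12];

        /* Write 4 bytes at offset 12; bytes 0-11 are never written. */
        pwrite(fd, "DATA", 4, 12);

        /* POSIX requires the unwritten region to read as zeroes. */
        memset(buf, 0xFF, sizeof(buf));
        pread(fd, buf, sizeof(buf), 0);
        for (size_t i = 0; i < sizeof(buf); i++)
            printf("%02x ", (unsigned char)buf[i]);   /* 00 00 ... 00 */
        putchar('\n');

        close(fd);
        unlink("hole_demo.bin");
        return 0;
    }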

It's possible that application behavior may change, but if that's the
case I can't imagine the current behavior is beneficial (or even
reproducible); it's like relying on uninitialized memory. Currently
there is literally no way to reliably predict what parts get data,
what parts get 0 as a side effect of partially-initialized chunks, and
which are left alone.

  Yes, I agree - it's a problem that we need to solve for users.

It would also have the interesting side effect that you could throw
away chunks which have a value of all zero.

  Yes, that would be a nice optimization, but it could require scanning each chunk when it was written, which would be a performance issue. It might make a nice option for h5repack though...

               5 - Track the elements in the dataset that have been written to by an application and provide that information to the application (as a selection). Possibly also use this information during I/O? This could use a lot of extra space in the file...

This also seems to have the same drawbacks as (1) if exposed to the user.

  Yes, it could require a lot of work from the application developer.

  After talking to Elena yesterday and thinking about this a bit more, I'm leaning toward the following solution, which is a variant of option #3 that I sent out yesterday:

  3' - Make a new dataset transfer property for filling in the values of missing chunks (which is similar to, but not the same as the role played by fill values). The property could have the following values:
    FAIL_IF_MISSING_CHUNKS - Make H5Dread() fail if there's a missing chunk.
    STORE_ZERO_FOR_MISSING_CHUNKS - Store zeroes in the memory buffer when there's a missing chunk.
    STORE_FILL_VALUE_FOR_MISSING_CHUNKS - Store the dataset's fill-value in the memory buffer when there's a missing chunk (regardless of the fill-time setting).
    STORE_NOTHING_IF_MISSING_CHUNKS - Continue current behavior of not modifying the memory buffer when there's a missing chunk.
    STORE_MISSING_VALUE_FOR_MISSING_CHUNKS - Store a "missing value" (set with another property list API routine) in the memory buffer when there's a missing chunk.

  Our current behavior is (obviously) STORE_NOTHING_IF_MISSING_CHUNKS, but it's fairly easy to argue that this is a bug and the new default behavior ought to be STORE_ZERO_FOR_MISSING_CHUNKS, which would be consistent with how contiguous datasets work on a POSIX-compliant file system. I mentioned in my previous message that this might be persisted as a dataset creation property, but I don't think that's a good idea, since it is affecting the memory buffer of future applications, not the information stored in the file.

  I think the distinction between fill-values and missing-values is that fill-values [generally] operate during writing and missing-values operate during reading. I.e. fill-values answer the question "What value do I store in the file when I've allocated space but not written to it?" and missing-values answer the question "What value do I store in an application's memory buffer when there are no values to write to it?"
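
  To make that concrete, usage might look something like the following - purely hypothetical, since neither the routine nor the value below exists in the library today:

    /* Purely hypothetical: H5Pset_missing_chunk_behavior() and the
     * STORE_* value below are proposals, not current HDF5 API. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

    H5Pset_missing_chunk_behavior(dxpl, STORE_FILL_VALUE_FOR_MISSING_CHUNKS);
    H5Dread(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);
    H5Pclose(dxpl);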

  Any opinions about this? (Especially from others on the forum! :-)

  Quincey

···

On Jul 20, 2009, at 7:01 PM, Andrew Collette wrote:

Hi Quincey,

The more I work with fill values the less I understand them. :-) I
don't mean to monopolize the discussion here, either. Who else has
suggestions?

       I think the distinction between fill-values and missing-values is
that fill-values [generally] operate during writing and missing-values
operate during reading. I.e. fill-values answer the question "What value do
I store in the file when I've allocated space but not written to it?" and
missing-values answer the question "What value do I store in an
application's memory buffer when there are no values to write to it?"

I think there's a language ambiguity here which is tripping me up. It
seems like a "fill value" should mean the value returned from a
dataset when no data has been written to that location, under all
circumstances. It may be that this is impossible to achieve for
performance reasons in HDF5. As I understand it, currently fill
values are handled by physically writing the value when the space on
disk is allocated... is this correct? This would seem to make it
impossible to handle fill values for large datasets, at least for the
contiguous case.

   3' - Make a new dataset transfer property for filling in the values of missing chunks (which is similar to, but not the same as the role played by fill values). The property could have the following values:

For chunked datasets, you have the advantage that you know what parts
of the dataset are "real". It seems to me that the simplest way to
handle this would be to extend the definition of fill values to
include read-time filling as well. If there's a fill value X, then
"blank" parts of partially initialized chunks will have X already; you
should obviously then use X for missing chunks. If there's no fill
value, then these parts will be set to 0 by the OS; in that case,
missing chunks should be initialized to 0 to match.

    FAIL_IF_MISSING_CHUNKS - Make H5Dread() fail if there's a missing chunk.
    STORE_ZERO_FOR_MISSING_CHUNKS - Store zeroes in the memory buffer when there's a missing chunk.
    STORE_FILL_VALUE_FOR_MISSING_CHUNKS - Store the dataset's fill-value in the memory buffer when there's a missing chunk (regardless of the fill-time setting).
    STORE_NOTHING_IF_MISSING_CHUNKS - Continue current behavior of not modifying the memory buffer when there's a missing chunk.
    STORE_MISSING_VALUE_FOR_MISSING_CHUNKS - Store a "missing value" (set with another property list API routine) in the memory buffer when there's a missing chunk.

The last two options here don't play well with partially-initialized
chunks which already have a fill value. It seems like you can simply
narrow down the behavior to the following:

1. No fill value defined, or fill time set to "never": use 0.
2. Fill value defined, and fill time is "ifset" or "alloc": use the fill value.
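
Expressed as code, the narrowed behavior is just the following (a
hypothetical helper sketching what the library would do internally for
a missing chunk; not existing API):

    #include "hdf5.h"

    /* Hypothetical helper: which value should land in the destination
     * buffer for a missing chunk? */
    static const void *missing_chunk_value(H5D_fill_value_t defined,
                                           H5D_fill_time_t ftime,
                                           const void *fill,
                                           const void *zero)
    {
        /* Rule 1: no fill value defined, or fill time "never": use 0. */
        if (defined != H5D_FILL_VALUE_USER_DEFINED ||
            ftime == H5D_FILL_TIME_NEVER)
            return zero;
        /* Rule 2: fill value defined, fill time "ifset"/"alloc": use it. */
        return fill;
    }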

This obviously won't work for contiguous datasets (since you have no
way of knowing what parts are not really on disk) but I think the
current behavior is well-enough established that it won't matter.

Thanks again,
Andrew

I'm afraid I don't have a very helpful suggestion. In parallel-netcdf
land, we decided arbitrary fill values were such a headache that we
only support a fill value of 0, which, as Quincey mentioned a few
emails back, is the POSIX-standard default fill value.

==rob

···

On Tue, Jul 21, 2009 at 12:47:05PM -0700, Andrew Collette wrote:

Hi Quincey,

The more I work with fill values the less I understand them. :-) I
don't mean to monopolize the discussion here, either. Who else has
suggestions?

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

Hi Andrew,

Hi Quincey,

The more I work with fill values the less I understand them. :-) I
don't mean to monopolize the discussion here, either. Who else has
suggestions?

      I think the distinction between fill-values and missing-values is
that fill-values [generally] operate during writing and missing-values
operate during reading. I.e. fill-values answer the question "What value do
I store in the file when I've allocated space but not written to it?" and
missing-values answer the question "What value do I store in an
application's memory buffer when there's no values to write to it?"

I think there's a language ambiguity here which is tripping me up. It
seems like a "fill value" should mean the value returned from a
dataset when no data has been written to that location, under all
circumstances. It may be that this is impossible to achieve for
performance reasons in HDF5. As I understand it, currently fill
values are handled by physically writing the value when the space on
disk is allocated... is this correct?

  Yes, that's correct.

This would seem to make it
impossible to handle fill values for large datasets, at least for the
contiguous case.

  It's not impossible, it just can take a long time... :-) That's why our defaults are the way they are ("ifset" fill-time and a default fill-value of all zeroes).

       3' - Make a new dataset transfer property for filling in the values
of missing chunks (which is similar to, but not the same as the role played
by fill values). The property could have the following values:

For chunked datasets, you have the advantage that you know what parts
of the dataset are "real". It seems to me that the simplest way to
handle this would be to extend the definition of fill values to
include read-time filling as well. If there's a fill value X, then
"blank" parts of partially initialized chunks will have X already; you
should obviously then use X for missing chunks. If there's no fill
value, then these parts will be set to 0 by the OS; in that case,
missing chunks should be initialized to 0 to match.

    FAIL_IF_MISSING_CHUNKS - Make H5Dread() fail if there's a missing chunk.
    STORE_ZERO_FOR_MISSING_CHUNKS - Store zeroes in the memory buffer when there's a missing chunk.
    STORE_FILL_VALUE_FOR_MISSING_CHUNKS - Store the dataset's fill-value in the memory buffer when there's a missing chunk (regardless of the fill-time setting).
    STORE_NOTHING_IF_MISSING_CHUNKS - Continue current behavior of not modifying the memory buffer when there's a missing chunk.
    STORE_MISSING_VALUE_FOR_MISSING_CHUNKS - Store a "missing value" (set with another property list API routine) in the memory buffer when there's a missing chunk.

The last two options here don't play well with partially-initialized
chunks which already have a fill value.

  Yes, that's true and it's another aspect of the "inconsistency" that we're trying to address here. Hmm...

It seems like you can simply narrow down the behavior to the following:

1. No fill value defined, or fill time set to "never": use 0.
2. Fill value defined, and fill time is "ifset" or "alloc": use the fill value.

This obviously won't work for contiguous datasets (since you have no
way of knowing what parts are not really on disk) but I think the
current behavior is well-enough established that it won't matter.

  Hmm, yes, I think you have a point, although I'd like to leave room for the existing behavior and for issuing an error, if requested. Here are my revised suggestions for values of the property:

    FAIL_IF_MISSING_CHUNKS - Make H5Dread() fail if there's a missing chunk.
    USE_FILL_VALUE_BEHAVIOR_FOR_MISSING_CHUNKS - As you outline above, which is how contiguous datasets work in a POSIX file system. This would become the default.
    STORE_NOTHING_IF_MISSING_CHUNKS - Continue current behavior of not modifying the memory buffer when there's a missing chunk.

  Quincey

···

On Jul 21, 2009, at 2:47 PM, Andrew Collette wrote:

Hi Quincey,

   Hmm, yes, I think you have a point, although I'd like to leave room
for the existing behavior and for issuing an error, if requested. Here
are my revised suggestions for values of the property:

    FAIL_IF_MISSING_CHUNKS - Make H5Dread() fail if there's a missing chunk.
    USE_FILL_VALUE_BEHAVIOR_FOR_MISSING_CHUNKS - As you outline above, which is how contiguous datasets work in a POSIX file system. This would become the default.
    STORE_NOTHING_IF_MISSING_CHUNKS - Continue current behavior of not modifying the memory buffer when there's a missing chunk.

This sounds fine. Thanks for investing time in this!

Andrew