How to write callback for H5Pset_type_conv_cb to quietly replace out-of-range values?

kmccarty · October 12, 2017, 5:45pm

Hi list,

I am doing some work that should convert integer datasets
automatically from a larger integer type in memory, to a smaller
integer type on disk, when appropriate.

To give a concrete example: I might have code that converts a uint32_t
dataset in memory to a uint16_t dataset on disk if it turns out that
the values in the in-memory dataset all can be expressed losslessly in
16 bits.

The problem is that I wish to allow for the possibility of one
specific value that does *not* fit in 16 bits, which however I'd like
to translate to a suitable 16-bit replacement value on disk. That is:

    if (memValue == (uint32_t)(-1))
        diskValue = (uint16_t)(-1); /* Quietly replace all instances of
                                       4294967295 in RAM with 65535 on disk */

It seems clear that to get this automatic replacement, I need to
write a callback to pass to H5Pset_type_conv_cb() that catches
overflows and quietly translates the out-of-range value to the
desired replacement value. What I don't understand is what code
should go in the body of the callback function to do this. (Feel free
to assume that the only out-of-range value that might occur is the
specific value I wish to translate.)

I've not been able to find any examples showing how to write such a
callback function online. Advice would be greatly appreciated!

Thanks in advance,

--
Kevin B. McCarty
<kmccarty@gmail.com>

miller86 · October 12, 2017, 6:06pm

Hmmm. Do you care about whether the HDF5 dataset's type in the file shows as "uint32_t" for example? Or, do you simply care that you are not wasting space storing an array of 16 bit values using a 32 bit type?

If you DO NOT CARE about the HDF5 dataset's type, my first thought would be to handle this as a filter instead of a type_cb callback. Have you considered that?

  * https://support.hdfgroup.org/HDF5/doc/RM/RM_H5Z.html
  * https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFilter
  * https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/

The filter could handle *both* the type size change and the special value mapping much like any "compression" filter would.

Now, consumers of the dataset data in memory would still only ever think that the type in memory was a 32 bit type, but the data stored on disk would be 16 bits.

Now, if you really do want the dataset's type to be some 16 bit type so that things like h5ls, h5dump, H5Dget_type all return a known, 16 bit type, then yeah, a custom type conversion is probably the way to go. But note that it will still appear to be a *custom* type to HDF5 and not a built-in 16 bit type. Also, I don't think type conversion can be handled as a 'plugin' the way filters are (HDF5 will not, at least I don't think it will, load custom type conversion code from some plugin), so anyone reading that data would also need to link with your implementation of that type conversion callback.

Hope that helps.

Mark

miller86 · October 12, 2017, 6:18pm

You know...I think my response here is a bit confused.

Here's what I am getting at: an HDF5 dataset has a datatype associated with it, and that type is determined when the dataset is created. The question is whether you want such datasets to *always* be seen as uint32_t, even if they are stored on disk as 16 bit, or whether you plan to have the dataset's type determined by whether the data actually fits in 16 bits. Obviously, in the latter case the caller has to take some special action to select the datatype when creating the dataset to write to. In the former, the caller always creates what it thinks are 32 bit datasets and writes the data from memory to them; then magic happens, and only 16 bit data is stored in the file if the data written all fits in 16 bits.

Hope that makes a tad more sense.

Mark

kmccarty · October 12, 2017, 7:17pm

Thanks for your replies, Mark!

So what I'd like to have happen is as follows (still using the case of
a uint32_t dataset in memory).

* When the dataset has values >= 2^16-1 (== 65535) that are *not*
equal to 2^32-1, the dataset will just be saved to disk also as a
bog-standard uint32_t dataset.

* When the dataset has ONLY values < 2^16-1, except perhaps for values
equal to 2^32-1, the dataset will be saved to disk as a bog-standard
uint16_t dataset, where any instances of 2^32-1 in RAM get translated
to instances of 2^16-1 on disk.

That is, later users that look at the HDF5 file in the second case
will see a 16-bit dataset, and may see values of 2^16-1 ... that's
fine, there is no need for those users to see 2^32-1 instead. I do a
pre-check and switch to determine which of these cases to fall into.
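
The pre-check described above could look roughly like the following. This is only a sketch of the rule as stated in this post; the function name is made up for illustration and is not part of HDF5.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an HDF5 call): returns true when every element
 * is either strictly below 65535 or equal to the 32-bit "no-data" value
 * 4294967295, i.e. when the buffer may be stored as uint16_t on disk. */
static bool fits_in_u16_with_sentinel(const uint32_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* 65535 itself is excluded: on disk it is reserved for "no-data". */
        if (data[i] != UINT32_MAX && data[i] >= UINT16_MAX)
            return false;
    }
    return true;
}
```

The caller would then create the dataset with a 16-bit file type when this returns true and with a 32-bit file type otherwise, while always passing the 32-bit memory type to the write call.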

After some more research online, I found this document:

https://support.hdfgroup.org/HDF5/doc/Supplements/dtype_conversion/Conversion.html

and I see that it says if there is no "conversion exception" callback
defined, then any conversion between integers is going to be a "hard
conversion" which acts just as if one wrote

    out_type out = (out_type)in;

So I think I actually don't need to do anything special to take care
of the conversion! ... since the code "uint16_t out = (uint16_t)in"
would translate 2^32-1 to 2^16-1 automatically, due to the modulus
properties of unsigned arithmetic.
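
As a small stand-alone check of that arithmetic (plain C, no HDF5 involved), the sketch below compares a wrap-around cast with a clamping conversion. The helper names are illustrative only, and which of the two behaviours the library applies to an overflow is not verified here. For the sentinel it makes no difference, since both give 65535; for any other out-of-range value they disagree, which is why the pre-check matters.

```c
#include <stdint.h>

/* Wrap-around narrowing: what a plain C cast does (value modulo 2^16). */
static uint16_t narrow_by_cast(uint32_t in)
{
    return (uint16_t)in;
}

/* Saturating narrowing: anything above 65535 is clamped to 65535. */
static uint16_t narrow_by_clamp(uint32_t in)
{
    return in > UINT16_MAX ? (uint16_t)UINT16_MAX : (uint16_t)in;
}
```

For example, 4294967295 becomes 65535 under both functions, whereas 65536 becomes 0 under the cast and 65535 under the clamp.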

Best regards,
Kevin

miller86 · October 12, 2017, 10:20pm

Sure. I am curious...what is the driver for this mode of operation?

One situation I am thinking it could be is where special/sentinel values are used within a buffer of data to indicate special meaning such as 'data-not-present' or 'value-missing'. Often, such values are chosen to be +/- INT_MAX or +/- DBL_MAX, and their presence can then prevent down-casting to a smaller (fewer bits) type. Is that what is going on for you?

Finally, another thing that occurred to me was setting the dataset 'fill value', https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_fill_value.htm though I am not sure what the restrictions are on the data type for the fill value relative to the data type for the dataset.

Mark

kmccarty · October 12, 2017, 10:36pm

Yes, precisely. In our case we use 2^32-1 as a "no-data" value for
uint32_t datasets, 2^16-1 as a "no-data" value for uint16_t datasets,
etc.

Thanks again for your help!
Kevin

koziol · October 13, 2017, 2:56pm

Hi Kevin,
I think you have mostly addressed your concerns, but I wanted to add that I think H5Pset_type_conv_cb() would be the correct routine to use, if necessary. If that _doesn’t_ do what you want, please reply here. (BTW, there are examples of using H5Pset_type_conv_cb() in the test directory of the release tarballs.)
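
For readers who do need the callback, here is a minimal sketch of what its body might look like for the uint32_t to uint16_t case discussed in this thread. It is not taken from the HDF5 test directory. The names replace_sentinel_u32_to_u16, sentinel_conv_cb, write_with_sentinel_cb and the WITH_HDF5 build flag are made up for illustration, and the sketch assumes the source and destination elements are in native byte order (for example H5T_NATIVE_UINT32 in memory and H5T_NATIVE_UINT16 in the file).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical pure helper: if the source element is the 32-bit "no-data"
 * value, store the 16-bit one in the destination and return 1; otherwise
 * leave the destination untouched and return 0. memcpy avoids alignment
 * assumptions about the buffers. */
static int replace_sentinel_u32_to_u16(const void *src_buf, void *dst_buf)
{
    uint32_t in;
    memcpy(&in, src_buf, sizeof in);
    if (in == UINT32_MAX) {
        const uint16_t out = UINT16_MAX;
        memcpy(dst_buf, &out, sizeof out);
        return 1;
    }
    return 0;
}

#ifdef WITH_HDF5 /* hypothetical flag: define it when building against HDF5 */
#include <hdf5.h>

/* Conversion exception callback; called once per offending element. */
static H5T_conv_ret_t
sentinel_conv_cb(H5T_conv_except_t except_type, hid_t src_id, hid_t dst_id,
                 void *src_buf, void *dst_buf, void *user_data)
{
    (void)src_id; (void)dst_id; (void)user_data;
    if (except_type == H5T_CONV_EXCEPT_RANGE_HI &&
        replace_sentinel_u32_to_u16(src_buf, dst_buf))
        return H5T_CONV_HANDLED;   /* destination already filled in */
    return H5T_CONV_UNHANDLED;     /* let the library apply its default */
}

/* Register the callback on a transfer property list and write with it. */
static herr_t write_with_sentinel_cb(hid_t dset, hid_t mem_space,
                                     hid_t file_space, const uint32_t *data)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    if (dxpl < 0)
        return -1;
    herr_t status = H5Pset_type_conv_cb(dxpl, sentinel_conv_cb, NULL);
    if (status >= 0)
        status = H5Dwrite(dset, H5T_NATIVE_UINT32, mem_space, file_space,
                          dxpl, data);
    H5Pclose(dxpl);
    return status;
}
#endif /* WITH_HDF5 */
```

Returning H5T_CONV_ABORT instead of H5T_CONV_UNHANDLED for unexpected out-of-range values would make the write fail rather than silently store a default, which may be preferable if the pre-check is supposed to guarantee that only the sentinel overflows.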

Quincey
