first non-fill-value in the sparse chunked dataset

Hi,

I am using a sparse chunked dataset with a certain fill value. I'd like to find the first non-fill-value element in the dataset. Can I narrow down my search to the first available chunk? How can I do it?

Thank you,
Efim Dyadkin
------------------- This e-mail, including any attached files, may contain confidential and privileged information for the sole use of the intended recipient. Any review, use, distribution, or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive information for the intended recipient), please contact the sender by reply e-mail and delete all copies of this message.

The "first non-fill-value" in which order? (chronological, C-order, ...)

Short answer: No chance.

Slightly longer: (Apart from H5DOwrite_chunk...) There is currently no API that
gives you direct control over/introspection into chunks. You can control certain
aspects of chunk allocation time and policy (via dataset creation properties),
but the rest is pretty opaque and a side-effect of H5D[read,write].
I think you have at least two options:

1. Create an auxiliary structure where you maintain that type of log information.
   (This is dangerous/illusory because you'll be making assumptions about how the
    HDF5 library writes/updates chunks, and what happens in the underlying storage.)

2. Create a proper sparse structure and don't use chunking to mimic one.
   (You might still struggle with the definition of 'first.')

G.
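
A minimal sketch of what option 2 could look like, assuming the data consists of fixed-length rows keyed by an integer index. The class name `PackedRows` and its methods are hypothetical; plain Python lists stand in for what might be two ordinary datasets in a file (one holding the sorted logical indices, one holding the packed rows). With such a layout, "first" simply means the smallest stored index.

```python
import bisect


class PackedRows:
    """Hypothetical sparse layout: only populated rows are stored.

    `index` holds the logical row indices in ascending order and `rows`
    holds the matching fixed-length rows. Lists stand in for datasets here.
    """

    def __init__(self, row_length):
        self.row_length = row_length
        self.index = []  # sorted logical row indices
        self.rows = []   # rows[i] belongs to logical row index[i]

    def write_row(self, logical_index, values):
        if len(values) != self.row_length:
            raise ValueError("row has the wrong length")
        pos = bisect.bisect_left(self.index, logical_index)
        if pos < len(self.index) and self.index[pos] == logical_index:
            self.rows[pos] = list(values)  # overwrite an existing row
        else:
            self.index.insert(pos, logical_index)
            self.rows.insert(pos, list(values))

    def first_populated(self):
        """Return the smallest populated logical index, or None if empty."""
        return self.index[0] if self.index else None

    def read_row(self, logical_index, fill_value=0):
        """Return the stored row, or a row of fill values if none exists."""
        pos = bisect.bisect_left(self.index, logical_index)
        if pos < len(self.index) and self.index[pos] == logical_index:
            return list(self.rows[pos])
        return [fill_value] * self.row_length
```

The trade-off is that every read goes through the index lookup, but the populated range is always known without scanning for fill values.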


Sorry, I should have specified what "first" means. I have a 2D dataset whose slower dimension is sparse and unlimited, and whose fast dimension is non-sparse and of fixed length. Typically, data is first written in the "middle" of the slower dimension and then grows incrementally in either direction (to the left and to the right). I need to keep track of the current bounding box so that I only access the populated part of the dataset.

The upper boundary of the slower dimension is basically the extent of the dataset, so I do not need to store it on my own. As for the lower boundary, I hoped I could find it by getting access to the first available chunk, the one with the smallest index along the slower dimension.

I think exposing at least a boolean grid of existing chunks would be helpful for sparse data handling.

Thanks,

Efim
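
To make the "boolean grid of existing chunks" idea concrete, here is a rough sketch in plain Python. It assumes the element offsets of the allocated chunks are already known; how to obtain them is left open (later HDF5 releases added chunk query calls such as `H5Dget_num_chunks` and `H5Dget_chunk_info`, which may be one way). The function names `chunk_grid` and `lower_bound_row` are hypothetical.

```python
def chunk_grid(chunk_offsets, dataset_shape, chunk_shape):
    """Build a boolean grid marking which chunks exist.

    chunk_offsets: iterable of (row, col) element offsets of allocated
    chunks (assumed to be supplied by the caller).
    """
    # Ceiling division gives the number of chunks along each dimension.
    n_rows = -(-dataset_shape[0] // chunk_shape[0])
    n_cols = -(-dataset_shape[1] // chunk_shape[1])
    grid = [[False] * n_cols for _ in range(n_rows)]
    for row_off, col_off in chunk_offsets:
        grid[row_off // chunk_shape[0]][col_off // chunk_shape[1]] = True
    return grid


def lower_bound_row(grid, chunk_rows):
    """Return the first element row of the lowest existing chunk, or None."""
    for i, row in enumerate(grid):
        if any(row):
            return i * chunk_rows
    return None
```

Note that this bound has chunk granularity only: the first non-fill element lies somewhere inside the chunk that starts at the returned row, so a scan of that one chunk would still be needed.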


Efim,

Can you simply add a scalar integer attribute that keeps track of the lower
bound index value of the slower dimension? Just update this attribute
every time you write to the dataset, or at least every time the lower
bound goes lower. This would be an application-level solution, rather than
something provided by the library.

This resembles a minimal version of Gerd's suggestion #1.

--Dave
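
A minimal sketch of this bookkeeping, assuming the attribute store behaves like a Python mapping (a plain `dict` is used below; h5py's `Dataset.attrs` offers a similar dict-like interface). The function name `update_lower_bound` and the attribute name `lower_bound` are hypothetical.

```python
def update_lower_bound(attrs, first_row_written, name="lower_bound"):
    """Lower the stored bound if this write starts below it.

    attrs: a dict-like attribute store (assumption: supports get and
    item assignment). Returns the bound after the update.
    """
    current = attrs.get(name)
    if current is None or first_row_written < current:
        attrs[name] = first_row_written
        return first_row_written
    return current


# Usage: call once per write, passing the first row index of that write.
attrs = {}
update_lower_bound(attrs, 5000)  # first write, in the "middle"
update_lower_bound(attrs, 6000)  # growth to the right leaves the bound alone
update_lower_bound(attrs, 4200)  # growth to the left lowers it
```

The attribute is only as reliable as the discipline of calling the update on every write path, which is the caveat raised under option 1.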


Thank you Gerd and Dave.

Solution #1 is okay for my current task. Ultimately, however, for the performance of my app, I would like to visit only those areas of the sparse dataset where data really exists. From your answers and the documentation I gather that this information is available at chunk granularity in the B-tree but is apparently not exposed in the API.
