Hi,

I have a set of one-dimensional chunked datasets of modest size (larger than available memory). I am looking for the most efficient way, mostly in terms of time, to insert new values into a dataset. A representative example of what I am trying to do: I have a dataset of 10 billion sorted doubles and an in-memory vector of 1000 random (sorted) doubles, and I want to insert the values of the vector into the dataset. One way would be to write out a new dataset, reading the larger one and merging in the vector values as they come up. I could improve on this by carefully writing into the existing dataset, but it would still involve a lot of data movement. I was hoping that, because the dataset is chunked, there may be other ways to accomplish the insertion. Thanks for any suggestions.
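The read-merge-write approach described above can be sketched as follows. This is a minimal, illustrative sketch only: plain Python lists stand in for the on-disk chunks, and the actual HDF5 read/write calls (e.g. via h5py hyperslab selections) are omitted. The function and parameter names are made up for the example.

```python
import heapq

def merge_insert(read_chunks, new_values, write_chunk):
    """Stream-merge a small sorted vector into a large sorted sequence.

    read_chunks : iterable yielding the big dataset's sorted chunks in order
    new_values  : small sorted sequence held entirely in memory
    write_chunk : callback that receives each merged output chunk
    """
    CHUNK = 4  # toy chunk size; a real dataset would use thousands of elements
    buf = []
    # Flatten the chunk stream lazily so only one chunk is resident at a time.
    big = (x for chunk in read_chunks for x in chunk)
    for v in heapq.merge(big, new_values):
        buf.append(v)
        if len(buf) == CHUNK:
            write_chunk(buf)
            buf = []
    if buf:  # flush the final partial chunk
        write_chunk(buf)

# Toy usage: two "on-disk" chunks, two values to insert.
dataset = [[1.0, 3.0, 5.0, 7.0], [9.0, 11.0, 13.0, 15.0]]
inserts = [4.0, 10.0]
out = []
merge_insert(iter(dataset), inserts, out.extend)
# out is now the fully merged sequence: 1, 3, 4, 5, 7, 9, 10, 11, 13, 15
```

The whole pass is O(N) in the big dataset's size regardless of how few values are inserted, which is exactly the cost the question is trying to avoid.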

Matt

Quincey,

Thanks for the reply. My current solution involves breaking the data into smaller datasets and accepting the cost of rewriting the smaller sets. In effect, I swapped chunks for datasets and did the optimizations at the dataset level.
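That partitioning scheme can be sketched like this: keep many small sorted datasets, each covering a key range, and rewrite only the one a new value falls into. The class and method names here are illustrative, not Matt's actual code, and in-memory lists again stand in for the small HDF5 datasets.

```python
import bisect

class PartitionedStore:
    """Toy model of 'chunks swapped for datasets': many small sorted
    datasets, each rewritten wholesale when a value lands in it."""

    def __init__(self, datasets):
        # datasets: list of sorted lists; routing keys are the first elements
        self.datasets = datasets
        self.lows = [d[0] for d in datasets]

    def insert(self, value):
        # Route to the rightmost dataset whose first element <= value.
        i = max(bisect.bisect_right(self.lows, value) - 1, 0)
        ds = self.datasets[i]
        bisect.insort(ds, value)   # rewrite cost is O(len(ds)), not O(total)
        self.lows[i] = ds[0]       # keep the routing key current

store = PartitionedStore([[1, 4, 7], [10, 13], [20, 25]])
store.insert(11)   # rewrites only the middle dataset
store.insert(0)    # rewrites only the first dataset
```

Each insert touches one small dataset, so the rewrite cost drops from the full 10 billion elements to the size of a single partition.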

I would be interested in at least knowing, in theory, how to patch the source code to implement this, and would be more than willing to share the outcome of any such work. Knowing where to begin would be a big help.

Matt


On Wed, Apr 25, 2012 at 7:25 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi Matt,

On Apr 11, 2012, at 7:23 AM, Matt Calder wrote:

> […]

The HDF5 library doesn't currently perform this sort of insertion on chunked datasets, although it's technically feasible. If you'd like to work on algorithms to add that sort of operation, we'd be happy to guide you through the source code to create a patch that adds the capability. Alternatively, if you'd like to fund this activity, that's also possible.

Quincey

_______________________________________________

Hdf-forum is for HDF software users discussion.

Hdf-forum@hdfgroup.org

http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hi Matt,

> Quincey,
>
> Thanks for the reply. My current solution involves breaking the data into smaller datasets, and accepting the cost of rewriting the smaller sets. In effect, I swapped chunks for datasets and did the optimizations at the dataset level.

That'll definitely work.

> I would be interested in at least knowing in theory how to patch the source code to implement this and would be more than willing to share the outcome of any such work. Knowing where to begin would be a big help.

You'll need to look at the code in src/H5Dchunk.c and src/H5Dbtree.c where the chunks are operated on. However, first, can you write up a short description of exactly what the feature you are planning to add would do, and the interface you'd like to have for that functionality and send it to me? That'll help guide how the feature should be implemented. Feel free to email me off-list at: koziol@hdfgroup.org.
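To make the kind of operation such a patch would implement concrete, here is a toy sketch of inserting into a chunked, sorted sequence by splitting the affected chunk, the way a B-tree leaf splits. This is not HDF5's actual API or data layout: real HDF5 chunks are fixed-size, so a true implementation would have to rework the chunk index in src/H5Dchunk.c and src/H5Dbtree.c rather than simply splice in a ragged chunk as done here.

```python
import bisect

CHUNK_CAP = 4  # toy capacity; real HDF5 chunks are fixed-size and far larger

def insert_with_split(chunks, value):
    """Insert into the covering chunk; split it when it overflows.

    chunks: sorted list of sorted lists (a stand-in for the chunk B-tree).
    Only one chunk's worth of data ever moves, regardless of total size.
    """
    lows = [c[0] for c in chunks]
    i = max(bisect.bisect_right(lows, value) - 1, 0)
    bisect.insort(chunks[i], value)
    if len(chunks[i]) > CHUNK_CAP:  # overflow: split like a B-tree leaf
        mid = len(chunks[i]) // 2
        chunks[i:i + 1] = [chunks[i][:mid], chunks[i][mid:]]

chunks = [[1, 2, 3, 4], [10, 11, 12]]
insert_with_split(chunks, 2.5)
# The full first chunk splits in two; the second chunk is untouched:
# [[1, 2], [2.5, 3, 4], [10, 11, 12]]
```

The appeal of this shape is that an insert costs O(chunk size + index update) instead of rewriting everything after the insertion point; the hard part in HDF5 proper is that the chunk index assumes uniformly sized chunks at fixed logical offsets.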

Quincey


On Apr 25, 2012, at 6:38 AM, Matt Calder wrote:

> […]