Hi,

I have a set of one-dimensional chunked datasets of modest size (larger than available memory). I am looking for the most efficient way, mostly in terms of time, to insert new values into a dataset. A representative example of what I am trying to do: I have a dataset of 10 billion sorted doubles and an in-memory vector of 1000 random (sorted) doubles, and I want to insert the values of the vector into the dataset. One way would be to write out a new dataset, reading the larger one and merging in the vector values as they come up. I could improve on this by carefully writing into the existing dataset, but it would still involve a lot of data movement. I was hoping that, because the dataset is chunked, there may be other ways to accomplish the insertion. Thanks for any suggestions.
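The read-merge-write approach described above can be sketched as follows. This is a minimal, illustrative sketch only: plain Python lists stand in for the on-disk chunks, and the actual HDF5 read/write calls (e.g. via h5py hyperslab selections) are omitted. The function and parameter names are made up for the example.

```python
import heapq

def merge_insert(read_chunks, new_values, write_chunk):
    """Stream-merge a small sorted vector into a large sorted sequence.

    read_chunks : iterable yielding the big dataset's sorted chunks in order
    new_values  : small sorted sequence held entirely in memory
    write_chunk : callback that receives each merged output chunk
    """
    CHUNK = 4  # toy chunk size; a real dataset would use thousands of elements
    buf = []
    # Flatten the chunk stream lazily so only one chunk is resident at a time.
    big = (x for chunk in read_chunks for x in chunk)
    for v in heapq.merge(big, new_values):
        buf.append(v)
        if len(buf) == CHUNK:
            write_chunk(buf)
            buf = []
    if buf:  # flush the final partial chunk
        write_chunk(buf)

# Toy usage: two "on-disk" chunks, two values to insert.
dataset = [[1.0, 3.0, 5.0, 7.0], [9.0, 11.0, 13.0, 15.0]]
inserts = [4.0, 10.0]
out = []
merge_insert(iter(dataset), inserts, out.extend)
# out is now the fully merged sequence: 1, 3, 4, 5, 7, 9, 10, 11, 13, 15
```

The whole pass is O(N) in the big dataset's size regardless of how few values are inserted, which is exactly the cost the question is trying to avoid.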

Matt

Quincey,

Thanks for the reply. My current solution involves breaking the data into smaller datasets and accepting the cost of rewriting the smaller sets. In effect, I swapped chunks for datasets and did the optimizations at the dataset level.
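That partitioning scheme can be sketched like this: keep many small sorted datasets, each covering a key range, and rewrite only the one a new value falls into. The class and method names here are illustrative, not Matt's actual code, and in-memory lists again stand in for the small HDF5 datasets.

```python
import bisect

class PartitionedStore:
    """Toy model of 'chunks swapped for datasets': many small sorted
    datasets, each rewritten wholesale when a value lands in it."""

    def __init__(self, datasets):
        # datasets: list of sorted lists; routing keys are the first elements
        self.datasets = datasets
        self.lows = [d[0] for d in datasets]

    def insert(self, value):
        # Route to the rightmost dataset whose first element <= value.
        i = max(bisect.bisect_right(self.lows, value) - 1, 0)
        ds = self.datasets[i]
        bisect.insort(ds, value)   # rewrite cost is O(len(ds)), not O(total)
        self.lows[i] = ds[0]       # keep the routing key current

store = PartitionedStore([[1, 4, 7], [10, 13], [20, 25]])
store.insert(11)   # rewrites only the middle dataset
store.insert(0)    # rewrites only the first dataset
```

Each insert touches one small dataset, so the rewrite cost drops from the full 10 billion elements to the size of a single partition.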

I would be interested in at least knowing, in theory, how to patch the source code to implement this, and would be more than willing to share the outcome of any such work. Knowing where to begin would be a big help.

Matt


On Wed, Apr 25, 2012 at 7:25 AM, Quincey Koziol <koziol@hdfgroup.org> wrote:

Hi Matt,

On Apr 11, 2012, at 7:23 AM, Matt Calder wrote:

> […]

The HDF5 library doesn't currently perform this sort of insertion on chunked datasets, although it's technically feasible. If you'd like to work on algorithms to add that sort of operation, we'd be happy to guide you through the source code to create a patch that adds the capability. Alternatively, if you'd like to fund this activity, that's also possible.

Quincey

_______________________________________________

Hdf-forum is for HDF software users discussion.

Hdf-forum@hdfgroup.org

http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hi Matt,

> Quincey,
>
> Thanks for the reply. My current solution involves breaking the data into smaller datasets, and accepting the cost of rewriting the smaller sets. In effect, I swapped chunks for datasets and did the optimizations at the dataset level.

That'll definitely work.

> I would be interested in at least knowing in theory how to patch the source code to implement this and would be more than willing to share the outcome of any such work. Knowing where to begin would be a big help.

You'll need to look at the code in src/H5Dchunk.c and src/H5Dbtree.c where the chunks are operated on. However, first, can you write up a short description of exactly what the feature you are planning to add would do, and the interface you'd like to have for that functionality and send it to me? That'll help guide how the feature should be implemented. Feel free to email me off-list at: koziol@hdfgroup.org.
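To make the kind of operation such a patch would implement concrete, here is a toy sketch of inserting into a chunked, sorted sequence by splitting the affected chunk, the way a B-tree leaf splits. This is not HDF5's actual API or data layout: real HDF5 chunks are fixed-size, so a true implementation would have to rework the chunk index in src/H5Dchunk.c and src/H5Dbtree.c rather than simply splice in a ragged chunk as done here.

```python
import bisect

CHUNK_CAP = 4  # toy capacity; real HDF5 chunks are fixed-size and far larger

def insert_with_split(chunks, value):
    """Insert into the covering chunk; split it when it overflows.

    chunks: sorted list of sorted lists (a stand-in for the chunk B-tree).
    Only one chunk's worth of data ever moves, regardless of total size.
    """
    lows = [c[0] for c in chunks]
    i = max(bisect.bisect_right(lows, value) - 1, 0)
    bisect.insort(chunks[i], value)
    if len(chunks[i]) > CHUNK_CAP:  # overflow: split like a B-tree leaf
        mid = len(chunks[i]) // 2
        chunks[i:i + 1] = [chunks[i][:mid], chunks[i][mid:]]

chunks = [[1, 2, 3, 4], [10, 11, 12]]
insert_with_split(chunks, 2.5)
# The full first chunk splits in two; the second chunk is untouched:
# [[1, 2], [2.5, 3, 4], [10, 11, 12]]
```

The appeal of this shape is that an insert costs O(chunk size + index update) instead of rewriting everything after the insertion point; the hard part in HDF5 proper is that the chunk index assumes uniformly sized chunks at fixed logical offsets.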

Quincey


On Apr 25, 2012, at 6:38 AM, Matt Calder wrote:

> […]