How best to model objects that have lists of sub-objects

Hi,

I am unable to find a satisfactory way to model object hierarchies, specifically something as follows:

class B { ... };

class A {
  std::vector<B> m_b;
};

Here are some of my requirements:
   * The number of B's is relatively small, probably fewer than 10
   * There's no upper bound
   * It needs to be possible to append data later
   * The order the items are written needs to be preserved

My initial thought was to have a new dataset for each "list" of sub-objects and then reference it from the owning object. In this model, each 'A' object has its own list of indexes to B objects:

[A DATASET]
data1, "DATASET REF A_B_1"
data2, "DATASET REF A_B_2"
data3, "DATASET REF A_B_3"

[B DATASET]
B1
B2
B3
B4
B5

[A_B_1 DATASET]
Index to B1
Index to B3
Index to B4

[A_B_2 DATASET]
Index to B2
Index to B5

[A_B_3 DATASET]
Index to B3

Unfortunately, at least the way I implemented this, the approach was too slow. It also seemed that performance degraded significantly as more and more datasets were added.

My current approach is to use region references. To preserve the write order, I append each B's index to a LIST dataset, and once all the B's have been written I create region references to ranges of that list.

The above therefore looks like:

[A's DATASET]
data1, "DATASET REGION REF 0,2"
data2, "DATASET REGION REF 3,4"
data3, "DATASET REGION REF 5"

[B DATASET]
B1
B2
B3
B4
B5

[A_B_LIST]
Index to B1
Index to B3
Index to B4
Index to B2
Index to B5
Index to B3
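
For concreteness, here is a minimal sketch of how one such region
reference can be built with the HDF5 1.8 C API; the function, handles
and extents are illustrative rather than the actual code:

#include <hdf5.h>

/* Fill *ref with a region reference covering rows [start, start+count)
   of the A_B_LIST dataset; "file" and "list_dset" are open handles. */
void make_list_ref(hid_t file, hid_t list_dset,
                   hsize_t start, hsize_t count, hdset_reg_ref_t *ref)
{
    hid_t space = H5Dget_space(list_dset);
    H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, NULL, &count, NULL);

    /* The selection itself is serialized into the file's global heap,
       which is where most of the per-reference storage overhead lives. */
    H5Rcreate(ref, file, "A_B_LIST", H5R_DATASET_REGION, space);

    H5Sclose(space);
}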

Time-wise, this performed significantly better than the previous model (and meets my requirements); however, the HDF5 file is now extremely large, mostly due to the region references.

I have a test case with about 10k records, each with 4 region references. If I create and write out all the data with the correct region references, the resulting file is about 2.5 MB. If I write the data with "empty" region references, the file is about 250 KB.

Is this expected? Is there a way I can optimise this?

Finally, is there a standard approach for modelling this kind of data in HDF5 that I should be using?

Many thanks for your time,

Richard

On Thu, 18 Jun 2009 15:36:24 -0500, Richard Corden <richard.corden@gmail.com> wrote:

[...]

This is practically the problem of having a data structure of the kind

vector<vector<B> >

with a vector<B> of varying length in each row, right?

I'd be interested in an elegant solution for this as well. For now I
would flatten the hierarchy into a linear vector<B> and save two
datasets in HDF5: the B's themselves, and a 1-D integer dataset of
indices into that linear array that defines the hierarchy and allows
reconstruction of vector<vector<B> >. It seems you're more or less
doing that already (I have not used references myself yet).
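
A minimal sketch of that flattening in plain C++; the two members map
one-to-one onto two 1-D HDF5 datasets, and the type and function names
are just for illustration:

#include <cstddef>
#include <vector>

struct B { /* ... */ };

struct Flattened {
    std::vector<B>           values;   // all B's, row after row
    std::vector<std::size_t> offsets;  // row i spans [offsets[i], offsets[i+1])
};

Flattened flatten(const std::vector<std::vector<B> >& rows)
{
    Flattened f;
    f.offsets.push_back(0);
    for (std::size_t r = 0; r < rows.size(); ++r) {
        f.values.insert(f.values.end(), rows[r].begin(), rows[r].end());
        f.offsets.push_back(f.values.size());
    }
    return f;
}

std::vector<std::vector<B> > reconstruct(const Flattened& f)
{
    std::vector<std::vector<B> > rows;
    for (std::size_t i = 0; i + 1 < f.offsets.size(); ++i)
        rows.push_back(std::vector<B>(f.values.begin() + f.offsets[i],
                                      f.values.begin() + f.offsets[i + 1]));
    return rows;
}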

  Werner


--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362


Hi Richard,

On Jun 18, 2009, at 3:36 PM, Richard Corden wrote:

[...]

  Hmm, you basically want a dataset with a variable-length sequence at each element, and to append to each element easily. Unfortunately, that last requirement (appending to the variable-length sequence of an element) is not currently supported in a straightforward way: you'd have to read the element, append the new information to the sequence, and re-write the element to the file.
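
A sketch of that read/append/re-write cycle for a single element,
assuming the dataset's type is a variable-length sequence over some
compound type b_type; every name here is illustrative:

#include <hdf5.h>
#include <cstring>
#include <vector>

/* Read the variable-length sequence at "row" of "dset", append one
   more B (pointed to by new_b), and write the element back. */
void append_to_element(hid_t dset, hid_t b_type, hsize_t row,
                       const void *new_b)
{
    size_t  b_size   = H5Tget_size(b_type);
    hid_t   vl_type  = H5Tvlen_create(b_type);
    hid_t   file_spc = H5Dget_space(dset);
    hsize_t one      = 1;
    H5Sselect_hyperslab(file_spc, H5S_SELECT_SET, &row, NULL, &one, NULL);
    hid_t mem_spc = H5Screate_simple(1, &one, NULL);

    hvl_t elem;
    H5Dread(dset, vl_type, mem_spc, file_spc, H5P_DEFAULT, &elem);

    /* Copy the old sequence into a larger buffer, append the new B. */
    std::vector<char> buf((elem.len + 1) * b_size);
    std::memcpy(&buf[0], elem.p, elem.len * b_size);
    std::memcpy(&buf[elem.len * b_size], new_b, b_size);

    hvl_t grown;
    grown.len = elem.len + 1;
    grown.p   = &buf[0];
    H5Dwrite(dset, vl_type, mem_spc, file_spc, H5P_DEFAULT, &grown);

    H5Dvlen_reclaim(vl_type, mem_spc, H5P_DEFAULT, &elem);  /* free HDF5's buffer */
    H5Sclose(mem_spc);
    H5Sclose(file_spc);
    H5Tclose(vl_type);
}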

  Have you tried creating a group for each 'A' and making each 'B' a dataset with an unlimited 1-D dimension, with a datatype that matches 'B' in memory? Then, each vector of 'B' elements can be easily extended and appended to in an independent way.
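
A sketch of that layout, assuming b_type is a compound datatype
matching 'B' in memory; the chunk size and all names are illustrative:

#include <hdf5.h>

/* One group per 'A'; inside it, a chunked 1-D dataset of B records
   with unlimited maximum extent. */
hid_t create_b_dataset(hid_t a_group, hid_t b_type)
{
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 16;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);   /* chunking is required for extension */
    hid_t dset  = H5Dcreate2(a_group, "B", b_type, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

/* Extend the dataset by one row and write the new B at the end. */
void append_b(hid_t dset, hid_t b_type, const void *b)
{
    hid_t   spc = H5Dget_space(dset);
    hsize_t n;
    H5Sget_simple_extent_dims(spc, &n, NULL);
    H5Sclose(spc);

    hsize_t grown = n + 1, one = 1;
    H5Dset_extent(dset, &grown);

    hid_t file_spc = H5Dget_space(dset);
    H5Sselect_hyperslab(file_spc, H5S_SELECT_SET, &n, NULL, &one, NULL);
    hid_t mem_spc = H5Screate_simple(1, &one, NULL);
    H5Dwrite(dset, b_type, mem_spc, file_spc, H5P_DEFAULT, b);
    H5Sclose(mem_spc);
    H5Sclose(file_spc);
}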

  Quincey


Hi,

Richard Corden wrote:

[...]

To update my status on this: I have settled on a technique for solving this, using a linked-list model. Basically, there is an extra column in B's dataset that points to the previous entry in the list (if any):

[A DATASET]
A_data1, B(3)
A_data2, B(4)

[B DATASET]
0: B_data1, B(-1)
1: B_data2, B(0)
2: B_data3, B(-1)
3: B_data4, B(1)
4: B_data5, B(2)

The above models an object hierarchy of:

A1 = { A_data1, { B_data4, B_data2, B_data1 } };
A2 = { A_data2, { B_data5, B_data3 } } ;
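
A sketch of the read side of this model: walking the chain from A's
head index yields the B's most recently written first, matching the
example above (names are illustrative):

#include <vector>

/* prev[i] holds the extra column of B's dataset: the index of the
   previous B in the list, or -1 for the first one written. */
std::vector<long> collect_b_indices(long head, const std::vector<long>& prev)
{
    std::vector<long> chain;
    for (long i = head; i != -1; i = prev[i])
        chain.push_back(i);   // head 3 yields 3, 1, 0: B_data4, B_data2, B_data1
    return chain;
}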

Regards,

Richard

Hi Quincey,

Quincey Koziol wrote:

Hi Richard,

[...]

    Have you tried creating a group for each 'A' and making each 'B' a dataset with an unlimited 1-D dimension, with a datatype that matches 'B' in memory? Then, each vector of 'B' elements can be easily extended and appended to in an independent way.

I haven't tried this yet, but it will have to be a last resort, as I've invested quite a bit of time designing a framework based on HDF5 where each object is a dataset row. (FYI, I've written reader/writer objects whose interface uses tuples from the Boost library (www.boost.org). Each part of the tuple has an identifier and a value, and these are written out as if they were an HDF5 compound type.)

I've managed to work around the size problem by storing the start and end indexes rather than references. I always know which dataset they refer to, so that information wasn't needed in the file. So far, read/write times are good and the file size is good too.
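
A sketch of what such a row might look like, with hypothetical field
names; the two integer bounds replace the region reference and index
into the shared A_B_LIST dataset:

#include <hdf5.h>
#include <cstddef>

struct ARecord {
    double  data;     /* stand-in for A's real payload */
    hsize_t b_start;  /* first entry in A_B_LIST       */
    hsize_t b_end;    /* one past the last entry       */
};

hid_t make_a_type()
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(ARecord));
    H5Tinsert(t, "data",    offsetof(ARecord, data),    H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "b_start", offsetof(ARecord, b_start), H5T_NATIVE_HSIZE);
    H5Tinsert(t, "b_end",   offsetof(ARecord, b_end),   H5T_NATIVE_HSIZE);
    return t;
}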

Thanks for your help!

Cheers,

Richard


The ROOT software package seems to have a successful object-state
persistence model, with scalability and a file format specification
similar to that of HDF5. With their model in mind, I could see
scalable, flexible object persistence in HDF5 without too much
difficulty.

The ROOT documentation describes their persistence model in detail.
It's worth seeing what they do and how it might be adapted to our
favorite data storage facility... HDF5 :-)

On Tue, Jun 23, 2009 at 11:34 AM, Richard Corden <richard.corden@gmail.com> wrote:

[...]

Richard,

mapping a C++ object tree to HDF5 objects is something I also had in
mind to set up at some point. Leaving aside the "resurrection" (rather
than de-serialization) part of the tree, where cyclic graphs can prove
a nightmare... what I don't quite get about your approach is why you
say "In this model, each 'A' object has its own list of indexes to B
objects". std::vector<B> is a container of B objects, and therefore a
dataset in its own right, no? Wouldn't this look like:

[A DATASET]
data1, "DATASET REF A_B_1"
data2, "DATASET REF A_B_2"
data3, "DATASET REF A_B_3"

[A_B_1 DATASET]
B1
B3
B4

[A_B_2 DATASET]
B2
B5

[A_B_3 DATASET]
B3

Am I missing something?

Peter,

could you send a link to that 'Root' software package? 'Root' is a really
bad name when googling....

Regards,

-- dimitris

On Tue, Jun 23, 2009 at 2:12 PM, Dimitris Servis <servisster@gmail.com> wrote:

[...]

I hope this helps us find an efficient and flexible model for object
persistence that fits the existing HDF5 data storage model. ROOT is an
extremely interesting package... I just wish it were HDF5-based, with
object persistence as a secondary consideration.

http://root.cern.ch/drupal/
