Help on "how to use" HDF5 format.

Hi,

I am considering migrating to HDF5 for the output of our tools (C/C++ static analysis). I understand that HDF5 is primarily optimised for huge datasets; however, a very attractive feature for us is the ability to separate the data into different groups in the file. The benefit is that the different clients of the information can quickly access just the information they are interested in.

The following kinds of information need to be stored:

   1. Lists of "diagnostics", "filenames", "locations".
   2. Entities/Type/Scope/Expression hierarchies.
   3. Call graphs, include trees.
   4. Properties, fundamentally key/value pairs.

We've tested HDF5 for the first kind of data and the results are good. The record lengths for the first category of data can be made consistent, and in general all of that data needs to be read in one go. The question is: how should we model the other, more complicated kinds of data, for example a type hierarchy? In C++ this would be implemented using inheritance, something (very basic) like the following:

    class Type { };

    class Fundamental : public Type {};

    class Function : public Type {
      Type * m_return;
      std::vector<Type*> m_parameters;
    };

    class Class : public Type {
      std::vector<Members> m_member;
    };

Would the above be best modelled using a group with attributes? What is the usual way to store hierarchies of data in HDF5 (if there is one!)?

Many thanks for your help,

Regards,

Richard

Hi Richard,


On Apr 28, 2009, at 5:37 AM, Richard Corden wrote:


Would the above be best modelled using a group with attributes? What is the usual way to store hierarchies of data in HDF5 (if there is one!)?

  You probably have multiple options, including the grouping ideas you present. You could also use compound datatypes to create a pseudo-class hierarchy...
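  For illustration, here is a minimal sketch of that compound-datatype approach with the C API (the struct, field and dataset names are purely illustrative): the members of the "base" Type are gathered into one compound type, which is then embedded as the first member of each derived compound, roughly mirroring single inheritance.

    #include "hdf5.h"

    typedef struct { int kind; long long id; }           base_t;     /* Type     */
    typedef struct { base_t base; long long return_id; } function_t; /* Function */

    /* Build a compound datatype for Function that embeds the base fields. */
    static hid_t make_function_type(void)
    {
      hid_t base_tid = H5Tcreate(H5T_COMPOUND, sizeof(base_t));
      H5Tinsert(base_tid, "kind", HOFFSET(base_t, kind), H5T_NATIVE_INT);
      H5Tinsert(base_tid, "id",   HOFFSET(base_t, id),   H5T_NATIVE_LLONG);

      hid_t func_tid = H5Tcreate(H5T_COMPOUND, sizeof(function_t));
      H5Tinsert(func_tid, "base",      HOFFSET(function_t, base),      base_tid);
      H5Tinsert(func_tid, "return_id", HOFFSET(function_t, return_id), H5T_NATIVE_LLONG);

      H5Tclose(base_tid);   /* H5Tinsert copied the member type into func_tid */
      return func_tid;      /* use as the element type of a "Function" dataset */
    }

  The variable-length m_parameters member could become a variable-length field (H5Tvlen_create) of type IDs, or a separate dataset indexed by function ID; which works better probably depends on how the readers access it.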

  Quincey

Hi,

Quincey Koziol wrote:


You probably have multiple options, including the grouping ideas you present. You could also use compound datatypes to create a pseudo-class hierarchy...

Maybe I should rephrase the question in terms of performance. Which of the following would be faster:

1) Distinct datasets for each concrete class type.

2) A single dataset representing a complete, flattened hierarchy, using compound datatypes to filter the members (sketched below).

3) A group for each distinct type, with the data stored in attributes.
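For concreteness, option 2 would rely on HDF5's ability to read only a subset of a compound dataset's members: the memory datatype passed to H5Dread names just the fields a client wants. A rough sketch (field and dataset names are purely illustrative):

    #include "hdf5.h"

    /* Each element of an (assumed) "/types/all" dataset holds a large
       compound; a client that only needs "id" and "kind" reads just
       those two members.                                              */
    typedef struct { long long id; int kind; } id_kind_t;

    static void read_ids_and_kinds(hid_t file, id_kind_t *buf) /* buf sized to the dataset extent */
    {
      hid_t mem_tid = H5Tcreate(H5T_COMPOUND, sizeof(id_kind_t));
      H5Tinsert(mem_tid, "id",   HOFFSET(id_kind_t, id),   H5T_NATIVE_LLONG);
      H5Tinsert(mem_tid, "kind", HOFFSET(id_kind_t, kind), H5T_NATIVE_INT);

      hid_t dset = H5Dopen2(file, "/types/all", H5P_DEFAULT);
      /* members are matched by name; fields absent from mem_tid are skipped */
      H5Dread(dset, mem_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

      H5Dclose(dset);
      H5Tclose(mem_tid);
    }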

Thanks for your help,

Richard


I'm actually having a similar issue in my programs. A main question is
whether the memory layout of derived classes differs from that of the base
class. This is usually the case, since derived classes add new members.
Adding elements of inhomogeneous types to a dataset is not good... similar
to a dataset that consists of a std::vector<int> at each element.

I'd go for the distinct datasets for each class.

How many instances do you have for each class? It's basically the same
issue as

std::vector<Class>

vs.

std::vector<Fundamental>

Which class would you use in an array? How would you detect the class
type within such an array of class instances? Basically it's the same
question with an HDF5 dataset, which stores an array of class instances.
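
Concretely, "distinct datasets for each class" could look roughly like this:
one dataset per concrete class, each with its own compound type, and the
Type* pointers flattened into row indices (or object references) into the
other datasets. A sketch with purely illustrative names:

    #include "hdf5.h"

    typedef struct { long long id; int kind; }             fund_rec_t; /* Fundamental */
    typedef struct { long long id; long long return_row; } func_rec_t; /* Function: return type is a row in another /types dataset */

    /* Create one dataset per concrete class under a /types group. */
    static void create_type_datasets(hid_t file, hsize_t n_fund, hsize_t n_func)
    {
      hid_t fund_tid = H5Tcreate(H5T_COMPOUND, sizeof(fund_rec_t));
      H5Tinsert(fund_tid, "id",   HOFFSET(fund_rec_t, id),   H5T_NATIVE_LLONG);
      H5Tinsert(fund_tid, "kind", HOFFSET(fund_rec_t, kind), H5T_NATIVE_INT);

      hid_t func_tid = H5Tcreate(H5T_COMPOUND, sizeof(func_rec_t));
      H5Tinsert(func_tid, "id",         HOFFSET(func_rec_t, id),         H5T_NATIVE_LLONG);
      H5Tinsert(func_tid, "return_row", HOFFSET(func_rec_t, return_row), H5T_NATIVE_LLONG);

      hid_t fund_sp = H5Screate_simple(1, &n_fund, NULL);
      hid_t func_sp = H5Screate_simple(1, &n_func, NULL);
      hid_t grp     = H5Gcreate2(file, "/types", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      H5Dclose(H5Dcreate2(grp, "Fundamental", fund_tid, fund_sp,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT));
      H5Dclose(H5Dcreate2(grp, "Function", func_tid, func_sp,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT));

      H5Gclose(grp);
      H5Sclose(fund_sp); H5Sclose(func_sp);
      H5Tclose(fund_tid); H5Tclose(func_tid);
    }

Each dataset stays homogeneous, and the per-class datasets can be read or
written independently by the clients that need them.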

If you only want to store a single instance, not an array/dataset of classes,
then you can use a dataset with a scalar (H5S_SCALAR) dataspace and a compound
type corresponding to the class layout. Compound types can include other
compound types and can therefore be hierarchical.
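
A small sketch of that single-instance case (the nested compound below is only illustrative):

    #include "hdf5.h"

    typedef struct { int line; int column; }       loc_t;
    typedef struct { long long id; loc_t where; }  entity_t;

    /* Write one instance into a dataset with a scalar dataspace. */
    static void write_one(hid_t file, const entity_t *e)
    {
      hid_t loc_tid = H5Tcreate(H5T_COMPOUND, sizeof(loc_t));
      H5Tinsert(loc_tid, "line",   HOFFSET(loc_t, line),   H5T_NATIVE_INT);
      H5Tinsert(loc_tid, "column", HOFFSET(loc_t, column), H5T_NATIVE_INT);

      hid_t ent_tid = H5Tcreate(H5T_COMPOUND, sizeof(entity_t));
      H5Tinsert(ent_tid, "id",    HOFFSET(entity_t, id),    H5T_NATIVE_LLONG);
      H5Tinsert(ent_tid, "where", HOFFSET(entity_t, where), loc_tid);  /* nested compound */

      hid_t space = H5Screate(H5S_SCALAR);   /* a single element, no dimensions */
      hid_t dset  = H5Dcreate2(file, "/entity", ent_tid, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      H5Dwrite(dset, ent_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, e);

      H5Dclose(dset); H5Sclose(space);
      H5Tclose(ent_tid); H5Tclose(loc_tid);
    }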

  Werner


--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362
