Attributes vs. Small Datasets

Hi all,

I'm currently in a discussion with team members on designing an HDF5
structure to contain results from several analyses. One of the topics that
came up was storing metadata for the primary group, which contains a second
group (called Data) that encapsulates the corresponding result datasets.
One side of the argument is to have a group called "Config" which holds
single-element datasets corresponding to each attribute - the benefit of
doing this is being able to group "attributes" together if there's a common
theme between them. The other side of the argument (in which I firmly
stand) is assigning those attributes to the parent group. Granted, you lose
the ability to group similar attributes, but it seems to be a bit more
efficient. Here's an illustration:

First framework:
MyGroup (Group)
  -Config (Group)
     -Attribute1 (Dataset)
     -Attribute2 (Dataset)
     -Attribute3 (Dataset)
     -AttributeGroup (Group)
        -Attribute4 (Dataset)
        -Attribute5 (Dataset)
        -Attribute6 (Dataset)
  -Data (Group)

Second framework:
MyGroup (Group)
  -Attribute1 (Attribute)
  -Attribute2 (Attribute)
  -Attribute3 (Attribute)
  -Attribute4 (Attribute)
  -Attribute5 (Attribute)
  -Attribute6 (Attribute)
  -Data (Group)

There are probably going to be 40-50 of these metadata values. Can someone
offer some insight as to the efficiency of both schemes?
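
For concreteness, here's a rough sketch of what the two options might look like with the HDF5 C API. This isn't our actual code - the file name and the single placeholder value are made up for illustration:

#include <hdf5.h>

int main(void)
{
    /* Hypothetical illustration: write one metadata value both ways -- as a
     * scalar dataset under MyGroup/Config and as an attribute on MyGroup. */
    double value = 3.14;
    hid_t  file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t  grp   = H5Gcreate2(file, "MyGroup", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t  space = H5Screate(H5S_SCALAR);   /* single-element dataspace */

    /* First framework: Config group holding a single-element dataset */
    hid_t cfg  = H5Gcreate2(grp, "Config", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(cfg, "Attribute1", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &value);

    /* Second framework: the same value as an attribute on the parent group */
    hid_t attr = H5Acreate2(grp, "Attribute1", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &value);

    H5Aclose(attr);
    H5Dclose(dset);
    H5Gclose(cfg);
    H5Sclose(space);
    H5Gclose(grp);
    H5Fclose(file);
    return 0;
}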

Thanks!

···

--
Stefan Novak
Sent from Greenbelt, Maryland, United States

Hm, using datasets for small data is quite inefficient - maybe not so much in
speed, but certainly in size. I had cases where an HDF5 file was significantly
larger than its corresponding dump as a text file.

If you want to group attributes together, have you looked into named datatypes?

When you create an attribute, you need to specify a data type with it. This can
be a predefined datatype like H5T_NATIVE_FLOAT, or something user-defined. These
user-defined data types can be transient, or saved into a file, at which point they
become "named data types". These named data types can be equipped with attributes
as well. All attributes that use such a named data type then implicitly share
those same attributes.

That's how I do it in a similar case. Maybe that solution works for you as well.
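
For illustration, here is a minimal sketch of that idea in C - commit a datatype to the file, hang a shared attribute on it, then create attributes that use it. The file name, the "units" attribute, and the type name "/ConfigFloat" are made-up examples, not anything prescribed by HDF5:

#include <string.h>
#include <hdf5.h>

int main(void)
{
    hid_t file  = H5Fcreate("named_type.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate(H5S_SCALAR);

    /* Copy a predefined type and commit ("name") it into the file */
    hid_t dtype = H5Tcopy(H5T_NATIVE_FLOAT);
    H5Tcommit2(file, "/ConfigFloat", dtype, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Attach shared metadata to the named datatype itself */
    const char *units = "m/s";
    hid_t str_t = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_t, strlen(units) + 1);
    hid_t tattr = H5Acreate2(dtype, "units", str_t, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(tattr, str_t, units);

    /* Every attribute created with this named type shares "units" implicitly */
    float value = 1.5f;
    hid_t root = H5Gopen2(file, "/", H5P_DEFAULT);
    hid_t attr = H5Acreate2(root, "Attribute1", dtype, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_FLOAT, &value);

    H5Aclose(attr);
    H5Gclose(root);
    H5Aclose(tattr);
    H5Tclose(str_t);
    H5Tclose(dtype);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}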

  Werner

···

On Fri, 31 Jul 2009 10:02:06 -0500, Stefan Novak <stefan.louis.novak@gmail.com> wrote:

--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Hi Stefan,

  Hmm, we've talked about adding some form of hierarchical attributes off and on for a few years, but I don't see a profound need for them yet, so don't expect them anytime soon. Both of the ways you outline above would work, and if you use the "compact" storage option for the datasets, they would be close to the efficiency of using attributes (still somewhat larger, but probably close). I do lean toward the method you prefer - adding the attributes to the parent group (or whatever object actually needs the additional metadata). After all, that's what attributes are designed to do: provide additional user-defined metadata about a particular object in an HDF5 file. The "Config" group approach loses that direct attachment and is less self-describing.
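
For illustration, a minimal sketch of the compact option (file and dataset names are made up) - it simply asks the library to keep the raw data of a small dataset in the dataset's object header rather than in a separate block:

#include <hdf5.h>

int main(void)
{
    double value = 42.0;
    hid_t  file  = H5Fcreate("compact.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t  space = H5Screate(H5S_SCALAR);

    /* Ask for compact layout: raw data is kept in the dataset's object header */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(dcpl, H5D_COMPACT);

    hid_t dset = H5Dcreate2(file, "Attribute1", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &value);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}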

  Quincey

···

On Jul 31, 2009, at 10:02 AM, Stefan Novak wrote:

I can't be certain (I could, but it'd mean I'd have to go read the manual
right now and I don't have time for that), but I thought HDF5 was 'smart'
about having a slew of 'small' attributes on an object, such that it
could read/write them in a single I/O request to the underlying driver
instead of issuing a separate I/O request for each one. If each of
your 'attribute' datums is on the order of a few ints or doubles or
whatever, then I think HDF5 would deem them 'small' and handle them
'smartly', meaning a single I/O request. That may be important to you in
terms of performance.

Also, a common thing I see many HDF5 users do is create 'attributes'
which store information natively associated with a dataset by the HDF5
library itself. For example, I'll see cases where users store the
dimensions of a dataset as attribute data associated with the dataset.
That is a waste because the HDF5 library already knows the dimensions of the
dataset. So, that's just something to be aware of: what, if anything, in
your attribute data is already stored by HDF5 natively, and which
HDF5 functions are needed to query it?
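
For example, instead of an extra 'dims' attribute, the dimensions can be pulled straight from the dataset's dataspace. A minimal sketch (the file and dataset paths are made up):

#include <stdio.h>
#include <hdf5.h>

int main(void)
{
    hid_t file  = H5Fopen("results.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset  = H5Dopen2(file, "/MyGroup/Data/SomeResult", H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);

    /* The library already knows the rank and dimensions of the dataset */
    int     ndims = H5Sget_simple_extent_ndims(space);
    hsize_t dims[H5S_MAX_RANK];
    H5Sget_simple_extent_dims(space, dims, NULL);   /* NULL: don't need max dims */

    for (int i = 0; i < ndims; i++)
        printf("dim[%d] = %llu\n", i, (unsigned long long)dims[i]);

    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}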

Mark

···

On Fri, 2009-07-31 at 12:56 -0500, Werner Benger wrote:


--
Mark C. Miller, Lawrence Livermore National Laboratory
email: mailto:miller86@llnl.gov
(M/T/W) (925)-423-5901 (!!LLNL BUSINESS ONLY!!)
(Th/F) (530)-753-8511 (!!LLNL BUSINESS ONLY!!)

Just as a comment: one reason to store dimensionality in addition to HDF5's
internal dataset property is

a.) different memory order, like FORTRAN vs. C order (actually a flag
     or permutation list for each axis would do as well)

b.) grouping datasets together that have the same dimensionality, such
     as attaching the dimensionality as an attribute to a group which contains
     datasets of the same size. That would be the same as a "shared dataspace",
     which I remember was once on the HDF5 to-do list.

  Werner

···

On Mon, 03 Aug 2009 14:35:15 -0500, Mark Miller <miller86@llnl.gov> wrote:


--
___________________________________________________________________________
Dr. Werner Benger <werner@cct.lsu.edu> Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Hi Mark,

  Yes, that is the case. If attributes are <64KB, they are stored in the object's header in the file by default and brought in with one (or a small number of) I/O operations.
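
For anyone who wants to tune that behavior, here's a minimal sketch (made-up file and group names) using H5Pset_attr_phase_change on a group creation property list; with the newer file format the library can also move a large attribute count into "dense" storage, and these are the thresholds it uses:

#include <hdf5.h>

int main(void)
{
    hid_t file = H5Fcreate("attrs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Keep up to 64 attributes compactly in the object header; if the count
     * ever exceeds that, they move to dense storage, and move back to compact
     * once it falls below 56. The numbers here are made up for illustration. */
    hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
    H5Pset_attr_phase_change(gcpl, 64, 56);

    hid_t grp = H5Gcreate2(file, "MyGroup", H5P_DEFAULT, gcpl, H5P_DEFAULT);

    H5Gclose(grp);
    H5Pclose(gcpl);
    H5Fclose(file);
    return 0;
}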

  Quincey

···

On Aug 3, 2009, at 2:35 PM, Mark Miller wrote:



Thanks for everyone's input. After pulling a couple of Jedi mind-control
tricks, I was able to convince my team to store those single-element datums
as attributes. They enjoy it because they can view the long list of
attributes at once within the HDFView application. For me, it's easier to
pool the list of attributes for a given object in the tool that I'm piecing
together. Overall, I think everyone is happy.
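
For what it's worth, pooling the attributes of an object boils down to something like this sketch (the file and group names are made up, not our real layout):

#include <stdio.h>
#include <hdf5.h>

static herr_t print_attr(hid_t loc, const char *name, const H5A_info_t *info,
                         void *op_data)
{
    (void)loc; (void)info; (void)op_data;   /* unused in this sketch */
    printf("attribute: %s\n", name);
    return 0;                               /* 0 = continue iterating */
}

int main(void)
{
    hid_t file = H5Fopen("results.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t grp  = H5Gopen2(file, "/MyGroup", H5P_DEFAULT);

    /* Visit every attribute on the group and print its name */
    H5Aiterate2(grp, H5_INDEX_NAME, H5_ITER_NATIVE, NULL, print_attr, NULL);

    H5Gclose(grp);
    H5Fclose(file);
    return 0;
}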
Thanks again!

···

On Mon, Aug 3, 2009 at 11:47 PM, Quincey Koziol <koziol@hdfgroup.org> wrote:


--
Stefan Novak
Sent from Arlington, Virginia, United States