Suppose I want to attach a couple of arrays as attributes, each at most, say, 1024 double-precision elements (about 8 kB). The datasets themselves might be x GB.
1) Can I tell HDF5 to reserve a minimum size in the object header so that I know in advance that the attributes will fit? (Or does the object header only contain pointers to other structures anyway?)
2) Is there any performance advantage, in terms of fewer metadata transactions, to using attributes over datasets in a parallel I/O context? (The file is written in parallel; only rank zero needs to actually write the attributes.)
Question motivated by ....
From the docs ...
"The HDF5 format and I/O library are designed with the assumption that attributes are small datasets. They are always stored in the object header of the object they are attached to. Because of this, large datasets should not be stored as attributes. How large is "large" is not defined by the library and is up to the user's interpretation. (Large datasets with metadata can be stored as supplemental datasets in a group with the primary dataset.)"
Thanks
JB
···
--
John Biddiscombe, email: biddisco@cscs.ch
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82
> Suppose I want to attach a couple of arrays as attributes, each at most, say, 1024 double-precision elements (about 8 kB). The datasets themselves might be x GB.
> 1) Can I tell HDF5 to reserve a minimum size in the object header so that I know in advance that the attributes will fit? (Or does the object header only contain pointers to other structures anyway?)
This feature isn't available through the public API currently. (I have thought about exposing the internal part of the library that would allow it, but it seemed like it would only be used by a very small portion of the user base)
> 2) Is there any performance advantage, in terms of fewer metadata transactions, to using attributes over datasets in a parallel I/O context? (The file is written in parallel; only rank zero needs to actually write the attributes.)
Hmm, as usual, it depends. Are all the processes going to be accessing the attribute? If not, you could create an attribute holding an object reference to an auxiliary dataset, and then read that dataset in when needed.
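For example (a rough, untested sketch - the "/grid_x" and "grid_x_ref" names are just placeholders, and all error checking is omitted):

#include "hdf5.h"

/* Store a large-ish array as its own dataset and attach only a small
 * object reference to it as an attribute on the primary dataset. */
void attach_aux_reference(hid_t file_id, hid_t primary_dset,
                          const double *coords, hsize_t n)
{
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t aux = H5Dcreate2(file_id, "/grid_x", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(aux, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, coords);

    /* The attribute itself is just one object reference - a few bytes. */
    hobj_ref_t ref;
    H5Rcreate(&ref, file_id, "/grid_x", H5R_OBJECT, -1);
    hid_t ref_space = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate2(primary_dset, "grid_x_ref", H5T_STD_REF_OBJ,
                            ref_space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_STD_REF_OBJ, &ref);

    H5Aclose(attr); H5Sclose(ref_space); H5Dclose(aux); H5Sclose(space);
}

A reader that wants the array can H5Aread the reference and open the dataset with H5Rdereference; everyone else only pays for the few bytes of the reference in the object header.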
Quincey
···
On May 4, 2011, at 2:10 AM, Biddiscombe, John A. wrote:
> Question motivated by ....
> From the docs ...
> "The HDF5 format and I/O library are designed with the assumption that attributes are small datasets. They are always stored in the object header of the object they are attached to. Because of this, large datasets should not be stored as attributes. How large is "large" is not defined by the library and is up to the user's interpretation. (Large datasets with metadata can be stored as supplemental datasets in a group with the primary dataset.)"
> Thanks
> JB
> --
> John Biddiscombe, email: biddisco@cscs.ch, http://www.cscs.ch/
> CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
> Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82
Seems like something the NetCDF-4 folks could use to good effect, if
they don't already. After exiting NetCDF define mode, the size of the
attributes and objects will be known. NetCDF callers are familiar with
the potential pain of re-entering define mode.
==rob
···
On Wed, May 04, 2011 at 11:13:08AM -0500, Quincey Koziol wrote:
> 1) Can I tell HDF5 to reserve a minimum size in the object header so that I know in advance that the attributes will fit? (Or does the object header only contain pointers to other structures anyway?)
> This feature isn't available through the public API currently. (I have thought about exposing the internal part of the library that would allow it, but it seemed like it would only be used by a very small portion of the user base)
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
Hmm, as usual, it depends. Are all the processes going to be accessing the attribute? If not, you could create an attribute holding an object reference to an auxiliary dataset, and then read that dataset in when needed.
<<
No, all the ranks already have the information; it is only written so that post-processing code can use it. It contains grid spacings/coordinates for irregular x/y/z intervals; it'd be, for example, 3 arrays (x/y/z) of 1024 or so elements. None of the ranks needs to read it back. (I'd prefer to write everything into one file for cleanliness.)
Ah, good point. (Although adding new attributes to an existing HDF5 object can be done at any point, without any of the pain that netCDF has for re-entering define mode)
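For example (hypothetical file and dataset names, error checking omitted):

#include "hdf5.h"

/* Reopen an existing file and add an attribute to a dataset after the
 * fact - no "define mode" to re-enter. */
int main(void)
{
    hid_t file  = H5Fopen("example.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t dset  = H5Dopen2(file, "/data", H5P_DEFAULT);
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t attr  = H5Acreate2(dset, "version", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT);
    int version = 2;
    H5Awrite(attr, H5T_NATIVE_INT, &version);
    H5Aclose(attr); H5Sclose(space); H5Dclose(dset); H5Fclose(file);
    return 0;
}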
Quincey
···
On May 4, 2011, at 11:21 AM, Rob Latham wrote:
> On Wed, May 04, 2011 at 11:13:08AM -0500, Quincey Koziol wrote:
>> 1) Can I tell HDF5 to reserve a minimum size in the object header so that I know in advance that the attributes will fit? (Or does the object header only contain pointers to other structures anyway?)
>> This feature isn't available through the public API currently. (I have thought about exposing the internal part of the library that would allow it, but it seemed like it would only be used by a very small portion of the user base)
> Seems like something the NetCDF-4 folks could use to good effect, if
> they don't already. After exiting NetCDF define mode, the size of the
> attributes and objects will be known. NetCDF callers are familiar with
> the potential pain of re-entering define mode.
On May 4, 2011, at 2:01 PM, Biddiscombe, John A. wrote:
> Quincey
> Hmm, as usual, it depends. Are all the processes going to be accessing the attribute? If not, you could create an attribute holding an object reference to an auxiliary dataset, and then read that dataset in when needed.
> <<
> No, all the ranks already have the information; it is only written so that post-processing code can use it. It contains grid spacings/coordinates for irregular x/y/z intervals; it'd be, for example, 3 arrays (x/y/z) of 1024 or so elements. None of the ranks needs to read it back. (I'd prefer to write everything into one file for cleanliness.)
Then I might recommend some other way of storing that information, since the object header (containing the attributes) will be read by all the processes when they open the dataset. (It's probably not a big deal in the long run, though.)
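For example, following the "supplemental datasets" suggestion from the docs excerpt you quoted - a rough, untested serial sketch (the "/grid" group and dataset names are just placeholders; in a parallel program, remember that object creation in HDF5 is a collective operation):

#include "hdf5.h"

/* Write the x/y/z coordinate arrays as small supplemental datasets in a
 * "/grid" group, instead of as attributes in the object header. */
void write_grid_coords(hid_t file_id, const double *x, const double *y,
                       const double *z, hsize_t n)
{
    hid_t grp   = H5Gcreate2(file_id, "/grid", H5P_DEFAULT, H5P_DEFAULT,
                             H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &n, NULL);
    const char   *names[3] = {"x", "y", "z"};
    const double *bufs[3]  = {x, y, z};

    for (int i = 0; i < 3; i++) {
        hid_t d = H5Dcreate2(grp, names[i], H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(d, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
                 bufs[i]);
        H5Dclose(d);
    }
    H5Sclose(space);
    H5Gclose(grp);
}

That keeps the dataset's object header small, and readers that don't care about the grid never touch the coordinate arrays.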