HDF5 C++ API and programming model

Hi HDF5-users,

   I am new to HDF5 but an experienced C++ programmer. Having worked with many
mature open-source libraries I note a few things about the HDF5 C++ API -
please correct me where I am wrong.

    I am aware that workarounds exist for all the issues I raise, but I am
simply trying to bring out, from experience, areas where I believe the current
HDF5 C++ API clashes with expectations and a certain ideal(ized) design
philosophy (IMHO).

    But before I start, please let me express my great appreciation for HDF5
as a scalable, cross-platform, open-source standard for large-volume
computational data storage and transfer, and my gratitude for making it
available as a free download.

ISSUE 1: Excessive/Inappropriate use of TRY-CATCH
-----------------------------------------------

[Attachments: hdf5-utils.cpp (1.98 KB), hdf5-utils.h (2.11 KB)]
   We are forced to use try-catch blocks like if-else blocks - there is a
conspicuous absence of query functions for checking whether a group or
dataset exists. Instead, we have to call the openGroup or openDataSet
functions and trap the exception if the call fails.

   There are a few issues that this creates:

1. It forces an alternate programming approach onto otherwise conventional,
more meaningful and readable code - exceptions are no longer just for
exceptional situations (e.g., does the absence of a dataset in an existence
query really constitute an exception, or just an expected failure case when
the producer and consumer of the data happen to be different?)

2. It disallows the use of compiler flags like -fno-exceptions in g++,
because the library depends on exceptions to guarantee correctness. Exceptions
cause the compiler to link a heavier runtime into the executable
and can have performance implications even if the actual code doesn't use
them (what the compiler can infer about our code from its static
analysis is limited). Therefore, by including HDF5 to store partial
calculation results in my nested loops, I am forced to switch
from -fno-exceptions to -fexceptions and risk introducing "excess baggage"
that I could previously shed. This has a cascade effect on my whole code.

   In a nutshell, introducing HDF5 into my code has caused a minor
architecture change to my whole code.

ISSUE 2: ACCESSORS AND QUERIES FOR OBJECT (TYPES)
----------------------------------------------

1. The v1.6.x API allowed querying the type of an object. This allows
switch-case blocks to take an action on each sub-item of a group
depending on its type. For example, it is easy to write a (graphical) HDF5
file browser with such an API. IIUC, with v1.8.x, some functions like
CommonFG::getObjTypeByName() are deprecated, so achieving the above example
use case will now involve a whole bunch of try-catch blocks, each trying to
open a different possible type. For example,

   try { Group subgroup = group.openGroup(name); /* do something*/ }
   catch(Exception const& ex) {}

   try { DataSet ds = group.openDataSet(name); /* do something*/ }
   catch(Exception const& ex) {}

Here, if openGroup succeeds, an openDataSet() attempt will still be performed
unless we use extra flags and if() conditions possibly with goto statements.

  An equivalent switch-case block is more readable and encloses a logical unit
of code that performs a well-defined function, namely, branching of control.

2. To address the above case, it might make sense to introduce different
iterators as in STL. For example, Group::group_iterator,
Group::dataset_iterator, DataSet::attribute_iterator (?)

   These iterators obviate the need to manually apply filters to identify each
child of a parent group. So if there is a need to identify just the datasets
at the current level, the Group::dataset_iterator would help.

ISSUE 3: WRITE API FOR DATASETS
-------------------------------------

1. Once a DataSet object is instantiated with a DataType and DataSpace, the
common-case write of the dataset would normally involve the same datatype
with which it was created. Why do we need to restate it during write()?
Understandably, this helps with conversions (I don't know much about
HDF5 conversions). If that is the case, ideally there should *also* be a
write() member function that takes just one parameter - the pointer to the
data buffer - because all other information, including the DataType, is
inferable from the dataset object. As a beginner I was perplexed until I
came across the "conversions" keyword.

   The same applies to read() - if the in-file DataType is not convertible to
the DataType of the DataSet object on which read() is being called, then that
would constitute an exception.

2. Writing strings is currently a little involved. There could be
convenience functions named "writeString", or even just "write", that take one
string argument. A beginner is faced with questions about fixed-length vs.
variable-length vs. character-array storage (with or without the trailing
'\0'?).

3. Similarly, writing single integers or floats could be supported via
functions named writeInt(), writeFloat(), writeUInt(), etc., which would be
useful for attributes and would hide PredType::NATIVE_INT from a beginner.
Also, I imagine NATIVE_<TYPE> is commonly used, so such convenience functions
would allow rapid development without a steep learning curve before first use.

4. Using type-traits template techniques along with partial
specialization, as in the STL and Boost libraries, it is possible to write
short, simple code that permits one polymorphic function, say

   template<typename T>
      void writeAtom(H5::Group & g, T const& t, string const& name);

  to write different common atomic types like float, int, string etc. To
illustrate this I am attaching .h and .cpp files where the functions
{write,read}_hdf5_scalar_attribute() are implemented in this way.

ISSUE 4: STANDARD API for COMPLEX TYPES
---------------------------------------

  It is quite common to use complex<float> or complex<double> in mathematical
calculations, so it would be nice to have predefined datatypes for these.
Since Fortran, C99 and C++ all support complex numbers with up to long-double
precision at the language level, HDF5 support would make life much easier.

ISSUE 5: H5File API
------------------------

1. Is there a requirement for CommonFG to be a base-class at all? Can't all
included operations be collapsed into just the Group class? To do this with a
file object, just retrieve the root group using file.openGroup("/") and then
work simply with groups. To annotate the H5File itself with meta-info,
provide a separate API. Class hierarchies should represent meaningful
relationships between parents and progeny. The root group in a file is not
the file itself and CommonFG is required only when we mix up the two
definitions (IIUC, IMHO).

2. The H5File constructor supports some H5F_ACC_* flags with which
H5File::open() fails. This is not documented in the Doxygen-generated
API reference. It forces me to place a whole bunch of code inside a try-catch
block, simply because the H5File object must now be created inside the block
instead of simply via the open() member function - and is therefore visible
only inside the try-catch block!

   IMHO, H5File should follow a model similar to ifstream and ofstream for the
open() and close() functions - while a constructor may perform an open(), an
open() should also be possible separately, with the same H5F_ACC_* flags.

Thanks,
Manoj Rajagopalan
PhD Candidate, EECS (CSE)
University of Michigan, Ann Arbor

Hi Manoj,

Hi HDF5-users,

  I am new to HDF5 but an experienced C++ programmer. Having worked with many
mature open-source libraries I note a few things about the HDF5 C++ API -
please correct me where I am wrong.

   I am aware that workarounds exist for all the issues I raise, but I am
simply trying to bring out, from experience, areas where I believe the current
HDF5 C++ API clashes with expectations and a certain ideal(ized) design
philosophy (IMHO).

   But before I start, please let me express my great appreciation for HDF5
as a scalable, cross-platform, open-source standard for large-volume
computational data storage and transfer, and my gratitude for making it
available as a free download.

  Thank you for spending the time to write such a detailed and valuable critique of the issues you see with the C++ wrappers, we really appreciate it!

  I've included comments below to address individual points, but I'd also like to introduce a new topic for discussion: how valuable are the current C++ wrappers to experienced C++ developers? I don't think they add much value, because the underlying C layer is reasonably object-oriented and is callable directly from C++. Would the user community be OK with deprecating them and opening the floor to a newer, community driven (and probably developed) set of C++ bindings?

  Quincey

ISSUE 1: Excessive/Inappropriate use of TRY-CATCH
-----------------------------------------------
  We are forced to use try-catch blocks like if-else blocks - there is a
conspicuous absence of query functions for checking whether a group or
dataset exists. Instead, we have to call the openGroup or openDataSet
functions and trap the exception if the call fails.

  There are a few issues that this creates:

1. It forces an alternate programming approach onto otherwise conventional,
more meaningful and readable code - exceptions are no longer just for
exceptional situations (e.g., does the absence of a dataset in an existence
query really constitute an exception, or just an expected failure case when
the producer and consumer of the data happen to be different?)

2. It disallows the use of compiler flags like -fno-exceptions in g++,
because the library depends on exceptions to guarantee correctness. Exceptions
cause the compiler to link a heavier runtime into the executable
and can have performance implications even if the actual code doesn't use
them (what the compiler can infer about our code from its static
analysis is limited). Therefore, by including HDF5 to store partial
calculation results in my nested loops, I am forced to switch
from -fno-exceptions to -fexceptions and risk introducing "excess baggage"
that I could previously shed. This has a cascade effect on my whole code.

  In a nutshell, introducing HDF5 into my code has caused a minor
architecture change to my whole code.

  Yes, I think we went a bit overboard with exceptions in the current C++ wrappers. :-) Do you have a suggestion for changing them to avoid exceptions?

ISSUE 2: ACCESSORS AND QUERIES FOR OBJECT (TYPES)
----------------------------------------------

1. The v1.6.x API allowed querying the type of an object. This allows
switch-case blocks to take an action on each sub-item of a group
depending on its type. For example, it is easy to write a (graphical) HDF5
file browser with such an API. IIUC, with v1.8.x, some functions like
CommonFG::getObjTypeByName() are deprecated, so achieving the above example
use case will now involve a whole bunch of try-catch blocks, each trying to
open a different possible type. For example,

  try { Group subgroup = group.openGroup(name); /* do something*/ }
  catch(Exception const& ex) {}

  try { DataSet ds = group.openDataSet(name); /* do something*/ }
  catch(Exception const& ex) {}

Here, if openGroup succeeds, an openDataSet() attempt will still be performed
unless we use extra flags and if() conditions possibly with goto statements.

An equivalent switch-case block is more readable and encloses a logical unit
of code that performs a well-defined function, namely, branching of control.

  Hmm, the new H5O* routines in the 1.8 release (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html) haven't been added to the C++ wrappers yet, but I think they should address your concerns here, particularly H5Oget_info_by_name (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5O.html#Object-GetInfoByName). You might also look at H5Lexists and H5Oexists_by_name (which is new for the 1.8.5 release and is not yet included in the online documentation).

2. To address the above case, it might make sense to introduce different
iterators as in STL. For example, Group::group_iterator,
Group::dataset_iterator, DataSet::attribute_iterator (?)

  These iterators obviate the need to manually apply filters to identify each
child of a parent group. So if there is a need to identify just the datasets
at the current level, the Group::dataset_iterator would help.

  I think that's an interesting and useful idea.

ISSUE 3: WRITE API FOR DATASETS
-------------------------------------

1. Once a DataSet object is instantiated with a DataType and DataSpace, the
common-case write of the dataset would normally involve the same datatype
with which it was created. Why do we need to restate it during write()?
Understandably, this helps with conversions (I don't know much about
HDF5 conversions). If that is the case, ideally there should *also* be a
write() member function that takes just one parameter - the pointer to the
data buffer - because all other information, including the DataType, is
inferable from the dataset object. As a beginner I was perplexed until I
came across the "conversions" keyword.

  The same applies to read() - if the in-file DataType is not convertible to
the DataType of the DataSet object on which read() is being called, then that
would constitute an exception.

  It might be nice to make this smoother, but an important aspect of HDF5 is the datatype conversions available.

2. Writing strings is currently a little involved. There could be
convenience functions named "writeString", or even just "write", that take one
string argument. A beginner is faced with questions about fixed-length vs.
variable-length vs. character-array storage (with or without the trailing
'\0'?).

3. Similarly, writing single integers or floats could be supported via
functions named writeInt(), writeFloat(), writeUInt(), etc., which would be
useful for attributes and would hide PredType::NATIVE_INT from a beginner.
Also, I imagine NATIVE_<TYPE> is commonly used, so such convenience functions
would allow rapid development without a steep learning curve before first use.

  Points taken, thanks! :-)

4. Using type-traits template techniques along with partial
specialization, as in the STL and Boost libraries, it is possible to write
short, simple code that permits one polymorphic function, say

  template<typename T>
     void writeAtom(H5::Group & g, T const& t, string const& name);

to write different common atomic types like float, int, string etc. To
illustrate this I am attaching .h and .cpp files where the functions
{write,read}_hdf5_scalar_attribute() are implemented in this way.

  Nifty, thanks again!

ISSUE 4: STANDARD API for COMPLEX TYPES
---------------------------------------

It is quite common to use complex<float> or complex<double> in mathematical
calculations, so it would be nice to have predefined datatypes for these.
Since Fortran, C99 and C++ all support complex numbers with up to long-double
precision at the language level, HDF5 support would make life much easier.

  We are planning to extend the predefined HDF5 datatypes to support complex datatypes in the 1.10.0 release. (Although, we haven't absolutely committed to this yet, since it's a fair bit of work)

ISSUE 5: H5File API
------------------------

1. Is there a requirement for CommonFG to be a base-class at all? Can't all
included operations be collapsed into just the Group class? To do this with a
file object, just retrieve the root group using file.openGroup("/") and then
work simply with groups. To annotate the H5File itself with meta-info,
provide a separate API. Class hierarchies should represent meaningful
relationships between parents and progeny. The root group in a file is not
the file itself and CommonFG is required only when we mix up the two
definitions (IIUC, IMHO).

  Hmm, I'm not certain why we implemented things this way, but you do have a good point.

2. The H5File constructor supports some H5F_ACC_* flags with which
H5File::open() fails. This is not documented in the Doxygen-generated
API reference. It forces me to place a whole bunch of code inside a try-catch
block, simply because the H5File object must now be created inside the block
instead of simply via the open() member function - and is therefore visible
only inside the try-catch block!

  IMHO, H5File should follow a model similar to ifstream and ofstream for the
open() and close() functions - while a constructor may perform an open(), an
open() should also be possible separately, with the same H5F_ACC_* flags.

  I think this is a bug, yes.


On Apr 26, 2010, at 1:00 PM, Manoj Rajagopalan wrote:

Thanks,
Manoj Rajagopalan
PhD Candidate, EECS (CSE)
University of Michigan, Ann Arbor

Hi Quincey,

  I've included comments below to address individual points, but I'd also like to introduce a new topic for discussion: how valuable are the current C++ wrappers to experienced C++ developers? I don't think they add much value, because the underlying C layer is reasonably object-oriented and is callable directly from C++. Would the user community be OK with deprecating them and opening the floor to a newer, community driven (and probably developed) set of C++ bindings?

  Quincey

One conceptual question with the C++ wrapper is how to use extension libraries for HDF5 that use the C interface. If you want to use the C++ API, but also some high-level C library on top of HDF5, then one would need to mix C API and C++ API usage, and it is better to just stay with the C API directly.

Technically, it should be ensured that the C++ wrapper is merely a wrapper, and introduces no runtime overhead and no functionality beyond improved syntax and semantic checking of the correct usage of HDF5 objects. Using type traits to map native C++ types or user-defined structures to HDF5 types and objects certainly needs to be part of it; that is just state of the art.

On exceptions, I was reluctant to use them for a long time as well, in particular because compilers had not been mature enough (catching/throwing exceptions across Windows DLLs has reportedly been a long-term problem), but this situation appears to have changed and they are safe to use nowadays. More than that, they are even important to use, since the standard library throws exceptions, for instance when out of memory during a new call, or as part of the standard template library. Exceptions have become unavoidable, and C++ code just has to be exception-safe, which also requires some reference-counting scheme using smart pointers, auto pointers and the like.

As a major anti-C++'ish design in the HDF5 library I see its iterator concept - it requires one to define a callback routine, instead of providing an iterator object that can be incremented and allows constructing a for(;;) loop, similar to STL iterators. But this is rooted in the HDF5 C library itself; if the HDF5 C++ wrapper tries to emulate iterators through the HDF5 C callback function, it cannot be efficient, so that would be something to expose at a deeper level.

On community-driven development of an HDF5 C++ library - it might be difficult, because there are just too many styles and flavors of C++ usage. It would require at least one lead developer, or a group of lead developers, who are consistent in their C++ style and usage.

It might possibly be beneficial to optionally compile all of the current HDF5 C code as C++ - just because C++ is more critical of many programming behaviors that C is sloppy about; various things that are mere warnings in C are errors in C++, so it could help catch some coding problems.

  Werner


On Tue, 27 Apr 2010 09:22:20 -0300, Quincey Koziol <koziol@hdfgroup.org> wrote:

--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Hi Quincey,

       I've included comments below to address individual points, but I'd
also like to introduce a new topic for discussion: how valuable are the
current C++ wrappers to experienced C++ developers? I don't think they add
much value, because the underlying C layer is reasonably object-oriented and
is callable directly from C++. Would the user community be OK with
deprecating them and opening the floor to a newer, community driven (and
probably developed) set of C++ bindings?

       Quincey

This is exactly the conclusion that I came to - another example of this kind
of problem is the MPI C++ bindings: they were deprecated because people
didn't use them, and the Boost MPI library became a popular alternative.

One conceptual question with the C++ wrapper is how to use extension
libraries for HDF5 that use the C interface. If you want to use the C++
API, but also some high-level C library on top of HDF5, then one would
need to mix C API and C++ API usage, and it is better to just stay
with the C API directly.

Technically, it should be ensured that the C++ wrapper is merely a wrapper,
and introduces no runtime overhead and no functionality beyond
improved syntax and semantic checking of the correct usage of HDF5 objects. Using
type traits to map native C++ types or user-defined structures to HDF5 types
and objects certainly needs to be part of it; that is just state of the art.

I disagree that it shouldn't add functionality; I think that, where possible,
we should map C++-like constructs over the top of HDF5. I'd like to see a
generator-type interface that can dynamically fill a dataset from a
function, potentially with the dataset larger than the amount of
memory on the node. Using HDF5 as the backend to a streaming interface such
as http://stxxl.sourceforge.net/, enabling easy out-of-core processing as a
drop-in replacement for in-memory data structures, would be a very attractive
feature.

On exceptions, I was reluctant to use them for a long time as well, in
particular because compilers had not been mature enough (catching/throwing
exceptions across Windows DLLs has reportedly been a long-term problem), but
this situation appears to have changed and they are safe to use nowadays.
More than that, they are even important to use, since the standard library
throws exceptions, for instance when out of memory during a new call, or as
part of the standard template library. Exceptions have become unavoidable,
and C++ code just has to be exception-safe, which also requires some
reference-counting scheme using smart pointers, auto pointers and the like.

Exceptions are key to the whole RAII idiom; however, it all depends
on your motivation for using C++. RAII is only one reason to use it; for
some people the type-traits machinery for automatic detection of datatypes is
motivation enough. It is possible to write a C++ API that provides only
that functionality and leaves resource management to the user, as in the
C interface.

As a major anti-C++'ish design in the HDF5 library I see its iterator
concept - it requires one to define a callback routine, instead of providing
an iterator object that can be incremented and allows constructing a for(;;)
loop, similar to STL iterators. But this is rooted in the HDF5 C library
itself; if the HDF5 C++ wrapper tries to emulate iterators through the HDF5 C
callback function, it cannot be efficient, so that would be something to
expose at a deeper level.

On community-driven development of an HDF5 C++ library - it might be
difficult, because there are just too many styles and flavors of C++ usage.
It would require at least one lead developer, or a group of lead developers,
who are consistent in their C++ style and usage.

I did try to spark some interest around this a couple of months back; my
code is available at http://github.com/jsharpe/hdf5. It'd be great to get a
few more users of the library to help develop the interface. My current
development is driven by the requirements of a CFD solver; in particular, in
parallel we will be doing independent IO, so the API won't necessarily be
sufficient for collective IO.

I'd suggest that rather than trying to have a single lead developer, we use
tools like GitHub to manage forks and branches and iterate on the design by
getting people to actually use it in their applications; the best interfaces
will evolve when they fit the use cases of the people who will actually use
the API.

It might possibly be beneficial to optionally compile all of the current
HDF5 C code as C++ - just because C++ is more critical of many programming
behaviors that C is sloppy about; various things that are mere warnings in C
are errors in C++, so it could help catch some coding problems.

There is little point to this; as long as we can include hdf5.h in a
C++-compiled module without errors, that is sufficient.

James


On 27 April 2010 13:41, Werner Benger <werner@cct.lsu.edu> wrote:

On Tue, 27 Apr 2010 09:22:20 -0300, Quincey Koziol <koziol@hdfgroup.org> wrote: