C++: group or link?

Hi,

   I am using the C++ API and am trying to figure out how to
find out whether an object that could be either a group or a
symbolic link really is one or the other. Unfortunately, it
seems that all the functions that do that (like
H5::CommonFG::getObjinfo()) are deprecated in 1.8 (and I would
prefer not to use any deprecated features). At least I haven't
yet been able to find any function/method that tells me whether
an object I have a name for is a link. Perhaps I overlooked it,
so I would be grateful for some hints.

At the moment I have thus resorted to just calling getLinkval()
and catching the error one gets if the name passed to it isn't
a link. That works, but I noticed something strange: since I
didn't have a good idea what the second 'size' argument is
supposed to be good for, I simply left it out, relying on the
default value of 0 (which I guess is supposed to indicate:
give me the whole name of what the link is pointing to). But
then I observed that the amount of memory consumed by the
program jumps up with each call of the function, with
increments on the order of 800 kB to about 7.5 MB. And that
memory never seems to get deallocated. This doesn't happen
when, instead of leaving the argument out (or passing 0), I
use a fixed value - in that case memory consumption doesn't
change at all. This, of course, has the disadvantage that
things will go badly wrong if I don't have a good guess at
the upper bound of the length of the name linked to...

Finally, there's another thing perhaps someone can help me
with: I tried to create some 120,000 1D data sets, each about
200 bytes large and each in its own group. This resulted
in a huge overhead in the file: instead of the expected file
size of around 24 MB (plus a bit for overhead, of course) the
files were about 10 times larger than expected. Using a number
(30) of 2D data sets (with 4000 rows each) took care of this,
but I am curious why this makes such a big difference.

                   Thanks and best regards, Jens

···

--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

Hi Jens,

> I am using the C++ API and trying to figure out how to find
> out what an object that could be either a group or a symbolic
> link really is. [...] At least I haven't been able yet to find
> any function/method that tells me if an object that I got a
> name for is a link.
>
> At the moment I thus resorted to just calling getLinkval()
> and catching the error one gets if the name passed to it isn't
> a link. [...]

  You probably want to use H5Lexists(), along with H5Oget_info() to check this sort of thing.

> Finally, there's another thing perhaps someone can help me
> with: I tried to create some 120,000 1D data sets, each about
> 200 bytes large and each in its own group. [...] instead of
> the expected file size of around 24 MB the files were about
> 10 times larger than expected.

  Did you create them as chunked datasets? And, what were the dimensions of the chunk sizes you used?

  Quincey

···

On Oct 10, 2010, at 5:12 PM, Jens Thoms Toerring wrote:

Hi Quincey,

  thanks for taking the time answering!

> I am using the C++ API and trying to figure out how to
> find out what an object that could be either a group or a
> symbolic link really is. Unfortunately it seems that all
> the functions that do that (like H5::CommonFG::getObjinfo())
> are deprecated in 1.8 (and I would prefer not to use any
> deprecated features). At least I haven't been able yet to
> find any function/method that tells me if an object that
> I got a name for is a link. Perhaps I did overlook it,
> so I would be grateful for some hints.
>
> At the moment I thus resorted to just calling getLinkval()
> and catch the error one gets if the name passed to it isn't
> a link. That works but I noticed something strange: since I
> didn't have a good idea what the second 'size' argument is
> supposed to be good for I simply left it out, relying on the
> default value of 0 (which I guess is supposed to indicate:
> give me the whole name of what the link is pointing to). But
> then i observed that the amount of memory consumed by the
> program jumps up with each call of the function with incre-
> ments in the order of 800 kB to about 7.5 MB. And that memory
> never seems to become deallocated. This doesn't happen when
> instead of leaving the argument out (or passing 0) I use a
> fixed value - in that case memory consumption doesn't change
> at all. This, of course, has the disadvantage that things
> will go wrong badly if I haven't a good guess on the upper
> bould of the length of the name linked to...

> You probably want to use H5Lexists(), along with H5Oget_info() to check this sort of thing.

So I guess there are no functions left in the C++ API for that
kind of thing? And I unfortunately haven't been able to figure
out how to use the functions you mention. H5Lexists() returns
TRUE regardless of whether the object I am interested in is a
group or a link. And I don't see anything in what I obtain from
H5Oget_info() (or H5Oget_info_by_name()) that would help me.
The 'type' field in the H5O_info_t structure doesn't seem to
tell me if this is a soft link or a group, and I have also
found nothing that would indicate how long the resulting name
is when following the link, which seems to be needed for the
getLinkval() call (the default of 0 looking like it results in
a memory leak). I'm probably missing the obvious, but at the
moment I have no ideas left how to proceed.

> Finally, there's another thing perhaps someone can help me
> with: I tried to create some 120.000 1D data sets, about
> 200 bytes large and each in it's own group. This resulted
> in a huge overhead in the file: instead of the expected file
> size of arond 24 MB (of course plus a bit for overhead) the
> files were about 10 times larger than expected. Using a number
> (30) of 2D data sets (with 4000 rows) took care of this but I
> am curious why this makes such a big difference.

> Did you create them as chunked datasets? And, what were the
> dimensions of the chunk sizes you used?

No, those were simple 1-dimensional data sets, written out in a
single call immediately after creation and then closed. Perhaps
having them all in their own group makes a difference? What I
noticed was that h5dump on the resulting file told me under
Storage information/Groups that about 140 MB were used for
B-tree/List...
                       Thanks and best regards, Jens

···

On Mon, Oct 11, 2010 at 08:04:28AM -0500, Quincey Koziol wrote:

On Oct 10, 2010, at 5:12 PM, Jens Thoms Toerring wrote:

--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

Hi Jens,

> Hi Quincey,
>
> thanks for taking the time answering!

>> I am using the C++ API and trying to figure out how to find
>> out what an object that could be either a group or a symbolic
>> link really is. [...]
>>
>> At the moment I thus resorted to just calling getLinkval()
>> and catching the error one gets if the name passed to it
>> isn't a link. [...]

>> You probably want to use H5Lexists(), along with H5Oget_info() to check this sort of thing.

> So I guess there are no functions in the C++ API left for that
> kind of thing? [...] The 'type' field in the H5O_info_t
> structure doesn't seem to tell me if this is a soft link or a
> group, and I also have found nothing that would indicate how
> long the resulting name is when following the link, which
> seems to be needed for the getLinkval() call (the default of 0
> looking like resulting in a memory leak).

  Hmm, sorry, I missed that you wanted to check for a soft link. You need to use H5Lget_info() for that.

>> Finally, there's another thing perhaps someone can help me
>> with: I tried to create some 120,000 1D data sets, each about
>> 200 bytes large and each in its own group. [...]

>> Did you create them as chunked datasets? And, what were the
>> dimensions of the chunk sizes you used?

> No, those were simple 1-dimensional data sets, written out in
> a single call immediately after creation and then closed.
> Perhaps having them all in their own group makes a difference?
> What I noticed was that h5dump on the resulting file told me
> under Storage information/Groups that for B-tree/List about
> 140 MB were used...

  This is very weird, can you send a sample program that shows this result?

  Thanks,
    Quincey

···

On Oct 12, 2010, at 2:24 PM, Jens Thoms Toerring wrote:

On Mon, Oct 11, 2010 at 08:04:28AM -0500, Quincey Koziol wrote:

On Oct 10, 2010, at 5:12 PM, Jens Thoms Toerring wrote:

Hi Quincey,

> Hmm, sorry, I missed that you wanted to check for a soft link.
> You need to use H5Lget_info() for that.

Entirely my fault, since I now realize that I didn't say
explicitly in my original post that it's about a symbolic link.
And thanks a lot, H5Lget_info() looks exactly like what I was
looking for!

>>> Finally, there's another thing perhaps someone can help me
>>> with: I tried to create some 120,000 1D data sets, each
>>> about 200 bytes large and each in its own group. [...]
>>
>> Did you create them as chunked datasets? And, what were the
>> dimensions of the chunk sizes you used?
>
> No, those were simple 1-dimensional data sets, written out in a
> single call immediately after creation and then closed. [...]
> What I noticed was that h5dump on the resulting file told me
> under Storage information/Groups that for B-tree/List about
> 140 MB were used...

> This is very weird, can you send a sample program that shows this result?

As usual this happened within a larger program ;-( I will try
to cobble something together that does the same (I hope I have
the version that exhibited the problem somewhere in my version
control system and can just strip it down enough). Please give
me a bit of time, it may take a day or even a bit more...

                Thank you very much and best regards, Jens

···

On Tue, Oct 12, 2010 at 02:28:41PM -0500, Quincey Koziol wrote:
--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

Hi Quincey,

>>> Finally, there's another thing perhaps someone can help me
>>> with: I tried to create some 120,000 1D data sets, each
>>> about 200 bytes large and each in its own group. [...]
>>
>> Did you create them as chunked datasets? And, what were the
>> dimensions of the chunk sizes you used?
>
> No, those were simple 1-dimensional data sets, written out in a
> single call immediately after creation and then closed. [...]

> This is very weird, can you send a sample program that shows this result?

Here's a stripped-down version of my original program: it now
just creates 100,000 datasets with 5 doubles each, each within
its own group. The amount of "real" data, including strings for
group and dataset names, should be about 5 MB, but the file I
get with HDF5, version 1.8.5, is nearly 144 MB large. I expect
a certain amount of overhead, of course, but that ratio was a
bit astonishing ;-)

If I leave out the creation of the datasets (i.e. just create
100,000 groups) the size of the file drops to about 80 MB,
so creating a single group seems to "cost" about 800 bytes.
Creating just 100,000 datasets (without groups) seems to be
less expensive; here the overhead seems to be on the order
of 350 bytes per dataset. Does that seem reasonable to you?

                            Best regards, Jens

------------- h5_test.cpp ----------------------------------------

#include <iostream>
#include <sstream>
#include <stack>
#include <vector>
#include <string>
#include "H5Cpp.h"

using namespace std;
using namespace H5;

class HDF5Writer {

  public:

    HDF5Writer( H5std_string const & fileName )
    {
        m_file = new H5File( fileName, H5F_ACC_TRUNC );
        m_group = new Group( m_file->openGroup( "/" ) );
    }

    ~HDF5Writer( )
    {
        while ( ! m_group_stack.empty( ) )
            closeGroup( );
        m_group->close( );
        delete m_group;
        m_file->close( );
        delete m_file;
    }

    void createGroup( H5std_string const & name)
    {
        m_group_stack.push( m_group );
        m_group = new Group( m_group->createGroup( name ) );
    }

    void closeGroup( )
    {
        m_group->close( );
        delete m_group;
        m_group = m_group_stack.top( );
        m_group_stack.pop( );
    }

    void writeVector( H5std_string const & name,
                      vector< double > const & data )
    {
        hsize_t dim[ ] = { data.size( ) };
        DataSpace dataspace( 1, dim );
        DataSet dataset( m_group->createDataSet( name, PredType::IEEE_F64LE,
                                                 dataspace ) );
        dataset.write( &data.front( ), PredType::NATIVE_DOUBLE );
        dataset.close( );
        dataspace.close( );
    }

  private:

    H5File * m_file;
    Group * m_group;
    stack< Group * > m_group_stack;
};

int main( )
{
    HDF5Writer w( "test.h5" );
    vector< double > arr( 5, 0 );
                
    for ( size_t i = 0; i < 100000; i++ )
    {
        ostringstream cname;
        cname << "g" << i;
        w.createGroup( cname.str( ) );
        w.writeVector( "d", arr );
        w.closeGroup( );
    }
}

···

On Tue, Oct 12, 2010 at 02:28:41PM -0500, Quincey Koziol wrote:

On Oct 12, 2010, at 2:24 PM, Jens Thoms Toerring wrote:

--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

Hi Quincey,

> Hmm, sorry, I missed that you wanted to check for a soft link. You
> need to use H5Lget_info() for that.

Sorry for bothering you again: is it the expected behaviour of
H5Lget_info() that it reports the type as H5L_TYPE_HARD in the
H5L_info_t structure if what is passed as the 'link_name'
argument is actually just a group? I was expecting to get back
a negative return value, but that only seems to happen if a
group or (soft) link with the specified name does not exist.
May I conclude that a group is basically a hard link?

                     Thanks and best regards, Jens

···

On Tue, Oct 12, 2010 at 02:28:41PM -0500, Quincey Koziol wrote:

--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

Hi Jens,

> Hi Quincey,
>
> [... earlier exchange about the 120,000 small datasets snipped ...]
>
>> This is very weird, can you send a sample program that shows this result?

> Here's a stripped-down version of my original program: it now
> just creates 100,000 datasets with 5 doubles each, each within
> its own group. The amount of "real" data, including strings
> for group and dataset names, should be about 5 MB, but the
> file I get with HDF5, version 1.8.5, is nearly 144 MB large.
>
> If I leave out the creation of the datasets (i.e. just create
> 100,000 groups) the size of the file drops to about 80 MB,
> so creating a single group seems to "cost" about 800 bytes.

  About what I'd expect.

> Creating just 100,000 datasets (without groups) seems to be
> less expensive; here the overhead seems to be on the order
> of 350 bytes per dataset. Does that seem reasonable to you?

  That sounds approximately correct also.

  Adding those two numbers together gives me ~115MB. Plus 100,000 * 5 * 8 bytes (for the raw data) brings things up to ~120MB. So there's approximately 24MB "missing" from the equation somewhere. (Dark metadata! :-)

  Pointing h5stat with the "-f -F -g -G -d -D -T -A -s -S" options at the file produced gives only 16488 bytes of unaccounted-for space, so not very much space has been wasted due to internal free-space fragmentation. There's 27,200,000 bytes of space used for dataset object headers, right around the ~300 bytes per dataset you mention, so that's OK. There's 95,291,840 bytes of B-tree information and 13,441,824 bytes of heap information for groups (~1087 bytes per group), which is above the 800 bytes per group that you mention and accounts for the missing space in the file.

  Changing your HDF5Writer constructor to be this:

    HDF5Writer( H5std_string const & fileName )
    {
        hid_t fapl = H5Pcreate( H5P_FILE_ACCESS );
        H5Pset_libver_bounds( fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST );
        FileAccPropList FileAccessPList( fapl );
        m_file = new H5File( fileName, H5F_ACC_TRUNC,
                             FileCreatPropList::DEFAULT, FileAccessPList );
        m_group = new Group( m_file->openGroup( "/" ) );
    }

  (which enables the "latest/latest" option via H5Pset_libver_bounds()) gives a file that is only 50MB, with 41543 bytes of unaccounted space and only ~177 bytes of metadata per group (although a bit more for the dataset objects at ~284 bytes each, curiously). That's probably a good option for you here, and you could tweak it down further, if you wanted, with the H5Pset_link_phase_change and H5Pset_est_link_info calls. The one drawback of using this option is that the files created will only be readable by the 1.8.x releases of the library.

  Quincey

···

On Oct 14, 2010, at 9:43 AM, Jens Thoms Toerring wrote:

On Tue, Oct 12, 2010 at 02:28:41PM -0500, Quincey Koziol wrote:

On Oct 12, 2010, at 2:24 PM, Jens Thoms Toerring wrote:

> [... Jens's question and sample program (quoted in full above) snipped ...]

My understanding is that the creation of an object within a group (such as another group or a dataset) inherently creates a hard link from that group to the other object. So if you pass that group or dataset name as the "link name" argument to the H5Lget_info function it should return H5L_TYPE_HARD, as you've stated.

If, however, you've explicitly created a soft link to another group using the H5Lcreate_soft function and pass _that_ name to the H5Lget_info function, the type will be listed as H5L_TYPE_SOFT.
The following documentation helped clarify HDF5 links for me:
http://www.hdfgroup.org/HDF5/doc/UG/UG_frame09Groups.html

The following source code should clarify what I'm trying to say:

#include "hdf5.h"
#include <iostream>
using std::cout;
using std::endl;

void printLinkInfo(H5L_type_t type)
{
  switch (type)
  {
  case H5L_TYPE_HARD: cout << "H5L_TYPE_HARD"; break;
  case H5L_TYPE_SOFT: cout << "H5L_TYPE_SOFT"; break;
  case H5L_TYPE_EXTERNAL: cout << "H5L_TYPE_EXTERNAL"; break;
  case H5L_TYPE_ERROR: cout << "H5L_TYPE_ERROR"; break;
  }
}

int main()
{
  const char* main_file = "hard_vs_soft.h5";
  hid_t file = H5Fcreate(main_file, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

  hid_t g1 = H5Gcreate(file, "/G1", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  hid_t g2 = H5Gcreate( g1, "G2", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  hid_t g3 = H5Gcreate( g1, "G3", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  hid_t g4 = H5Gcreate(file, "/G4", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  herr_t status = H5Lcreate_soft("/G1/G3", g4, "G5_soft", H5P_DEFAULT, H5P_DEFAULT);
  {
    // test one hard and one soft link to make sure...
    H5L_info_t info;
    status = H5Lget_info(g1, "G2", &info, H5P_DEFAULT);
    cout << "/G1/G2 is "; printLinkInfo(info.type); cout << endl;
    status = H5Lget_info(g4, "G5_soft", &info, H5P_DEFAULT);
    cout << "/G4/G5_soft is "; printLinkInfo(info.type); cout << endl;
  }
  status = H5Gclose(g4);
  status = H5Gclose(g3);
  status = H5Gclose(g2);
  status = H5Gclose(g1);

  status = H5Fclose(file);

  return 0;
}

Richard.

Jens Thoms Toerring wrote:

> Sorry for bothering you again: is it the expected behaviour of
> H5Lget_info() that it reports the type as H5L_TYPE_HARD in the
> H5L_info_t structure if what is passed as the 'link_name'
> argument is actually just a group? [...] May I conclude that a
> group is basically a hard link?

Hi Quincey,

> If I leave out the creation of the datasets (i.e. just create
> 100,000 groups) the size of the file drops to about 80 MB,
> so creating a single group seems to "cost" about 800 bytes.

> About what I'd expect.

> Creating just 100,000 datasets (without groups) seems to be
> less expensive; here the overhead seems to be on the order
> of 350 bytes per dataset. Does that seem reasonable to you?

> That sounds approximately correct also.

> Adding those two numbers together gives me ~115MB. Plus
> 100,000 * 5 * 8 bytes (for the raw data) brings things up to
> ~120MB. [...] There's 95,291,840 bytes of B-tree information
> and 13,441,824 bytes of heap information for groups (~1087
> bytes per group), which is above the 800 bytes per group that
> you mention and accounts for the missing space in the file.

Thanks, I see. I hadn't expected that creating a group or
dataset would require that much space in the file. In the
future I will try to avoid using excessive amounts of them ;-)

> Changing your HDF5Writer constructor [code snipped] (which
> enables the "latest/latest" option of H5Pset_libver_bounds())
> gives a file that is only 50MB [...]. That's probably a good
> option for you here, and you could tweak it down further, if
> you wanted, with the H5Pset_link_phase_change and
> H5Pset_est_link_info calls. The one drawback of using this
> option is that the files created will only be readable by the
> 1.8.x releases of the library.

Thank you for the tips! I guess it's no problem if my program
only supports files written with 1.8.x, so I will probably use
just that.
                   Thank you very much and best regards, Jens

···

On Fri, Oct 15, 2010 at 06:08:19PM -0500, Quincey Koziol wrote:
--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de

Hi Richard,

> My understanding is that the creation of an object within a
> group (such as another group or dataset) inherently creates a
> hard link from that group to the other object. [...]
>
> If, however, you've explicitly created a soft link to another
> group using the H5Lcreate_soft function and pass _that_ name
> to the H5Lget_info function, the type will be listed as
> H5L_TYPE_SOFT.

Thank you, and yes, that clarifies a number of things. I
laboured under the misconception that only links created
explicitly via H5Lcreate_*() would result in successful calls
of H5Lget_info(). But then I am still a complete newbie to
HDF5 and struggle with the very basics ;-)

> The following source code should clarify what I'm trying to say:

<code snipped for brevity's sake>

I hope I understand now what's going on...

                   Thank you and best regards, Jens

···

On Wed, Oct 13, 2010 at 09:42:40AM +1100, Richard Khoury wrote:
--
  \ Jens Thoms Toerring ________ jt@toerring.de
   \_______________________________ http://toerring.de