Datatype sizes and object headers

Hi,

I'm writing because several PyTables users have complained about an
issue in HDF5. It apparently boils down to HDF5 needing a lot of space
to represent large datatypes. For example, when creating a compound
datatype with a single member of ARRAY type, once the array length
exceeds a certain value the library gives the following error:

  #005: ../../../src/H5O.c line 2204 in H5O_new_mesg(): object header
message is too large (16k max)
    major(12): Object header layer
    minor(29): Unable to initialize object

[This was using HDF5 1.6.5]

Apparently the ARRAY type is not represented by storing just its base
definition and length; instead, the full type description seems to be
replicated in the object header (otherwise I cannot explain the error
above).

Also, different HDF5 versions seem to have different limits for the
object header message. With 1.6.5 the limit seems to be 64k (despite
the error message shown above, which claims 16k), but with HDF5 1.6.6
(and some users reported the same behaviour with 1.8.0) the datatypes
can be much larger (122k or more). With 1.8.2 the limit seems to be
64k again. I'm curious why this limit changes from version to version.

For your reference, here is the kind of datatype that triggers the
problem (h5ls output):

Type: struct {
                "signal" +0 [32768] native signed char
           } 32768 bytes

HDF5 can create an ARRAY type with 2^15 elements (32768). However,
with 2^16 elements, some versions (not all) of HDF5 cannot.
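
For reference, this is roughly how such a type is built through the C
API (a minimal sketch using the 1.8 names; the helper name is only
illustrative):

#include "hdf5.h"

/* Build the compound type shown in the h5ls output above: a single
 * member "signal" that is an ARRAY of nelem 8-bit integers.  With
 * nelem = 65536 (2^16) this gives the type size discussed here. */
hid_t make_signal_type(hsize_t nelem)
{
    hsize_t dims[1] = { nelem };
    hid_t arr = H5Tarray_create2(H5T_STD_I8LE, 1, dims);    /* 1.6 series: H5Tarray_create */
    hid_t cmp = H5Tcreate(H5T_COMPOUND, H5Tget_size(arr));  /* compound size = nelem bytes */
    H5Tinsert(cmp, "signal", 0, arr);
    H5Tclose(arr);
    return cmp;
}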

My question is: is there a way to avoid this problem and define
potentially large ARRAY types without exceeding the object header
capacity? If that is not possible, is there a way to query the HDF5
library for the maximum type size, so that I can raise a specific
Python exception whenever this happens?

Thanks,

--
Francesc Alted


Hi Francesc,

  That is possibly odd - are you creating a dataset or an attribute with this datatype? If you are creating a dataset, then this may be a bug. If you are creating an attribute, then the attribute needs to reserve room for the element(s) of its datatype when it's created, so very large datatypes may not be able to fit in the object header.

  Quincey

P.S. - We should definitely change the error message to reflect that the limit is ~64k, not 16k...


On Thursday 26 February 2009, Quincey Koziol wrote:

  That is possibly odd - are you creating a dataset or an attribute
with this datatype? If you are creating a dataset, then this may be
a bug. If you are creating an attribute, then the attribute needs to
reserve room for the element(s) of its datatype when it's created, so
very large datatypes may not be able to fit in the object header.

No, datasets. Here is the dump of a not-so-large dataset that HDF5
does allow me to create:

   DATASET "test" {
      DATATYPE H5T_COMPOUND {
         H5T_ARRAY { [32768] H5T_STD_I8LE } "signal";
      }
      DATASPACE SIMPLE { ( 0 ) / ( H5S_UNLIMITED ) }
      DATA {
      }

P.S. - We should definitely change the error message to reflect that
the limit is ~64k, not 16k...

Well, that message appears with the HDF5 1.6 series. For example, with HDF5
1.8.2 the error is:

  #007: H5O.c line 2531 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: H5Doh.c line 278 in H5O_dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: H5Dint.c line 1139 in H5D_create(): can't update the metadata
cache
    major: Dataset
    minor: Unable to initialize object
  #010: H5Dint.c line 846 in H5D_update_oh_info(): unable to update new
fill value header message
    major: Dataset
    minor: Unable to initialize object
  #011: H5Omessage.c line 188 in H5O_msg_append_oh(): unable to create
new message in header
    major: Attribute
    minor: Unable to insert object
  #012: H5Omessage.c line 228 in H5O_msg_append_real(): unable to create
new message
    major: Object header
    minor: No space available for allocation
  #013: H5Omessage.c line 1936 in H5O_msg_alloc(): unable to allocate
space for message
    major: Object header
    minor: Unable to initialize object
  #014: H5Oalloc.c line 972 in H5O_alloc(): object header message is too
large
    major: Object header
    minor: Unable to initialize object

So, by not stating a specific object header message limit in the error
message you have fixed that problem, although I'd definitely prefer a
numerical hint about the maximum size allowed ;-)

Thanks,

--
Francesc Alted

Hi Francesc,

On Feb 26, 2009, at 7:25 AM, Francesc Alted wrote:

No, datasets. Here is the dump of a not-so-large dataset that HDF5
does allow me to create:

  DATASET "test" {
     DATATYPE H5T_COMPOUND {
        H5T_ARRAY { [32768] H5T_STD_I8LE } "signal";
     }
     DATASPACE SIMPLE { ( 0 ) / ( H5S_UNLIMITED ) }
     DATA {
     }

  Hmm, can you send a sample program (in C) that fails?

So, by not stating a specific object header message limit in the error
message you have fixed that problem, although I'd definitely prefer a
numerical hint about the maximum size allowed ;-)

  Ah, good. :-)

    Quincey

Quincey,

On Thursday 26 February 2009, Quincey Koziol wrote:

  Hmm, can you send a sample program (in C) that fails?

Attached. While trying to reproduce it, I think I figured out what
happens. It turns out that PyTables always saves default values (via
the H5Pset_fill_value() call), and it is precisely this data that does
not fit in the object header.

Well, I suppose the immediate fix on my side is to stop writing default
values by default (no pun intended). However, this is going to be
difficult, as people may already rely on the existing defaults.
Mmh... Ideas?

Thanks,

large_datatype_with_defaults.c (1.98 KB)
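
(The attachment itself is not reproduced in this archive. The following
is only a rough sketch of a program along those lines, with illustrative
names and sizes, not the actual large_datatype_with_defaults.c.)

#include <stdio.h>
#include <stdlib.h>
#include "hdf5.h"

#define NELEM 65536   /* 2^16 elements in the ARRAY member */

int main(void)
{
    hsize_t tdims[1]   = { NELEM };
    hsize_t dims[1]    = { 0 };
    hsize_t maxdims[1] = { H5S_UNLIMITED };
    hsize_t chunk[1]   = { 16 };

    hid_t file = H5Fcreate("large_type.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* compound { "signal": int8[NELEM] }, as in the h5dump output
     * earlier in the thread */
    hid_t arr = H5Tarray_create2(H5T_STD_I8LE, 1, tdims);
    hid_t cmp = H5Tcreate(H5T_COMPOUND, H5Tget_size(arr));
    H5Tinsert(cmp, "signal", 0, arr);

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    /* PyTables-style defaults: one full compound element (NELEM bytes).
     * The fill value is stored as a message in the dataset's object
     * header, which is what overflows once NELEM grows large enough. */
    char *fill = calloc(1, H5Tget_size(cmp));
    H5Pset_fill_value(dcpl, cmp, fill);

    hid_t dset = H5Dcreate2(file, "test", cmp, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    if (dset < 0)
        fprintf(stderr, "dataset creation failed (object header overflow?)\n");
    else
        H5Dclose(dset);

    free(fill);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(cmp);
    H5Tclose(arr);
    H5Fclose(file);
    return 0;
}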

--
Francesc Alted

Hi Francesc,

On Feb 26, 2009, at 11:27 AM, Francesc Alted wrote:

Attached. While trying to reproduce it, I think I figured out what
happens. It turns out that PyTables always saves default values (via
the H5Pset_fill_value() call), and it is precisely this data that does
not fit in the object header.

Well, I suppose the immediate fix on my side is to stop writing default
values by default (no pun intended). However, this is going to be
difficult, as people may already rely on the existing defaults.
Mmh... Ideas?

  Ah, tying it to the fill values would make sense. Hmm, we should log this as a bug and address it, since it should be possible for applications to perform this operation.

  Thanks,
    Quincey

Francesc,

Quoting Francesc Alted <faltet@pytables.org>:

Attached. While trying to reproduce it, I think I figured out what
happens. It turns out that PyTables always saves default values (via
the H5Pset_fill_value() call), and it is precisely this data that does
not fit in the object header.

Well, I suppose the immediate fix on my side is to stop writing default
values by default (no pun intended). However, this is going to be
difficult, as people may already rely on the existing defaults.
Mmh... Ideas?

Have you tried using only H5Pset_fill_time (with H5D_FILL_TIME_ALLOC), and not calling H5Pset_fill_value? This of course assumes that the library default (0) is good enough.
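
In code, that suggestion would look roughly like this (a minimal
sketch; the helper name, dataset name and chunk size are made up, and
cmp_type/space stand for the compound datatype and unlimited dataspace
from the earlier messages):

#include "hdf5.h"

/* Create the dataset without an explicit fill value: only ask that the
 * library default fill (all zeroes) be written when chunks are
 * allocated.  No datatype-sized fill value has to be stored in the
 * object header. */
hid_t create_zero_filled(hid_t file, hid_t cmp_type, hid_t space)
{
    hsize_t chunk[1] = { 16 };
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC);   /* no H5Pset_fill_value() */
    hid_t dset = H5Dcreate2(file, "test", cmp_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    return dset;
}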

Thanks,
-Neil Fortner


On Thursday 26 February 2009, Quincey Koziol wrote:

> Well, I suppose the immediate fix on my side is to stop writing
> default values by default (no pun intended). However, this is going
> to be difficult, as people may already rely on the existing defaults.
> Mmh... Ideas?

  Ah, tying it to the fill values would make sense. Hmm, we should
log this as a bug and address it, since it should be possible for
applications to perform this operation.

I've been thinking about it, and one possibility would be to enable a
conversion path in HDF5 for setting the defaults of ARRAY types from a
single scalar value. This way, H5Pset_fill_value() could be used even
with huge ARRAY types, with the limitation that a single default value
(the scalar) would apply to the complete ARRAY type.

In fact, the PyTables machinery already allows this. If an array is
found among the defaults, its values are used as the default. If a
scalar is found, its value is broadcast over the whole ARRAY type. It
would be really cool if HDF5 allowed something similar.

I was curious to see whether this conversion path (broadcasting a
scalar value to an array) already exists in HDF5, but it does not seem
so. The attached test gives the following error:

HDF5-DIAG: Error detected in HDF5 (1.8.2) thread 0:
  #000: H5Ddeprec.c line 170 in H5Dcreate1(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 429 in H5D_create_named(): unable to create and
link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1633 in H5L_link_object(): unable to create new link
to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1856 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 877 in H5G_traverse(): internal path
traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 703 in H5G_traverse_real(): traversal
operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1679 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: H5O.c line 2531 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: H5Doh.c line 278 in H5O_dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: H5Dint.c line 1139 in H5D_create(): can't update the metadata
cache
    major: Dataset
    minor: Unable to initialize object
  #010: H5Dint.c line 805 in H5D_update_oh_info(): unable to convert
fill value to dataset type
    major: Dataset
    minor: Unable to initialize object
  #011: H5Ofill.c line 937 in H5O_fill_convert(): unable to convert
between src and dst datatypes
    major: Datatype
    minor: Unable to initialize object
  #012: H5T.c line 4400 in H5T_path_find(): no appropriate function for
conversion path
    major: Datatype
    minor: Unable to initialize object
HDF5-DIAG: Error detected in HDF5 (1.8.2) thread 0:
  #000: H5D.c line 379 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type

Regards,

large_datatype_with_scalar_defaults.c (2.3 KB)
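
(Again, the attachment is not reproduced here; judging from the trace
it does something along these lines. The sketch below is illustrative,
not the actual large_datatype_with_scalar_defaults.c: the fill value is
handed to H5Pset_fill_value() with a scalar datatype, so at dataset
creation time the library has to look for a scalar-to-compound
conversion path, which does not exist.)

#include "hdf5.h"

/* Set the fill value as a plain scalar and let the library try to
 * convert it to the compound-of-array dataset type when the dataset is
 * created.  No such conversion function is registered, hence the
 * "no appropriate function for conversion path" error above. */
hid_t create_with_scalar_default(hid_t file, hid_t cmp_type, hid_t space)
{
    hsize_t chunk[1] = { 16 };
    signed char default_value = 1;    /* the value meant to be broadcast */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_fill_value(dcpl, H5T_NATIVE_SCHAR, &default_value);
    return H5Dcreate2(file, "test", cmp_type, space,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);
}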

--
Francesc Alted

Hi Neil,

On Thursday 26 February 2009, nfortne2@hdfgroup.org wrote:

> Well, I suppose the immediate fix on my side is to stop writing
> default values by default (no pun intended). However, this is going
> to be difficult, as people may already rely on the existing defaults.
> Mmh... Ideas?

Have you tried using only H5Pset_fill_time (with
H5D_FILL_TIME_ALLOC), and not calling H5Pset_fill_value? This of
course assumes that the library default (0) is good enough.

Yeah, but I'd like to keep default values other than 0.

Thanks anyway,

--
Francesc Alted


On Thursday 26 February 2009, nfortne2@hdfgroup.org wrote:

> Well, I suppose the immediate fix on my side is to stop writing
> default values by default (no pun intended). However, this is going
> to be difficult, as people may already rely on the existing defaults.
> Mmh... Ideas?

Have you tried using only H5Pset_fill_time (with
H5D_FILL_TIME_ALLOC), and not calling H5Pset_fill_value? This of
course assumes that the library default (0) is good enough.

Hmm, after re-thinking this suggestion, I have now implemented a way
to create datasets with very large datatypes and zero defaults, and it
works pretty well! Although this is more of a workaround than a real
solution, it at least lets people create such special datasets with
PyTables.

Many thanks Neil!

--
Francesc Alted

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.