Problem reading version 1.8 HDF5 files using file format specification document. Clarification needed.


#1

Dear HDF5 experts,

I am reading the HDF5 File Format specification and looking at data files created with the 1.8 compatibility settings by the HDF5 library version 1.10 wrapped in the Java layer.

The file has a version 2 superblock (i.e. no Group Leaf Node K and Group Internal Node K) in the header.
The superblock does not refer to a Superblock extension, i.e. there is no reference to a “B-tree ‘K’ Values Message”.
The file contains a chunked dataset which uses version 1 B-trees. How would a reader know the K value for this B-tree? There are no defaults established in the file format specification.

Q1: Could you please clarify via the file format specification how the K values should be determined in this case?

The Data Layout message for this dataset is a version 3 layout message. While the dataset was created with a single dimension, its Dimensionality value is set at 2.
The specification says: "This (Dimensionality) specifies the number of dimension size fields later in the message."
That is, one would read two dimension size fields from the message. But this does not leave enough message data to read the Dataset Element Size field at the end.

Q2: Could you please clarify why the dimensionality value is one higher than the number of dimensions? Should a reader subtract 1 from the value in the dimensionality field to be able to decode the layout together with the dataset element size?

Thank you very much in advance for your support!


#2

Q1. The “B-tree ‘K’ Values Message” is intended for non-default K values. If there isn’t one, the defaults apply. You have a point: the defaults should be documented in the file format spec, and I can’t find them there either. They are mentioned in the documentation of H5Pset_sym_k (ik = 16 and lk = 4) and H5Pset_istore_k (ik = 32).
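
A minimal sketch of how a reader might apply this fallback. The dictionary keys and the function name `resolve_k_values` are hypothetical and chosen for illustration only; the numeric defaults are the ones quoted above from the H5Pset_sym_k and H5Pset_istore_k documentation, not from the file format specification.

```python
# Defaults quoted in the thread from the H5Pset_sym_k and H5Pset_istore_k
# documentation. The key names are made up for this sketch.
DEFAULT_K = {
    "group_internal_k": 16,  # H5Pset_sym_k, ik
    "group_leaf_k": 4,       # H5Pset_sym_k, lk
    "chunk_internal_k": 32,  # H5Pset_istore_k, ik
}


def resolve_k_values(btree_k_message=None):
    """Return the K values a reader should use.

    btree_k_message is a hypothetical dict of fields decoded from a
    B-tree 'K' Values Message, or None when no such message exists.
    """
    k = dict(DEFAULT_K)
    if btree_k_message is not None:
        k.update(btree_k_message)
    return k


# No superblock extension, as in the file discussed here: defaults apply.
print(resolve_k_values())
```

With this shape, a reader never needs a special case for files without a superblock extension; the defaults are simply the starting point.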

Q2. Can you send us an example? There is no internal adjustment to “dimensionality” or rank of a dataspace. If it says 2, that’s what it is.


#3

Hi Gerd,
thank you for the answer.
Ad Q1: as you agree that this should be part of the spec, could you please point out the best place to report these improvement requests? Do you have a tracker for this?

Ad Q2:
please take a look at https://support.hdfgroup.org/ftp/HDF5/examples/files/exbyapi/h5ex_d_unlimmod.h5 (one of the published sample files from The HDF Group).
This file has a version 1 object header at offset 01440 (octal, i.e. 800 decimal) with 6 messages:

0001440 01 00 06 00 01 00 00 00 00 01 00 00 00 00 00 00
0001460 05 00 08 00 01 00 00 00 02 03 02 01 00 00 00 00
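
As a rough illustration, these 32 bytes can be decoded with Python's `struct` module. This is only a sketch: it assumes the version 1 object header layout (version, reserved byte, message count, reference count, header size, then padding to an 8-byte boundary) and the version 1 message header layout (type, size, flags, 3 reserved bytes).

```python
import struct

# The 32 bytes shown in the dump above.
raw = bytes.fromhex(
    "01 00 06 00 01 00 00 00 00 01 00 00 00 00 00 00 "
    "05 00 08 00 01 00 00 00 02 03 02 01 00 00 00 00"
)

# Version 1 object header prefix: 12 bytes of fields, little endian.
version, num_messages, ref_count, header_size = struct.unpack_from("<BxHII", raw, 0)

# Assumption: messages start after 4 padding bytes (8-byte alignment).
first_message_offset = 16

# Version 1 message header: type, data size, flags, 3 reserved bytes.
msg_type, msg_size, msg_flags = struct.unpack_from("<HHB3x", raw, first_message_offset)
msg_data = raw[first_message_offset + 8 : first_message_offset + 8 + msg_size]

print(version, num_messages, ref_count, header_size)
print(hex(msg_type), msg_size, msg_flags, msg_data.hex())
```

The first line of output gives version 1, 6 messages, reference count 1 and a header size of 256 bytes; the first message has type 0x0005 and 8 bytes of data.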

The 3rd message of the header is a version 1 Dataspace message:

0001540 01 02 01 00 00 00 00 00 06 00 00 00 00 00 00 00
0001560 0a 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff
0001600 ff ff ff ff ff ff ff ff

It declares 2 dimensions.
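
For reference, a small sketch that decodes these 40 bytes. It assumes the version 1 Dataspace message layout (version, dimensionality, flags, 5 reserved bytes, then the dimension sizes) and 8-byte lengths, which is what the dump suggests for this file; the name `UNLIMITED` is just a label used here.

```python
import struct

UNLIMITED = 0xFFFFFFFFFFFFFFFF  # all bits set marks an unlimited dimension

# The 40 bytes of the Dataspace message shown above.
raw = bytes.fromhex(
    "01 02 01 00 00 00 00 00 06 00 00 00 00 00 00 00 "
    "0a 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff "
    "ff ff ff ff ff ff ff ff"
)

version, rank, flags = struct.unpack_from("<BBB", raw, 0)
offset = 8  # skip the 3 fields above plus 5 reserved bytes

# Current dimension sizes, one 8-byte value per dimension (assumed width).
dims = struct.unpack_from(f"<{rank}Q", raw, offset)
offset += 8 * rank

# Bit 0 of the flags says that maximum dimension sizes follow.
max_dims = None
if flags & 0x01:
    max_dims = struct.unpack_from(f"<{rank}Q", raw, offset)

print(version, rank, dims)
print([("UNLIM" if m == UNLIMITED else m) for m in max_dims])
```

This yields rank 2 with sizes (6, 10) and two unlimited maximum sizes.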

The 4th message of the header is a version 3 chunked Datalayout message:

0001620 03 02 03 78 05 00 00 00 00 00 00 04 00 00 00 04
0001640 00 00 00 04 00 00 00 00

This is the message I have trouble reconciling with the specification at https://bitbucket.hdfgroup.org/pages/HDFFV/hdf5doc/master/browse/html/H5.format.html#LayoutMessage

0001620 03 (version 3)
0001621 02 (chunked storage)
0001622 03 (dimensionality)
0001623 78 05 00 00 00 00 00 00 (offset to BTreeV1)
0001633 04 00 00 00 (dimension 1)
0001637 04 00 00 00 (dimension 2)
0001643 04 00 00 00 (dimension 3 vs. dataset element size)

That leads to my original question: should I subtract 1 from the dimensionality field? This is specified for versions 1 and 2 of the Data Layout message, but not for version 3, so again I think this is a flaw in the documentation.
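
To make the two readings concrete, here is a minimal sketch that decodes the 24 message bytes under the assumption that the stored dimensionality is one greater than the number of chunk dimensions. It also assumes 8-byte offsets; the function name `parse_chunked_layout_v3` is hypothetical.

```python
import struct

# The 24 bytes of the version 3 Data Layout message shown above.
raw = bytes.fromhex(
    "03 02 03 78 05 00 00 00 00 00 00 04 00 00 00 04 "
    "00 00 00 04 00 00 00 00"
)


def parse_chunked_layout_v3(data):
    """Decode a version 3 chunked layout message, assuming 8-byte offsets."""
    version, layout_class, dimensionality = struct.unpack_from("<BBB", data, 0)
    if version != 3 or layout_class != 2:
        raise ValueError("not a version 3 chunked layout message")
    (btree_address,) = struct.unpack_from("<Q", data, 3)
    # Assumption under discussion: the field counts one more than the
    # chunk dimensions, so only dimensionality - 1 sizes are read here.
    n_dims = dimensionality - 1
    chunk_dims = struct.unpack_from(f"<{n_dims}I", data, 11)
    (element_size,) = struct.unpack_from("<I", data, 11 + 4 * n_dims)
    return btree_address, chunk_dims, element_size


print(parse_chunked_layout_v3(raw))
# Reading 3 dimension sizes instead would consume bytes 11..22 and leave
# a single byte, too few for a 4-byte Dataset Element Size field.
```

Under this reading the B-tree address is 1400 (0x578), the chunk is 4 x 4, and the element size is 4 bytes, with one trailing padding byte.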

Thank you in advance for clarifying this!


#4

I’m warming up to the idea that there is a mistake in the specification. The description of the Dimensionality field changed from versions 1 and 2 to version 3 as follows:

An array has a fixed dimensionality. This field specifies the number of dimension size fields later in the message. The value stored for chunked storage is 1 greater than the number of dimensions in the dataset’s dataspace. For example, 2 is stored for a 1 dimensional dataset.

turned into

A chunk has a fixed dimensionality. This field specifies the number of dimension size fields later in the message.

I think the third 4 is actually the Dataset Element Size field that the version 3 layout message spec calls for.

I think the confusion can also be seen in the h5debug output:

% h5debug h5ex_d_unlimmod.h5 800
Reading signature at address 800 (rel)
Object Header...
Dirty:                                             FALSE
Version:                                           1
Header size (in bytes):                            16
Number of links:                                   1
Number of messages (allocated):                    6 (8)
Number of chunks (allocated):                      1 (2)
Chunk 0...
   Address:                                        800
   Size in bytes:                                  256
   Gap:                                            0
Message 0...
   Message ID (sequence number):                   0x0005 `fill_new' (0)
   Dirty:                                          FALSE
   Message flags:                                  <C>
   Chunk number:                                   0
   Raw message data (offset, size) in chunk:       (24, 8) bytes
   Message Information:                           
      Space Allocation Time:                       Incremental
      Fill Time:                                   If Set
      Fill Value Defined:                          Default
      Size:                                        0
      Data type:                                   <dataset type>
Message 1...
   Message ID (sequence number):                   0x0003 `datatype' (0)
   Dirty:                                          FALSE
   Message flags:                                  <C>
   Chunk number:                                   0
   Raw message data (offset, size) in chunk:       (40, 16) bytes
   Message Information:                           
      Type class:                                  integer
      Size:                                        4 bytes
      Version:                                     1
      Byte order:                                  little endian
      Precision:                                   32 bits
      Offset:                                      0 bits
      Low pad type:                                zero
      High pad type:                               zero
      Sign scheme:                                 2's comp
Message 2...
   Message ID (sequence number):                   0x0001 `dataspace' (0)
   Dirty:                                          FALSE
   Message flags:                                  <none>
   Chunk number:                                   0
   Raw message data (offset, size) in chunk:       (64, 40) bytes
   Message Information:                           
      Rank:                                        2
      Dim Size:                                    {6, 10}
      Dim Max:                                     {UNLIM, UNLIM}
Message 3...
   Message ID (sequence number):                   0x0008 `layout' (0)
   Dirty:                                          FALSE
   Message flags:                                  <C>
   Chunk number:                                   0
   Raw message data (offset, size) in chunk:       (112, 24) bytes
   Message Information:                           
      Version:                                     3
      Type:                                        Chunked
      Number of dimensions:                        3
      Size:                                        {4, 4, 4}
      Index Type:                                  v1 B-tree
      Index address:                               1400
Message 4...
   Message ID (sequence number):                   0x0012 `mtime_new' (0)
   Dirty:                                          FALSE
   Message flags:                                  <none>
   Chunk number:                                   0
   Raw message data (offset, size) in chunk:       (144, 8) bytes
   Message Information:                           
      Time:                                        2010-03-18 08:36:44 CDT
Message 5...
   Message ID (sequence number):                   0x0000 `null' (0)
   Dirty:                                          FALSE
   Message flags:                                  <none>
   Chunk number:                                   0
   Raw message data (offset, size) in chunk:       (160, 112) bytes
   Message Information:                           
      <No info for this message>

Yes, I believe you are right on both points (subtract one, and the documentation is in error).
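
As a cross-check, the h5debug offsets can be reconciled with the octal offsets of the hex dump quoted earlier in the thread. This sketch assumes that the "Raw message data" offsets printed by h5debug are relative to the chunk address, which is what the numbers here suggest.

```python
# Chunk address reported by h5debug (decimal).
chunk_address = 800

# "Raw message data" offsets reported by h5debug for two of the messages.
h5debug_raw_offsets = {"dataspace": 64, "layout": 112}

# Offsets at which the same messages appear in the octal hex dump.
dump_offsets = {"dataspace": 0o1540, "layout": 0o1620}

# The object header itself sits at octal 01440 in the dump.
print(0o1440 == chunk_address)

for name, relative in h5debug_raw_offsets.items():
    absolute = chunk_address + relative
    print(name, absolute, oct(absolute), absolute == dump_offsets[name])
```

Both messages land on the same absolute offsets (864 and 912 decimal), so the dump and the h5debug output describe the same bytes.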

Best, G.