B-Tree size for chunked dataset

Hi HDF5 gurus,

We are seeing some unexpected behavior when creating HDF5 files
which we need to understand. Our datasets have wildly varying
sizes and complexity. In one extreme case we need to write
many (thousands of) relatively small datasets into one HDF5 file.
Because the size of the datasets and their number are not known
in advance, we prefer to use chunked storage for the data.
What we see is that when the number of chunked datasets is large,
the size of the HDF5 file becomes much larger than the
volume of the data stored. It seems that there is a significant
per-dataset overhead for chunked data.
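
For reference, here is roughly how we create each of these datasets
(a simplified sketch, not our actual code; names, sizes and chunk
dimensions are illustrative):

#include "hdf5.h"

/* Sketch: create one small 1-D chunked, gzip-compressed dataset.
   In our files this pattern is repeated tens of thousands of times. */
static int write_small_dataset(hid_t file, const char *name,
                               const double *data, hsize_t n)
{
    hsize_t maxdims[1] = {H5S_UNLIMITED};  /* final size not known in advance */
    hsize_t chunk[1]   = {1024};           /* illustrative chunk size */
    hid_t space = H5Screate_simple(1, &n, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);               /* gzip, as in the h5stat output below */
    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                             H5P_DEFAULT, data);
    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    return (status < 0) ? -1 : 0;
}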

I tried to understand where this overhead comes from. Running
h5stat on several files, I see that in our case the B-Tree for
chunked datasets takes more space than the data itself. I
collected the dependency of the B-Tree size on the number of
chunked datasets in a file. It looks like the size of the tree
grows linearly with the number of chunked datasets (at least in
our case, where there is just one chunk per dataset). It seems
that 2096 bytes of B-Tree space are allocated for every chunked
dataset; indeed, in the file below, 54922 chunked datasets x 2096
bytes = 115116512 bytes, which is exactly the chunked-dataset
B-Tree size that h5stat reports. With a large number of datasets
(tens of thousands) this overhead becomes very big for us. Below
is an example of h5stat output for one of our problematic files,
in case anybody is interested in looking at it.

Is my analysis of B-Tree size growth correct? Is there a way to
reduce the size?

Cheers,
Andy


================================================================

File information
        # of unique groups: 31107
        # of unique datasets: 58767
        # of unique named datatypes: 0
        # of unique links: 0
        # of unique other: 0
        Max. # of links to object: 1
        Max. # of objects in group: 201
Object header size: (total/unused)
        Groups: 1311904/0
        Datasets: 27052128/16288
        Datatypes: 0/0
Storage information:
        Groups:
                B-tree/List: 28798864
                Heap: 4964928
        Attributes:
                B-tree/List: 0
                Heap: 0
        Chunked datasets:
                B-tree: 115116512
        Shared Messages:
                Header: 0
                B-tree/List: 0
                Heap: 0
        Superblock extension: 0
Small groups:
        # of groups of size 1: 1222
        # of groups of size 2: 27663
        # of groups of size 3: 1211
        # of groups of size 4: 203
        # of groups of size 5: 403
        Total # of small groups: 30702
Group bins:
        # of groups of size 1 - 9: 30702
        # of groups of size 10 - 99: 202
        # of groups of size 100 - 999: 203
        Total # of groups: 31107
Dataset dimension information:
        Max. rank of datasets: 1
        Dataset ranks:
                # of dataset with rank 0: 2230
                # of dataset with rank 1: 56537
1-D Dataset information:
        Max. dimension size of 1-D datasets: 600
        Small 1-D datasets:
                # of dataset dimensions of size 1: 1461
                # of dataset dimensions of size 4: 202
                # of dataset dimensions of size 9: 202
                Total small datasets: 1865
        1-D Dataset dimension bins:
                # of datasets of size 1 - 9: 1865
                # of datasets of size 10 - 99: 49044
                # of datasets of size 100 - 999: 5628
                Total # of datasets: 56537
Dataset storage information:
        Total raw data size: 74331886
Dataset layout information:
        Dataset layout counts[COMPACT]: 0
        Dataset layout counts[CONTIG]: 3845
        Dataset layout counts[CHUNKED]: 54922
        Number of external files : 0
Dataset filters information:
        Number of datasets with:
                NO filter: 3845
                GZIP filter: 54922
                SHUFFLE filter: 0
                FLETCHER32 filter: 0
                SZIP filter: 0
                NBIT filter: 0
                SCALEOFFSET filter: 0
                USER-DEFINED filter: 0

Hi Andy,

On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:

Is my analysis of B-Tree size growth correct?

  Yes, I think your analysis is correct. Currently, there's at least one B-tree node per chunked dataset (that will be changing with the 1.10.0 release, when it's finished).

Is there a way to reduce the size?

  You should be able to use the H5Pset_istore_k() API routine (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to reduce the B-tree fanout value.
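
Something along these lines (an untested sketch; the value 4 and the
file name are just examples); the property has to go on the file
creation property list, before the file is created:

#include "hdf5.h"

int main(void)
{
    /* Sketch: lower the chunk B-tree half-rank (ik) at file creation
       time, which shrinks each B-tree node. The value is illustrative. */
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_istore_k(fcpl, 4);
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);

    /* ... create groups and chunked datasets as usual ... */

    H5Fclose(file);
    H5Pclose(fcpl);
    return 0;
}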

  Quincey


Hi Quincey,

thanks for the info. I could not find the default value for
H5Pset_istore_k; is it documented anywhere? What would be a
recommended value for our case with many small datasets?

Cheers,
Andy


Hi Andy,

On Aug 31, 2010, at 10:04 AM, Salnikov, Andrei A. wrote:

Hi Quincey,

thanks for the info. I could not find the default value for
H5Pset_istore_k; is it documented anywhere?

  The default value is 32 (which means the fanout is 64). (You can also call H5Pget_istore_k() with a newly created file creation property list to query it.)

What would be a recommended value for our case with many small datasets?

  You could probably turn it all the way down to 2-4 without any problems, if all the datasets have very few chunks.
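
For example (again just a sketch; the file name is made up):

#include "hdf5.h"
#include <stdio.h>

int main(void)
{
    unsigned ik = 0;
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);

    /* Query the library default (should print 32, i.e. a fanout of 64). */
    H5Pget_istore_k(fcpl, &ik);
    printf("default istore_k = %u\n", ik);

    /* With only one chunk per dataset, a very small half-rank keeps the
       per-dataset B-tree node small. */
    H5Pset_istore_k(fcpl, 2);
    hid_t file = H5Fcreate("small_btree.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);

    H5Fclose(file);
    H5Pclose(fcpl);
    return 0;
}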

  Quincey
