Hi HDF5 gurus,
We are seeing some unexpected behavior when creating HDF5 files
which we need to understand. Our datasets have wildly varying
sizes and complexity. In one extreme case we need to write
many (thousands of) relatively small datasets into one HDF5 file.
Because the size and number of the datasets are not known in
advance, we prefer to use chunked storage for the data.
What we see is that when the number of chunked datasets is large,
the size of the HDF5 file becomes much larger than the
volume of the data stored. It seems that there is a significant
per-dataset overhead for chunked data.
I tried to understand where this overhead comes from. Running
h5stat on several files, I see that in our case the B-tree for
chunked datasets takes more space than even the data itself.
I measured how the B-tree size depends on the number of chunked
datasets in a file. The size of the tree appears to grow linearly
with the number of chunked datasets (at least in our case, where
there is just one chunk per dataset), with about 2096 bytes of
B-tree space allocated per chunked dataset. With a large number
of datasets (tens of thousands) this overhead becomes very
significant for us. Below is an example of h5stat output for one
of our problematic files, in case anybody is interested in
looking at it.
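The per-dataset figure can be checked directly from the h5stat report below: dividing the chunked-dataset B-tree size by the number of chunked datasets gives exactly 2096 bytes, and the B-tree total indeed exceeds the raw data size. A quick sanity check (all figures copied from the h5stat output):

```python
# Figures taken from the h5stat report below.
btree_bytes = 115_116_512    # "Chunked datasets: B-tree"
chunked_datasets = 54_922    # "Dataset layout counts[CHUNKED]"
raw_data_bytes = 74_331_886  # "Total raw data size"

per_dataset = btree_bytes / chunked_datasets
print(f"B-tree bytes per chunked dataset: {per_dataset:.1f}")      # 2096.0
print(f"B-tree / raw data ratio: {btree_bytes / raw_data_bytes:.2f}")  # 1.55
```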
Is my analysis of the B-tree size growth correct? Is there a way
to reduce the size?
Cheers,
Andy
···
================================================================
File information
    # of unique groups: 31107
    # of unique datasets: 58767
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 201
Object header size: (total/unused)
    Groups: 1311904/0
    Datasets: 27052128/16288
    Datatypes: 0/0
Storage information:
    Groups:
        B-tree/List: 28798864
        Heap: 4964928
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        B-tree: 115116512
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
    Superblock extension: 0
Small groups:
    # of groups of size 1: 1222
    # of groups of size 2: 27663
    # of groups of size 3: 1211
    # of groups of size 4: 203
    # of groups of size 5: 403
    Total # of small groups: 30702
Group bins:
    # of groups of size 1 - 9: 30702
    # of groups of size 10 - 99: 202
    # of groups of size 100 - 999: 203
    Total # of groups: 31107
Dataset dimension information:
    Max. rank of datasets: 1
    Dataset ranks:
        # of dataset with rank 0: 2230
        # of dataset with rank 1: 56537
1-D Dataset information:
    Max. dimension size of 1-D datasets: 600
    Small 1-D datasets:
        # of dataset dimensions of size 1: 1461
        # of dataset dimensions of size 4: 202
        # of dataset dimensions of size 9: 202
        Total small datasets: 1865
    1-D Dataset dimension bins:
        # of datasets of size 1 - 9: 1865
        # of datasets of size 10 - 99: 49044
        # of datasets of size 100 - 999: 5628
        Total # of datasets: 56537
Dataset storage information:
    Total raw data size: 74331886
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 3845
    Dataset layout counts[CHUNKED]: 54922
    Number of external files: 0
Dataset filters information:
    Number of datasets with:
        NO filter: 3845
        GZIP filter: 54922
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Hi Andy,
Is my analysis of B-Tree size growth correct?
Yes, I think your analysis is correct. Currently, there's at least one B-tree node per chunked dataset (that will be changing with the 1.10.0 release, when it's finished).
Is there a way to reduce the size?
You should be able to use the H5Pset_istore_k() API routine (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to reduce the B-tree fanout value.
Quincey
···
On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
Hi Quincey,
thanks for the info. I could not find the default value for
H5Pset_istore_k; is it documented anywhere? What would be the
recommended value for our case with many small datasets?
Cheers,
Andy
···
Quincey Koziol wrote on 2010-08-31:
Hi Andy,
Hi Quincey,
thanks for the info. I could not find the default value for
H5Pset_istore_k; is it documented anywhere?
The default value is 32 (which means a fanout of 64). You can also call H5Pget_istore_k() on a newly created file-creation property list to query it.
What would be the recommended value for our case with many small datasets?
You could probably turn it all the way down to 2-4 without any problems, if all the datasets have very few chunks.
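To get a feel for the savings: if I read the v1 B-tree node layout correctly, a chunk B-tree node holds 2K+1 keys and 2K child addresses after a fixed 24-byte header, and for a 1-D chunked dataset each key is 24 bytes (chunk size, filter mask, and two 8-byte offsets) with 8-byte child addresses. Under those assumptions (my reading of the format, not gospel), the node size as a function of K works out like this:

```python
def node_size(k, key_bytes=24, addr_bytes=8, header_bytes=24):
    """Estimated on-disk size of one v1 chunk B-tree node.

    Assumes the layout sketched above: fixed header, 2K+1 keys,
    2K child addresses. key_bytes=24 corresponds to a 1-D dataset
    with 8-byte offsets.
    """
    return header_bytes + (2 * k + 1) * key_bytes + 2 * k * addr_bytes

print(node_size(32))  # default K=32 -> 2096 bytes, matching the h5stat figure
print(node_size(2))   # K=2 -> 176 bytes per single-chunk dataset
```

With ~55,000 single-chunk datasets that would shrink the chunk B-tree overhead from roughly 115 MB to under 10 MB, assuming one node per dataset.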
Quincey
···
On Aug 31, 2010, at 10:04 AM, Salnikov, Andrei A. wrote: