Hi HDF5 gurus,

We are seeing some unexpected behavior when creating HDF5 files, and we need to understand it. Our datasets have wildly varying sizes and complexity. In one extreme case we need to write many (thousands of) relatively small datasets into one HDF5 file. Because the size and number of the datasets are not known in advance, we prefer to use chunked storage for the data.
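For context, the write pattern is essentially the following (a minimal h5py sketch of our usage, not our actual code; the dataset names and sizes here are made up):

```python
import os
import tempfile

import h5py
import numpy as np

# Many small datasets, each stored chunked (one chunk per dataset)
# and gzip-compressed, all in a single HDF5 file.
path = os.path.join(tempfile.mkdtemp(), "many_small.h5")
n_datasets = 1000
data = np.arange(100, dtype=np.float64)  # 800 bytes of raw data each

with h5py.File(path, "w") as f:
    for i in range(n_datasets):
        # chunks= forces a chunked layout even for tiny datasets;
        # every chunked dataset gets its own chunk B-tree index.
        f.create_dataset(f"ds{i:05d}", data=data,
                         chunks=data.shape, compression="gzip")

raw = n_datasets * data.nbytes
print("raw data bytes: ", raw)
print("file size bytes:", os.path.getsize(path))
```

Even at this small scale the per-dataset metadata dominates the 800 bytes of raw data per dataset, and the file comes out several times larger than the data it holds.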

What we see is that when the number of chunked datasets is large, the size of the HDF5 file becomes much larger than the volume of the data stored. It seems there is a significant per-dataset overhead for chunked data.

I tried to understand where this overhead comes from. Running h5stat on several files, I see that in our case the B-tree for chunked datasets takes more space than even the data itself. I collected the dependency of the B-tree size on the number of chunked datasets in a file, and the size of the tree grows linearly with the number of chunked datasets (at least in our case, where there is just one chunk per dataset). It seems that 2096 bytes of B-tree space are allocated for every chunked dataset. With a large number of datasets (tens of thousands) this overhead becomes very big for us. Below is an example of h5stat output for one of our problematic files, in case anybody is interested in looking at it.
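The 2096-byte figure can be cross-checked against the h5stat numbers below with a trivial computation (Python):

```python
# Numbers taken from the h5stat output for our problematic file.
chunked_btree_bytes = 115116512  # Storage information -> Chunked datasets -> B-tree
n_chunked_datasets = 54922       # Dataset layout counts[CHUNKED]

# The reported B-tree total divides evenly: exactly 2096 bytes
# of B-tree space per chunked dataset.
per_dataset, remainder = divmod(chunked_btree_bytes, n_chunked_datasets)
print(per_dataset, remainder)  # → 2096 0
```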

Is my analysis of the B-tree size growth correct? Is there a way to reduce the size?

Cheers,

Andy


================================================================
File information
    # of unique groups: 31107
    # of unique datasets: 58767
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 201
Object header size: (total/unused)
    Groups: 1311904/0
    Datasets: 27052128/16288
    Datatypes: 0/0
Storage information:
    Groups:
        B-tree/List: 28798864
        Heap: 4964928
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        B-tree: 115116512
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
    Superblock extension: 0
Small groups:
    # of groups of size 1: 1222
    # of groups of size 2: 27663
    # of groups of size 3: 1211
    # of groups of size 4: 203
    # of groups of size 5: 403
    Total # of small groups: 30702
Group bins:
    # of groups of size 1 - 9: 30702
    # of groups of size 10 - 99: 202
    # of groups of size 100 - 999: 203
    Total # of groups: 31107
Dataset dimension information:
    Max. rank of datasets: 1
    Dataset ranks:
        # of dataset with rank 0: 2230
        # of dataset with rank 1: 56537
1-D Dataset information:
    Max. dimension size of 1-D datasets: 600
    Small 1-D datasets:
        # of dataset dimensions of size 1: 1461
        # of dataset dimensions of size 4: 202
        # of dataset dimensions of size 9: 202
        Total small datasets: 1865
    1-D Dataset dimension bins:
        # of datasets of size 1 - 9: 1865
        # of datasets of size 10 - 99: 49044
        # of datasets of size 100 - 999: 5628
        Total # of datasets: 56537
Dataset storage information:
    Total raw data size: 74331886
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 3845
    Dataset layout counts[CHUNKED]: 54922
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 3845
        GZIP filter: 54922
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0