Questions about the size of generated HDF5 files

Hello everyone!

I am creating an HDF5 file from a Fortran program, and I am confused about
the size of my generated HDF5 file.

I am writing 19000 datasets, each holding 21 values of 64-bit reals.
I write one value at a time, extending each of the 19000 datasets by one
element on every write (a stripped-down sketch of this pattern is shown below).
All data are correctly written.
But the generated file is more than 48 MB.
I expected the total size of the file to be only a little bigger than the raw
data, about 3.2 MB (21 * 19000 * 8 / 1e6 = 3.192 MB).
If I only create the 19000 empty datasets, I obtain a 6 MB HDF5 file, which
means each empty dataset accounts for roughly 300 bytes of overhead.
I would therefore expect a ~10 MB (6 MB + 3.2 MB) HDF5 file to be able to
hold everything.
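
To make the pattern concrete, here is roughly what the write loop does, shown
as a minimal C sketch for a single dataset (the chunk size of 1, the file name
and the dataset name are only illustrative; my actual program is in Fortran):

    #include "hdf5.h"

    int main(void)
    {
        hid_t   file, space, dcpl, dset, mspace, fspace;
        hsize_t dims[1]    = {0};              /* start empty              */
        hsize_t maxdims[1] = {H5S_UNLIMITED};  /* allow the extend calls   */
        hsize_t chunk[1]   = {1};              /* illustrative chunk size  */
        hsize_t count[1]   = {1}, start[1];
        double  value;
        int     i;

        file  = H5Fcreate("series.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        space = H5Screate_simple(1, dims, maxdims);
        dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);          /* extendible => chunked    */
        dset  = H5Dcreate2(file, "signal", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

        for (i = 0; i < 21; i++) {
            dims[0] = (hsize_t)(i + 1);
            H5Dset_extent(dset, dims);         /* grow by one element      */

            fspace   = H5Dget_space(dset);
            start[0] = (hsize_t)i;
            H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
            mspace   = H5Screate_simple(1, count, NULL);

            value = (double)i;                 /* dummy value              */
            H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, &value);

            H5Sclose(mspace);
            H5Sclose(fspace);
        }

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }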

For comparison, if I write everything to a text file, with each real number
written using 15 characters, I obtain a 6 MB CSV file.

Question 1)
Is this behaviour normal?

Question 2)
Can extending a dataset every time data is written to it significantly
increase the required disk space?
Can preallocating the datasets and writing with hyperslab selections save
some space? (See the sketch after this question.)
Can the chunk parameters affect the size of the generated HDF5 file?
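
For reference, this is the kind of preallocated alternative I have in mind: a
minimal C sketch, assuming the final size of 21 values is known up front (a
fixed-size dataset defaults to a contiguous layout; the names are illustrative):

    #include "hdf5.h"

    int main(void)
    {
        hid_t   file, space, dset, mspace, fspace;
        hsize_t dims[1]  = {21};               /* final size known up front */
        hsize_t count[1] = {1}, start[1];
        double  value;
        int     i;

        file  = H5Fcreate("series_prealloc.h5", H5F_ACC_TRUNC,
                          H5P_DEFAULT, H5P_DEFAULT);
        space = H5Screate_simple(1, dims, NULL);  /* fixed size, no chunking */
        dset  = H5Dcreate2(file, "signal", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Still write one value at a time, but into a preallocated dataset. */
        for (i = 0; i < 21; i++) {
            fspace   = H5Dget_space(dset);
            start[0] = (hsize_t)i;
            H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
            mspace   = H5Screate_simple(1, count, NULL);

            value = (double)i;                 /* dummy value */
            H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, &value);

            H5Sclose(mspace);
            H5Sclose(fspace);
        }

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }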

Question 3)
If I pack everything into one compound dataset with 19000 fields, will the
resulting file be smaller?

N.B:
Looking at the example that generates 100000 groups (grplots.c), the size of
the generated HDF5 file is 78 MB for 100000 empty groups.
That means each group costs about 780 bytes.
https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c

Guillaume Jacquenot

Hi Guillaume,
  Are you using chunked or contiguous datasets? If chunked, what chunk size are you using? Also, can you use the “latest” version of the file format, which should be smaller but is only compatible with HDF5 1.10.x or later? (i.e. H5Pset_libver_bounds with “latest” for both the low and high bounds: https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm)
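
  In C, that would look roughly like this (a minimal sketch; the file name is only a placeholder, and the Fortran wrappers should expose the same property-list setting):

    #include "hdf5.h"

    int main(void)
    {
        hid_t fapl, file;

        /* Ask for the newest (more compact) object formats for everything
           created through this file handle. */
        fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

        file = H5Fcreate("series_latest.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... create the groups and datasets as before ... */

        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }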

  Quincey

