Questions about the size of generated HDF5 files

Hello HDF5 community, Quincey,

I have tested versions 1.8.16 and 1.10.1, also with the h5pset_libver_bounds_f subroutine.

I have inserted these calls into my benchmark program:

    call h5open_f(error)
    call h5pcreate_f( H5P_FILE_ACCESS_F, fapl_id, error)
    call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, H5F_LIBVER_LATEST_F, error)
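
(Note that the version bounds only apply to files that are subsequently created or opened with this fapl_id, so the property list has to reach h5fcreate_f. A minimal sketch of that call, assuming the benchmark creates results.h5 itself; the H5P_DEFAULT_F creation property list and the file name are illustrative, not taken from the benchmark:)

    integer(hid_t) :: file_id   ! in addition to fapl_id and error above
    ! Pass the fapl carrying the libver bounds as the file-access property list.
    call h5fcreate_f("results.h5", H5F_ACC_TRUNC_F, file_id, error, &
                     H5P_DEFAULT_F, fapl_id)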

However, I can't see any difference in the size of the generated HDF5 files.
Below are the size and md5sum of the generated HDF5 files, for the two library versions and different numbers of elements (0, 1 and 2) in each dataset:

Version 1.8.16
$ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
ee8157f1ce74936021b1958fb796741e *results.h5
-rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:17 results.h5

$ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
1790a5650bb945b17c0f8a4e59adec85 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:17 results.h5

$ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
7d3dff2c6a1c29fa0fe827e4bd5ba79e *results.h5
-rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:17 results.h5

Version 1.10.1
$ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
ec8169773b9ea015c81fc4cb2205d727 *results.h5
-rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:12 results.h5

$ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
fae64160fe79f4af0ef382fd1790bf76 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:14 results.h5

$ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
20aaf160b3d8ab794ab8c14a604dacc5 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:14 results.h5

···

2017-05-23 19:12 GMT+02:00 Guillaume Jacquenot <guillaume.jacquenot@gmail.com>:

Hello Quincey

I am using version 1.8.16.

I am using a chunk size of 1.
I have tried a contiguous dataset, but I get an error at runtime (the trace is at the end of this message).

I have written a test program that creates 3000 datasets, each filled with 64-bit floating-point numbers.
I can specify a number n, which controls how many times I save my data (the number of timesteps of a simulation in my case).

To summarize, the test program does:

    call hdf5_init(filename)
    do i = 1, n
        call hdf5_write(datatosave)
    end do
    call hdf5_close()
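
For reference, each hdf5_write step extends a 1-D chunked dataset by one element and writes the new value at the end, roughly like this (a simplified sketch, not the actual code; identifiers are illustrative):

    integer(hid_t)   :: dset_id, fspace_id, mspace_id
    integer(hsize_t) :: cur_size(1), new_size(1), offset(1), cnt(1)
    real(kind=8)     :: value(1)

    ! Grow the dataset by one element and write the new value at the end.
    new_size(1) = cur_size(1) + 1
    call h5dset_extent_f(dset_id, new_size, error)
    call h5dget_space_f(dset_id, fspace_id, error)
    offset(1) = cur_size(1)                 ! 0-based index of the new element
    cnt(1)    = 1
    call h5sselect_hyperslab_f(fspace_id, H5S_SELECT_SET_F, offset, cnt, error)
    call h5screate_simple_f(1, cnt, mspace_id, error)
    call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, value, cnt, error, &
                    mspace_id, fspace_id)
    call h5sclose_f(mspace_id, error)
    call h5sclose_f(fspace_id, error)
    cur_size = new_size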

With n = 0, I have an HDF5 file of size 1.11 MB, which corresponds to about 370 bytes per empty dataset (totally reasonable).
With n = 1, I have an HDF5 file of size 7.13 MB, which surprises me. Why such an increase?
With n = 2, I have an HDF5 file of size 7.15 MB, an increase of 0.02 MB, which is logical (3000*8*1/1e6 = 0.024 MB).

When setting the chunk size to 10, I obtain the following results:

With n = 0, I have an HDF5 file of size 1.11 MB, which corresponds to about 370 bytes per empty dataset.
With n = 1, I have an HDF5 file of size 7.34 MB, which surprises me.
With n = 2, I have an HDF5 file of size 7.15 MB, which leads to an increase of 3000*8*10/1e6 MB, which is logical.

I don't understand the first increase in size. It does not make this data storage very efficient.
Do you think a compound dataset with 3000 columns would present the same behaviour? I have not tried it, since I don't know how to map the content of an array when calling the h5dwrite_f function for a compound dataset.

If I create 30000 datasets, I observe the same behaviour:
n=0 -> 10.9 MB
n=1 -> 73.2 MB

Thanks

Here is the error I get with a contiguous dataset:

  #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: hdf5-1.8.16/src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct(): extendible contiguous non-external dataset
    major: Dataset
    minor: Feature is unsupported
HDF5-DIAG: Error detected in HDF5 (1.8.16) t^C
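
(For reference, the last entry above, "extendible contiguous non-external dataset ... Feature is unsupported", is raised when a contiguous dataset is created on a dataspace whose maximum dimensions are larger than its current dimensions, e.g. H5S_UNLIMITED_F. A contiguous dataset cannot be extended, so its full extent has to be fixed at creation time. A minimal sketch, assuming the total number of timesteps nsteps is known in advance and using an illustrative dataset name:)

    integer(hid_t)   :: space_id, dset_id
    integer(hsize_t) :: dims(1)

    dims(1) = nsteps                       ! fixed extent: all timesteps up front
    ! No maxdims argument: the maximum dimensions default to dims,
    ! and the default layout (no chunking property) is contiguous.
    call h5screate_simple_f(1, dims, space_id, error)
    call h5dcreate_f(file_id, "dataset_0001", H5T_NATIVE_DOUBLE, space_id, &
                     dset_id, error)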

2017-05-23 19:00 GMT+02:00 <hdf-forum-request@lists.hdfgroup.org>:

Date: Tue, 23 May 2017 08:22:59 -0700
From: Quincey Koziol <koziol@lbl.gov>
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files
Message-ID: <9B8A951B-D2F7-489F-8E60-005C4242E2CF@lbl.gov>
Content-Type: text/plain; charset="utf-8"

Hi Guillaume,
        Are you using chunked or contiguous datasets? If chunked, what size are you using? Also, can you use the "latest" version of the format, which should be smaller, but is only compatible with HDF5 1.10.x or later? (i.e. H5Pset_libver_bounds with "latest" for low and high bounds, https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm)

        Quincey

> On May 23, 2017, at 3:02 AM, Guillaume Jacquenot <guillaume.jacquenot@gmail.com> wrote:
>
> Hello everyone!
>
> I am creating an HDF5 file from a Fortran program, and I am confused about the size of my generated HDF5 file.
>
> I am writing 19000 datasets, each with 21 values of 64-bit reals.
> I write one value at a time, extending each of the 19000 datasets by one element every time.
> All data are correctly written.
> But the generated file is more than 48 MB.
> I expected the total size of the file to be a little bigger than the raw data, about 3.2 MB (21*19000*8 / 1e6 = 3.192 MB).
> If I only create the 19000 empty datasets, I obtain a 6 MB HDF5 file, which means each empty dataset costs about 400 bytes.
> So I guess I should be able to create a ~10 MB (6 MB + 3.2 MB) HDF5 file that contains everything.
>
> For comparison, if I write everything to a text file, where each real number is written with 15 characters, I obtain a 6 MB CSV file.
>
> Question 1)
> Is this behaviour normal?
>
> Question 2)
> Does extending a dataset each time we write data into it significantly increase the total required disk space?
> Can preallocating the dataset and using hyperslabs save some space?
> Do the chunk parameters impact the size of the generated HDF5 file?
>
> Question 3)
> If I pack everything into a compound dataset with 19000 columns, will the resulting file be smaller?
>
> N.B.:
> Looking at the example that generates 100000 groups (grplots.c), the size of the generated HDF5 file is 78 MB for 100000 empty groups.
> That means each group costs about 780 bytes.
> https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
>
> Guillaume Jacquenot

Hi Guillaume,
  As Pierre mentioned, a chunk size of 1 is not reasonable and will generate a lot of metadata overhead. Something closer to 1MB of data elements would be much better.

  Quincey
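
For illustration, a minimal sketch of creating one 1-D extendible dataset of 64-bit reals with a larger chunk, along the lines suggested above (the chunk length of 1024 elements, i.e. 8 KB, and all names are illustrative assumptions, not taken from the original program):

    integer(hsize_t) :: dims(1), maxdims(1), chunk(1)
    integer(hid_t)   :: space_id, dcpl_id, dset_id

    dims(1)    = 0                     ! start empty
    maxdims(1) = H5S_UNLIMITED_F       ! still extendible
    chunk(1)   = 1024                  ! 1024 doubles = 8 KB per chunk (illustrative)

    call h5screate_simple_f(1, dims, space_id, error, maxdims)
    call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
    call h5pset_chunk_f(dcpl_id, 1, chunk, error)
    call h5dcreate_f(file_id, "dataset_0001", H5T_NATIVE_DOUBLE, space_id, &
                     dset_id, error, dcpl_id)

Since a chunk is allocated in full on disk the first time any part of it is written, very large chunks on datasets that stay mostly empty also waste space, so a chunk length on the order of the expected number of timesteps per dataset is a reasonable compromise here.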
