Questions about the size of generated HDF5 files

Hello Quincey,

I am using version 1.8.16.
I am using chunks of size 1.
I have tried a contiguous dataset, but I get an error at runtime (the full
trace is at the end of this message).
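
Roughly, each dataset is created as extendible with that chunk size; a
simplified sketch (names like file_id and dset_name are illustrative, not
the exact code):

    ! Create a 1-D, extendible, chunked dataset that starts empty.
    ! Assumes file_id, dset_name and hdferr exist in the caller.
    integer(hsize_t), dimension(1) :: dims  = (/0_hsize_t/)
    integer(hsize_t), dimension(1) :: chunk = (/1_hsize_t/)
    integer(hsize_t), dimension(1) :: maxdims
    integer(hid_t) :: space_id, dcpl_id, dset_id

    maxdims(1) = H5S_UNLIMITED_F
    call h5screate_simple_f(1, dims, space_id, hdferr, maxdims)
    call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, hdferr)
    call h5pset_chunk_f(dcpl_id, 1, chunk, hdferr)
    ! Leaving out h5pset_chunk_f (i.e. contiguous layout) fails, since a
    ! contiguous dataset cannot have an unlimited dimension -- see the
    ! "extendible contiguous non-external dataset" error at the end.
    call h5dcreate_f(file_id, dset_name, H5T_NATIVE_DOUBLE, space_id, &
                     dset_id, hdferr, dcpl_id)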

I have written a test program that creates 3000 datasets, each filled with
64-bit floating point numbers.
I can specify a number n, which controls how many times my data is saved
(the number of timesteps of a simulation, in my case).

To summarize, the test program does the following:

    call hdf5_init(filename)
    do i = 1, n
        call hdf5_write(datatosave)
    end do
    call hdf5_close()
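
where hdf5_write extends each dataset by one element and writes the new
value through a hyperslab selection, along these lines (a sketch, not the
exact code; cur_size tracks the current length of the dataset):

    ! One write to one dataset: grow it by one, select the new slot,
    ! and write the value there.
    integer(hsize_t), dimension(1) :: cur_size, offset, count
    integer(hid_t) :: file_space, mem_space

    cur_size(1) = cur_size(1) + 1
    call h5dset_extent_f(dset_id, cur_size, hdferr)
    call h5dget_space_f(dset_id, file_space, hdferr)
    offset(1) = cur_size(1) - 1
    count(1)  = 1
    call h5sselect_hyperslab_f(file_space, H5S_SELECT_SET_F, offset, &
                               count, hdferr)
    call h5screate_simple_f(1, count, mem_space, hdferr)
    call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, value, count, hdferr, &
                    mem_space, file_space)
    call h5sclose_f(mem_space, hdferr)
    call h5sclose_f(file_space, hdferr)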

With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about
370 bytes per empty dataset (totally reasonable).
With n = 1, I get an HDF5 file of size 7.13 MB, which surprises me. Why
such an increase?
With n = 2, I get an HDF5 file of size 7.15 MB, an increase of 0.02 MB,
which makes sense: 3000*8*1/1e6 = 0.024 MB.

When setting the chunk size to 10, I obtain the following results:

With n = 0, I get an HDF5 file of size 1.11 MB, which again corresponds to
about 370 bytes per empty dataset.
With n = 1, I get an HDF5 file of size 7.34 MB, which surprises me.
With n = 2, I get an HDF5 file of size 7.15 MB, which leads to an increase
of 3000*8*10/1e6 MB, which makes sense.

I don't understand the first jump in size; it makes this data storage quite
inefficient.
Do you think a compound dataset with 3000 columns would show the same
behaviour? I have not tried it, since I don't know how to map the contents
of an array when calling the h5dwrite_f function for a compound dataset.
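
(For reference, the HDF5 Fortran compound-type example maps arrays field by
field: build a one-member compound type in memory and call h5dwrite_f once
per field. A hedged, untested sketch, with illustrative names:)

    ! Write one column of a compound dataset from a plain array.
    ! Assumes dset_id is a dataset with a compound type containing a
    ! field named "col0001", and values(:) holds the data to write.
    integer(hid_t)  :: mem_type
    integer(size_t) :: dsize
    integer(size_t), parameter :: offset0 = 0
    integer(hsize_t), dimension(1) :: data_dims

    data_dims(1) = size(values)
    call h5tget_size_f(H5T_NATIVE_DOUBLE, dsize, hdferr)
    call h5tcreate_f(H5T_COMPOUND_F, dsize, mem_type, hdferr)
    call h5tinsert_f(mem_type, "col0001", offset0, H5T_NATIVE_DOUBLE, hdferr)
    call h5dwrite_f(dset_id, mem_type, values, data_dims, hdferr)
    call h5tclose_f(mem_type, hdferr)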

If I create 30000 datasets, I observe the same behaviour:
n = 0 -> 10.9 MB
n = 1 -> 73.2 MB

Thanks

Here is the error I get with a contiguous dataset:

  #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to
create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to
create new link to object
    major: Links
    minor: Unable to initialize object
  #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert
link
    major: Symbol table
    minor: Unable to insert object
  #004: hdf5-1.8.16/src/H5Gtraverse.c line 861 in H5G_traverse(): internal
path traversal failed
    major: Symbol table
    minor: Object not found
  #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real():
traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create
object
    major: Object header
    minor: Unable to initialize object
  #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open
object
    major: Object header
    minor: Can't open object
  #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to
create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to
construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct():
extendible contiguous non-external dataset
    major: Dataset
    minor: Feature is unsupported
HDF5-DIAG: Error detected in HDF5 (1.8.16)

···

2017-05-23 19:00 GMT+02:00 <hdf-forum-request@lists.hdfgroup.org>:

----------------------------------------------------------------------

Message: 1
Date: Tue, 23 May 2017 08:22:59 -0700
From: Quincey Koziol <koziol@lbl.gov>
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files

Hi Guillaume,
        Are you using chunked or contiguous datasets? If chunked, what
size are you using? Also, can you use the "latest" version of the format,
which should be smaller, but is only compatible with HDF5 1.10.x or later?
(i.e. H5Pset_libver_bounds with "latest" for low and high bounds,
https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm )
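
        For reference, a minimal Fortran sketch of setting those bounds
(assuming a library and readers at 1.10 or later; fapl_id and file_id are
illustrative names):

    integer(hid_t) :: fapl_id, file_id
    integer :: hdferr

    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
    call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, &
                                H5F_LIBVER_LATEST_F, hdferr)
    call h5fcreate_f("test.h5", H5F_ACC_TRUNC_F, file_id, hdferr, &
                     access_prp=fapl_id)
    call h5pclose_f(fapl_id, hdferr)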

        Quincey

> On May 23, 2017, at 3:02 AM, Guillaume Jacquenot <guillaume.jacquenot@gmail.com> wrote:
>
> Hello everyone!
>
> I am creating an HDF5 file from a Fortran program, and I am confused
> about the size of the generated HDF5 file.
>
> I am writing 19000 datasets with 21 values of 64 bits each (real numbers).
> I write one value at a time, extending each of the 19000 datasets by one
> every time.
> All data are correctly written.
> But the generated file is more than 48 MB.
> I expected the total size of the file to be a little bigger than the raw
> data, about 3.2 MB (21*19000*8 / 1e6 = 3.192 MB).
> If I only create 19000 empty datasets, I obtain a 6 MB HDF5 file, which
> means each empty dataset is about 400 bytes.
> I guess I could create a ~10 MB (6 MB + 3.2 MB) HDF5 file that contains
> everything.
>
> For comparison, if I write everything in a text file, where each real
> number is written with 15 characters, I obtain a 6 MB CSV file.
>
> Question 1)
> Is this behaviour normal?
>
> Question 2)
> Can extending a dataset each time we write data into it significantly
> increase the total required disk space?
> Can preallocating the dataset and using hyperslabs save some space?
> Can the chunk parameters impact the size of the generated HDF5 file?
>
> Question 3)
> If I pack everything into a compound dataset with 19000 columns, will the
> resulting file be smaller?
>
> N.B.:
> Looking at the example that generates 100000 groups (grplots.c), the size
> of the generated HDF5 file is 78 MB for 100000 empty groups.
> That means each group is about 780 bytes.
> https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
>
> Guillaume Jacquenot
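
On Question 2: if the number of timesteps is known up front, the dataset
can be preallocated with a fixed size (no unlimited dimension, so even a
contiguous layout is allowed) and each value written through a hyperslab.
A hedged sketch, all names illustrative:

    ! Preallocate a fixed-size 1-D dataset, then write slot i per step.
    integer(hsize_t), dimension(1) :: dims, offset, count
    integer(hid_t) :: space_id, mem_space, dset_id

    dims(1) = n_timesteps
    call h5screate_simple_f(1, dims, space_id, hdferr)
    call h5dcreate_f(file_id, dset_name, H5T_NATIVE_DOUBLE, space_id, &
                     dset_id, hdferr)
    ! ... then, at timestep i:
    offset(1) = i - 1
    count(1)  = 1
    call h5sselect_hyperslab_f(space_id, H5S_SELECT_SET_F, offset, &
                               count, hdferr)
    call h5screate_simple_f(1, count, mem_space, hdferr)
    call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, value, count, hdferr, &
                    mem_space, space_id)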

------------------------------

Message: 2
Date: Tue, 23 May 2017 08:46:07 -0700
From: Aaron Friesz <friesz@usc.edu>
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Parallel file access recommendation

A year or so back, we changed to BeeGFS as well. There were some issues
getting parallel I/O set up. The first thing you want to do is run the
parallel MPI-IO tests. I believe they can be found here:
https://support.hdfgroup.org/HDF5/Tutor/pprog.html

This will help you verify whether your cluster has MPI-IO set up correctly.
If that doesn't work, you'll need to get in touch with the management group
to fix it.

Then you need to make sure you are using an HDF5 library that is configured
for parallel I/O.

I know there aren't a lot of specifics here, but it took me about two weeks
of convincing to get my cluster management group to realize that things
weren't working quite right. Once everything was set up, I was able to
generate and write about 40 GB of data in around two minutes.
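
For reference, a minimal Fortran sketch of routing file I/O through the
MPI-IO driver (assumes an HDF5 build with parallel support, `use hdf5` and
`use mpi`, and MPI already initialized; names are illustrative):

    integer(hid_t) :: fapl_id, file_id
    integer :: hdferr

    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
    call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
    call h5fcreate_f("parallel_test.h5", H5F_ACC_TRUNC_F, file_id, hdferr, &
                     access_prp=fapl_id)
    call h5pclose_f(fapl_id, hdferr)
    call h5fclose_f(file_id, hdferr)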

On Tue, May 23, 2017 at 8:18 AM, Quincey Koziol <koziol@lbl.gov> wrote:

> Hi Jan,
>
> > On May 23, 2017, at 2:46 AM, Jan Oliver Oelerich <jan.oliver.oelerich@physik.uni-marburg.de> wrote:
> >
> > Hello HDF users,
> >
> > I am using HDF5 through NetCDF and I recently changed my program so that
> > each MPI process writes its data directly to the output file, as opposed
> > to the master process gathering the results and being the only one who
> > does I/O.
> >
> > Now I see that my program slows down file systems a lot (of the whole
> > HPC cluster) and I don't really know how to handle I/O. The file system
> > is a high-throughput BeeGFS system.
> >
> > My program uses a hybrid parallelization approach, i.e. work is split
> > into N MPI processes, each of which spawns M worker threads. Currently,
> > I write to the output file from each of the M*N threads, but the writing
> > is guarded by a mutex, so thread-safety shouldn't be a problem. Each
> > writing process is a complete `open file, write, close file` cycle.
> >
> > Each write is at a separate region of the HDF5 file, so no chunks are
> > shared among any two processes. The amount of data to be written per
> > process is 1/(M*N) times the size of the whole file.
> >
> > Shouldn't this be exactly how HDF5 + MPI is supposed to be used? What
> > is the `best practice` regarding parallel file access with HDF5?
>
> Yes, this is probably the correct way to operate, but generally
> things are much better for this case when collective I/O operations are
> used. Are you using collective or independent I/O? (Independent is the
> default)
>
> Quincey

···

Hello,

I am just reacting to this because of the chunk size. Every chunk carries
metadata, so a chunk should contain a non-negligible amount of data to
avoid inefficiencies and large file sizes. The guideline in the HDF5
documentation is a chunk size on the order of 1 MB.
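
For 64-bit reals that works out to roughly 131072 elements per chunk
(131072 * 8 bytes = 1 MiB). A sketch, with dcpl_id standing for a dataset
creation property list created elsewhere:

    ! ~1 MiB chunks for 8-byte reals. With a chunk size of 1, each chunk
    ! holds only 8 bytes of data but still costs per-chunk index
    ! metadata, which is where the overhead comes from.
    integer(hsize_t), dimension(1) :: chunk_dims = (/131072_hsize_t/)
    call h5pset_chunk_f(dcpl_id, 1, chunk_dims, hdferr)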

Regards,

Pierre
