Group creation gets very slow after a huge number of groups have been created

Hello all,

The HDF5 FAQ (https://www.hdfgroup.org/HDF5/faq/limits.html) refers to an example that creates 100'000 groups in the 'How many links can be in a group?' section.

My problem is that I need to create at least 1'000'000 groups in a single file, and group creation slows down dramatically after about 900'000 groups.
The application is written in C++ with HDF5 1.8.5, running on Windows 7 64-bit with 16 GB RAM.

For a faster investigation, I wrote a very simple Python example, and I can reproduce the issue on an iMac (64-bit, 32 GB RAM, OS X 10.11).
It takes 6-7 seconds on average to create each batch of 100'000 groups, but about 6 minutes for the last batch, once 900'000 groups have been created!

I suppose that I need to configure something in HDF5 to avoid this kind of issue, e.g. set a larger cache size, or something else...
I'd really appreciate it if someone knows the reason for this behavior!
Here is the Python example with the output it produced.
Best regards,
Levent

import h5py as h5
from datetime import datetime

print(h5.version.info)
hf = h5.File("f.h5", "w")
print(str(datetime.now())) # start timestamp

for i in range(1, 1000000):
    hf.create_group("/Acquisition."+str(i)) # create a group
    if not i % 100000:
        print(str(datetime.now()) + ' : ' + str(i)) # time stamp on each 100'000 groups created

print(str(datetime.now())) # end timestamp

Summary of the h5py configuration
---------------------------------
h5py 2.5.0
HDF5 1.8.13
Python 3.5.0 (default, Sep 14 2015, 02:37:27) [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.10.1

2015-11-25 10:16:48.109794
2015-11-25 10:16:54.340278 : 100000
2015-11-25 10:17:00.661270 : 200000
2015-11-25 10:17:07.006722 : 300000
2015-11-25 10:17:13.435274 : 400000
2015-11-25 10:17:19.829139 : 500000
2015-11-25 10:17:27.221807 : 600000
2015-11-25 10:17:33.599402 : 700000
2015-11-25 10:17:39.979077 : 800000
2015-11-25 10:17:46.284342 : 900000
2015-11-25 10:23:36.377318
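
To make the slowdown easier to read off, the timestamps in the output above can be turned into per-interval durations. The following is only a small sketch using the Python standard library; the list name `stamps` is introduced here for illustration, and the values are copied from the log above. The last interval works out to roughly 350 seconds, against 6-7 seconds for the earlier ones.

```python
from datetime import datetime

# Timestamps copied from the posted output: start, one per 100'000 groups, end.
stamps = [
    "2015-11-25 10:16:48.109794",
    "2015-11-25 10:16:54.340278",
    "2015-11-25 10:17:00.661270",
    "2015-11-25 10:17:07.006722",
    "2015-11-25 10:17:13.435274",
    "2015-11-25 10:17:19.829139",
    "2015-11-25 10:17:27.221807",
    "2015-11-25 10:17:33.599402",
    "2015-11-25 10:17:39.979077",
    "2015-11-25 10:17:46.284342",
    "2015-11-25 10:23:36.377318",
]

times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]

# Seconds elapsed between each pair of consecutive timestamps.
deltas = [(b - a).total_seconds() for a, b in zip(times, times[1:])]

for n, d in enumerate(deltas, start=1):
    print("interval %2d: %8.2f s" % (n, d))
```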

Hi,

Try using the H5Pset_libver_bounds function (see https://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds) with H5F_LIBVER_LATEST for the second and third arguments to set up a file access property list, and then use that access property list when opening an existing file or creating a new one.

Here is a C code snippet:

fapl_id = H5Pcreate (H5P_FILE_ACCESS);
H5Pset_libver_bounds (fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
file_id = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

By default, the HDF5 library uses the earliest version of the file format when creating groups. The indexing structure used for that version has a known deficiency when working with a large number (>50K) of objects in a group. The issue was addressed in HDF5 1.8, but it requires an application to “turn on” the latest file format.

The implications of the latest file format for performance are not well documented. The HDF Group is aware of the issue and will be addressing it in upcoming releases.

Elena

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
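
For h5py users, the same setting is also reachable from the high-level API through the `libver` keyword of `h5py.File`. The sketch below is not taken from the thread: the helper name `create_groups`, the temporary file location and the small group counts are assumptions chosen so that it runs quickly. At this scale no slowdown should be expected; the counts would have to be raised towards the numbers discussed above to observe one.

```python
import os
import tempfile
from time import perf_counter

import h5py


def create_groups(path, n_groups, batch, libver="latest"):
    """Create n_groups groups in the root of a new file.

    Returns the number of root members and the seconds spent per batch.
    (Hypothetical helper, written for illustration only.)
    """
    durations = []
    # libver="latest" asks HDF5 to use the latest file format for new objects.
    with h5py.File(path, "w", libver=libver) as hf:
        start = perf_counter()
        for i in range(1, n_groups + 1):
            hf.create_group("Acquisition.%d" % i)
            if i % batch == 0:
                now = perf_counter()
                durations.append(now - start)
                start = now
        count = len(hf)  # number of links in the root group
    return count, durations


path = os.path.join(tempfile.mkdtemp(), "f.h5")
count, durations = create_groups(path, n_groups=2000, batch=500)
print(count, ["%.3f s" % d for d in durations])
```

Passing libver="earliest" to the same helper gives the default behaviour, so the two settings can be compared on the same machine.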

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org<mailto:Hdf-forum@lists.hdfgroup.org>
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Wouaoooh!
I'm very impressed by your answer, it works like a charm!!
Honestly, I think I would never have found this by myself...
I've tested it with my Python script, and I'll try it tomorrow with the C++ application.

import h5py as h5
from datetime import datetime

print(h5.version.info)

# file access property list that requests the latest file format
fapl = h5.h5p.create(h5.h5p.FILE_ACCESS)
print(h5.h5p.PropFAID.set_libver_bounds(fapl, h5.h5f.LIBVER_LATEST, h5.h5f.LIBVER_LATEST))

hf = h5.h5f.create(b'f.h5', h5.h5f.ACC_TRUNC, None, fapl)

print(str(datetime.now())) # start timestamp

for i in range(1, 1000000):
    g = h5.h5g.create(hf, b"/Acquisition.%d" % i) # create a group
    if not i % 100000:
        print(str(datetime.now()) + ' : ' + str(i))
        h5.h5f.flush(hf, h5.h5f.SCOPE_GLOBAL) # flush after each 100'000 groups

print(str(datetime.now())) # end timestamp
print(hf.get_freespace())

Summary of the h5py configuration
---------------------------------
h5py 2.5.0
HDF5 1.8.13
Python 3.5.0 (default, Sep 14 2015, 02:37:27)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.10.1

None
2015-11-25 21:55:11.780982
2015-11-25 21:55:16.172213 : 100000
2015-11-25 21:55:21.737195 : 200000
2015-11-25 21:55:27.673166 : 300000
2015-11-25 21:55:33.703066 : 400000
2015-11-25 21:55:39.834696 : 500000
2015-11-25 21:55:46.142189 : 600000
2015-11-25 21:55:52.880594 : 700000
2015-11-25 21:55:59.394233 : 800000
2015-11-25 21:56:05.996508 : 900000
2015-11-25 21:56:12.686513
946614

Process finished with exit code 0

Thanks a lot Elena!
:wink:
Levent


Interesting observation. I also saw a performance drop after creating many HDF5 files when I ran a micro-benchmark. I can attach the results that show this.

Thanks,

Dimos
