Large number of datasets

Hi,

I'm planning on using HDF5 to store data for large image collections (1M+
images). Ideally I'd like to store the data from each image as a separate
dataset in an HDF5 file. When I experimented previously with a flat
hierarchy (i.e. each image linked to the root group), HDF5 seemed to become
extremely slow at almost any operation (iteration, etc.). When I asked about
this on the PyTables mailing list, the consensus seemed to be that a
hierarchy in which no group holds more than 256 nodes was required to
maintain acceptable speeds (with v1.6.5). At the time I didn't follow this
suggestion up, but I'm wondering now whether the situation has changed at
all with HDF5 1.8.0.

Does anyone else have experience with storing such a large number of datasets?

Many thanks,
James

These are the C programs I used to stress-test 1.6.5 (creation was fairly
fast, but iteration took >1 s per dataset):
--- hdf5_stress_test.c ---
#include <stdio.h>
#include <stdlib.h>

#include "H5LT.h"

int
main(void)
{
    hid_t   file_id;
    hsize_t dims[2] = {16, 16};
    int     data[256] = {0};        /* 16x16 dummy "image" */
    char    dset_name[32];
    herr_t  status;
    int     i;
    int     total = 1000000;

    file_id = H5Fcreate("hdf5_stress_test.h5", H5F_ACC_TRUNC,
                        H5P_DEFAULT, H5P_DEFAULT);

    /* Create one small 2-D integer dataset per "image", all linked
     * directly to the root group. */
    for (i = 0; i < total; ++i) {
        sprintf(dset_name, "/dset_%07d", i);
        status = H5LTmake_dataset(file_id, dset_name, 2, dims,
                                  H5T_NATIVE_INT, data);
        if (!(i % 1000)) {
            printf("\r[%07d/%07d]", i, total);
            fflush(stdout);
        }
    }
    status = H5Fclose(file_id);

    return 0;
}
--- ---

--- hdf5_iterate.c ---
#include "hdf5.h"

herr_t file_info(hid_t loc_id, const char *name, void *opdata);

int
main(void)
{
hid_t file;
hid_t dataset;
hid_t group;

file = H5Fopen("hdf5_stress_test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

H5Giterate(file, "/", NULL, file_info, NULL);

H5Fclose(file);

return 0;
}

herr_t file_info(hid_t loc_id, const char *name, void *opdata)
{
   H5G_stat_t statbuf;

   /*
    * Get type of the object and display its name and type.
    * The name of the object is passed to this function by
    * the Library. Some magic :slight_smile:
    */
   H5Gget_objinfo(loc_id, name, 0, &statbuf);
   switch (statbuf.type) {
   case H5G_GROUP:
        printf(" Object with name %s is a group \n", name);
        break;
   case H5G_DATASET:
        printf(" Object with name %s is a dataset \n", name);
        break;
   case H5G_TYPE:
        printf(" Object with name %s is a named datatype \n", name);
        break;
   default:
        printf(" Unable to identify an object ");
   }
   return 0;
}
--- ---
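
(For anyone wanting to reproduce this: both programs should build with the
h5cc compiler wrapper that ships with HDF5, e.g.
h5cc hdf5_stress_test.c -o hdf5_stress_test; depending on the installation,
the H5LT call may also need the high-level library linked in explicitly
with -lhdf5_hl.)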


Hi James,

On Jul 13, 2008, at 4:34 PM, James Philbin wrote:

> At the time I didn't follow this suggestion up, but I'm wondering now
> whether the situation has changed at all with HDF5 1.8.0.

  Yes, some of the enhancements to groups in the 1.8.x releases should help speed up groups with many links to objects. Try using the H5Pset_libver_bounds() routine with H5F_LIBVER_LATEST for both the upper and lower bounds and see if that speeds things up for you.
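
  For reference, here is a minimal sketch of that call (illustrative only; it
reuses the file name from the stress-test program above and assumes the 1.8
API):

#include "hdf5.h"

int
main(void)
{
    hid_t fapl;
    hid_t file_id;

    /* Request the 1.8 ("latest") file format so the new-style group
     * storage is used for the links in the root group. */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    file_id = H5Fcreate("hdf5_stress_test.h5", H5F_ACC_TRUNC,
                        H5P_DEFAULT, fapl);

    /* ... create the 1M datasets exactly as in hdf5_stress_test.c ... */

    H5Fclose(file_id);
    H5Pclose(fapl);

    return 0;
}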

  Quincey


> Does anyone else have experience with storing such a large number of datasets?

I have noticed that once you get beyond 10k datasets per file the performance
begins to drop; by the time you hit 200k datasets, it is non-functional.

I must emphasize that this is about the number of data objects, not the
dimensional size of a single dataset. I have not noticed problems with the
size of a single dataset.

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA


Hi Matthew!

On Jul 13, 2008, at 4:43 PM, Dougherty, Matthew T. wrote:

> I have noticed that once you get beyond 10k datasets per file the performance
> begins to drop; by the time you hit 200k datasets, it is non-functional.
>
> I must emphasize that this is about the number of data objects, not the
> dimensional size of a single dataset.

  We're still working on the issues for your particular use case (which might be similar to James') and should have a snapshot for you to test out when it's working properly. It appears that the current file free space tracking algorithm was performing very poorly in that situation, so we've spent some time updating it and will be running some benchmarks on the new algorithm shortly. We'll make those public after we're certain that things are working correctly.

  Quincey


> We're still working on the issues for your particular use case (which
> might be similar to James')

Just to confirm that this is indeed my use case -- I have a large
number of individual datasets, not a single large dataset.

> 256 nodes/group? We've changed this number to 4096 some years ago.

My mistake, re-reading my emails, I see that this was indeed the
number I was told. However 4096 is still a long way from 1M!

I suppose my question is really: How can I deal with this large number
of datasets now, given the limitations that hdf5 currently has?
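
For concreteness, the kind of subgrouped layout that was suggested on the
PyTables list would look something like the sketch below (illustrative only;
the fan-out of 1000 datasets per group and all names are arbitrary, and it
assumes the 1.8 API for H5Gcreate2):

#include <stdio.h>

#include "H5LT.h"

int
main(void)
{
    hid_t   file_id, group_id;
    hsize_t dims[2] = {16, 16};
    int     data[256] = {0};
    char    name[64];
    int     i;
    int     total = 1000000;

    file_id = H5Fcreate("hdf5_subgroup_test.h5", H5F_ACC_TRUNC,
                        H5P_DEFAULT, H5P_DEFAULT);

    for (i = 0; i < total; ++i) {
        /* Start a new intermediate group every 1000 datasets, so no
         * single group ever holds more than 1000 links. */
        if (!(i % 1000)) {
            sprintf(name, "/g_%04d", i / 1000);
            group_id = H5Gcreate2(file_id, name, H5P_DEFAULT,
                                  H5P_DEFAULT, H5P_DEFAULT);
            H5Gclose(group_id);
        }
        sprintf(name, "/g_%04d/dset_%07d", i / 1000, i);
        H5LTmake_dataset(file_id, name, 2, dims, H5T_NATIVE_INT, data);
    }
    H5Fclose(file_id);

    return 0;
}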

Thanks,
James


On Sunday 13 July 2008, James Philbin wrote:

> When I asked about this on the PyTables mailing list, the consensus seemed
> to be that a hierarchy in which no group holds more than 256 nodes was
> required to maintain acceptable speeds (with v1.6.5).

256 nodes/group? We've changed this number to 4096 some years ago.
Maybe we forgot to check this number before giving you that advice :-/
At any rate, I'll be happy to remove the warning that is issued when
this limit is reached as soon as the HDF crew solves this definitively.

Cheers,


--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Hi James,


On Jul 14, 2008, at 3:06 AM, James Philbin wrote:

> I suppose my question is really: How can I deal with this large number
> of datasets now, given the limitations that hdf5 currently has?

  Try using the H5Pset_libver_bounds() call I suggested earlier; that should ease things until we've finished this next section of coding.

  Quincey

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.