Opening datasets expensive?

Hi, I am using HDF5 as the backend for a genomics visualizer. The data is organized by experiment, chromosome, and resolution scale. A typical file might have 300 or so experiments, 24 chromosomes, and 8 resolution scales. My current design uses a separate dataset for each combination of experiment, chromosome, and resolution scale, or 57,600 datasets in all.

First question: is that too many datasets? I could combine the experiment and chromosome dimensions, with a corresponding reduction in the number of datasets and increase in each dataset's size. It would complicate the application code, but it is doable.

The application is a visualization tool and needs to access small portions of each dataset very quickly. It is organized similarly to Google Maps: as the user zooms and pans, small slices of the datasets are accessed and rendered. The number of datasets accessed at one time is equal to the number of experiments. It works fine with small numbers of experiments (< 20), but panning and zooming are noticeably sluggish with 300. I did some profiling and discovered that about 70% of the time is spent just opening the datasets. Is this to be expected? Is it good practice to have a few large datasets rather than many smaller ones?

Oh, I'm using the Java JNI wrapper (H5). I am not using the object API, just the JNI wrapper functions.
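
For reference, the per-tile access path currently looks roughly like the sketch below (simplified, with invented dataset names and extents, and assuming the 1.6-style signatures of the Java wrapper):

```java
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;

public class TileReader {

    // Read a small 1-D window from one experiment/chromosome/scale dataset.
    // With ~300 experiments this method is called ~300 times per pan/zoom,
    // and each call pays for its own H5Dopen.
    public static float[] readSlice(int fileId, String datasetPath,
                                    long start, long count) throws Exception {
        int dsetId = H5.H5Dopen(fileId, datasetPath);      // the expensive part
        int fileSpace = H5.H5Dget_space(dsetId);

        // Select only the window the viewport needs.
        H5.H5Sselect_hyperslab(fileSpace, HDF5Constants.H5S_SELECT_SET,
                new long[]{start}, new long[]{1},
                new long[]{count}, new long[]{1});
        int memSpace = H5.H5Screate_simple(1, new long[]{count}, new long[]{count});

        float[] buf = new float[(int) count];
        H5.H5Dread(dsetId, HDF5Constants.H5T_NATIVE_FLOAT,
                memSpace, fileSpace, HDF5Constants.H5P_DEFAULT, buf);

        H5.H5Sclose(memSpace);
        H5.H5Sclose(fileSpace);
        H5.H5Dclose(dsetId);
        return buf;
    }
}
```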

Thanks for any tips.

Jim Robinson
Broad Institute

···

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Jim,

This is a known performance problem related to the behavior of the HDF5 metadata cache. We have a fix and will be testing it over the next few days.

Would you like to get a tarball when it is available and see whether the fix addresses the problem? The fix will be in the 1.8 branch.

Elena

···


--

------------------------------------------------------------
Elena Pourmal
The HDF Group
1901 So First ST.
Suite C-2
Champaign, IL 61820

epourmal@hdfgroup.org
(217)333-0238 (office)
(217)333-9049 (fax)
------------------------------------------------------------


On Thursday, 17 January 2008, Jim Robinson wrote:

Is this to be expected? Is it good practice to have a few large datasets rather than many smaller ones?

My experience says that it is definitely better to have fewer large
datasets than many smaller ones.

Even if, as Elena says, there is a bug in the metadata cache that the THG people are fixing, if you want maximum speed when accessing parts of your data, my guess is that accessing different parts of a single large dataset will always be faster than accessing many different datasets. This is because each dataset has its own metadata that has to be retrieved from disk, whereas to access part of a large dataset you only have to read the portion of the B-tree needed to reach it (if it is not already in memory) and the data itself, which is pretty fast.
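
To make that concrete for Jim's case, a combined layout could be one 2-D dataset per chromosome and resolution scale, with one row per experiment; a single H5Dopen plus one hyperslab read then fetches the visible window for all experiments at once. A rough sketch of what the read might look like with the Java wrapper Jim mentions (my guess only; the layout, names, and 1.6-style signatures are assumptions):

```java
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;

public class CombinedRead {

    // Read a genomic window for all experiments from one 2-D dataset laid
    // out as [experiment][position]. The returned buffer is row-major,
    // i.e. value(e, p) = buf[e * count + p].
    public static float[] readWindow(int fileId, String chrScalePath,
                                     int nExperiments, long start, long count)
            throws Exception {
        int dsetId = H5.H5Dopen(fileId, chrScalePath);   // one open per pan/zoom
        int fileSpace = H5.H5Dget_space(dsetId);

        // Rows = all experiments, columns = the small window on screen.
        long[] offset = {0, start};
        long[] extent = {nExperiments, count};
        H5.H5Sselect_hyperslab(fileSpace, HDF5Constants.H5S_SELECT_SET,
                offset, new long[]{1, 1}, extent, new long[]{1, 1});
        int memSpace = H5.H5Screate_simple(2, extent, extent);

        float[] buf = new float[nExperiments * (int) count];
        H5.H5Dread(dsetId, HDF5Constants.H5T_NATIVE_FLOAT,
                memSpace, fileSpace, HDF5Constants.H5P_DEFAULT, buf);

        H5.H5Sclose(memSpace);
        H5.H5Sclose(fileSpace);
        H5.H5Dclose(dsetId);
        return buf;
    }
}
```

If the dataset is chunked so that each chunk spans all experiments over a modest genomic window, each pan/zoom should only touch a handful of chunks.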

My 2 cents,

···

--

>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


My preliminary performance analysis on writing 2-D 64x64 32-bit pixel images:
1) stdio and HDF5 are comparable up to 10k images.
2) HDF5 performance diverges badly (non-linearly) after 10k images, while stdio remains linear.
3) By 200k images, HDF5 is 20x slower.
4) If you stack the 2-D images as a single 3-D 200k x 64 x 64 32-bit dataset, HDF5 is 20-40% faster than stdio (sketched below).
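
For clarity, the stacked layout in (4) is just one chunked 3-D dataset written one image plane at a time. The sketch below is illustrative only (written against the Java wrapper used earlier in the thread rather than my actual test code, with invented names and the 1.6-style H5Dcreate signature):

```java
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;

public class ImageStack {

    // Create the 3-D stack dataset [nImages][64][64] of 32-bit ints,
    // chunked so that each image plane is one chunk.
    public static int createStack(int fileId, long nImages) throws Exception {
        long[] dims  = {nImages, 64, 64};
        long[] chunk = {1, 64, 64};
        int space = H5.H5Screate_simple(3, dims, dims);
        int dcpl  = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
        H5.H5Pset_chunk(dcpl, 3, chunk);
        int dset = H5.H5Dcreate(fileId, "/stack",
                HDF5Constants.H5T_NATIVE_INT, space, dcpl);
        H5.H5Pclose(dcpl);
        H5.H5Sclose(space);
        return dset;
    }

    // Write one 64x64 image (pixels.length == 4096) into plane `index`.
    public static void writePlane(int dset, long index, int[] pixels) throws Exception {
        long[] plane = {1, 64, 64};
        int fileSpace = H5.H5Dget_space(dset);
        H5.H5Sselect_hyperslab(fileSpace, HDF5Constants.H5S_SELECT_SET,
                new long[]{index, 0, 0}, new long[]{1, 1, 1},
                plane, new long[]{1, 1, 1});
        int memSpace = H5.H5Screate_simple(3, plane, plane);
        H5.H5Dwrite(dset, HDF5Constants.H5T_NATIVE_INT,
                memSpace, fileSpace, HDF5Constants.H5P_DEFAULT, pixels);
        H5.H5Sclose(memSpace);
        H5.H5Sclose(fileSpace);
    }
}
```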

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

···


Hi all,

I've only tested with large datasets, using:
1) raw files
2) opaque datatypes (serialize with another library and store the data as an opaque type)
3) native HDF5 structures with variable-length arrays
4) in-memory buffered native HDF5 structures
5) breakdown of the structures into HDF5 native arrays

These are my results for writing files ranging in total size from 1 GB to 100 GB:

Writing opaque datatypes is always faster than writing a raw file, though that does not account for the time to serialize.
Writing HDF5 native structures, especially with variable-length arrays, is always slower, by up to a factor of 2.
Rearranging the data into fixed-size arrays is usually about 20% slower.

On NFS things seem to be different, with HDF5 outperforming raw files in most cases.

Reading a dataset is usually faster from the raw file the first time, but faster with HDF5 after that!

Overwriting a dataset is always significantly faster with HDF5.

In all cases, writing opaque datatypes is the fastest option (a rough sketch follows).
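
To illustrate option 2: the opaque path just declares an opaque datatype whose element size is the serialized record size and writes the pre-serialized bytes as-is. A rough sketch (in the Java wrapper used earlier in the thread rather than my actual code; names and the 1.6-style H5Dcreate signature are assumptions):

```java
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;

public class OpaqueWrite {

    // Write nRecords pre-serialized records (each recordSize bytes, packed
    // back-to-back in `serialized`) as a 1-D dataset of an opaque datatype.
    public static void writeOpaque(int fileId, byte[] serialized,
                                   long nRecords, int recordSize) throws Exception {
        int tid = H5.H5Tcreate(HDF5Constants.H5T_OPAQUE, recordSize);
        H5.H5Tset_tag(tid, "app-serialized-record");   // the tag is just a label

        int space = H5.H5Screate_simple(1, new long[]{nRecords}, new long[]{nRecords});
        int dset = H5.H5Dcreate(fileId, "/records", tid, space,
                HDF5Constants.H5P_DEFAULT);

        // The bytes go to disk untouched, which is why this tracks raw-file speed.
        H5.H5Dwrite(dset, tid, HDF5Constants.H5S_ALL, HDF5Constants.H5S_ALL,
                HDF5Constants.H5P_DEFAULT, serialized);

        H5.H5Dclose(dset);
        H5.H5Sclose(space);
        H5.H5Tclose(tid);
    }
}
```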

HTH

-- dimitris

···

2008/1/18, Dougherty, Matthew T. <matthewd@bcm.tmc.edu>:

Hi Dimitris,

Looks like you did the same thing I did.
Francesc pointed it out to me a few minutes ago; below is his email to me:

========

Matthew,

Interesting experiment. BTW, you have only sent it to me, which is great, but are you sure that you don't want to share this with the rest of the HDF list? :wink:

Cheers, Francesc

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

--
What is the difference between mechanical engineers and civil engineers?
Mechanical engineers build weapons; civil engineers build targets.


On Jan 18, 2008, at 3:29 AM, Dougherty, Matthew T. wrote:

My preliminary performance analysis on writing 2-D 64x64 32-bit pixel images:
1) stdio and HDF5 are comparable up to 10k images.
2) HDF5 performance diverges badly (non-linearly) after 10k images, while stdio remains linear.
3) By 200k images, HDF5 is 20x slower.

  I'm pretty confident that our next set of performance improvements will remedy this slowdown at the larger scales. We'll see in a week or so... :slight_smile:

  Quincey
