2008/1/18, Dougherty, Matthew T. <matthewd@bcm.tmc.edu>:
Hi Dimitris,
Looks like you did the same thing I did.
Francesc pointed it out to me a few minutes ago, below is his email to me
========
Matthew,
Interesting experiment. BTW, you have only sent it to me, which is
great, but are you sure you don't want to share this with the rest
of the HDF list?
Cheers, Francesc
Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
-----Original Message-----
From: Dimitris Servis [mailto:servisster@gmail.com]
Sent: Fri 1/18/2008 4:38 AM
To: Dougherty, Matthew T.
Subject: Re: Opening datasets expensive?
Hi all,
I've only tested with large datasets using:
1) raw files
2) opaque datatypes (serialize with another lib and save data as opaque type)
3) native HDF structures with variable arrays
4) In memory buffered native HDF structures
5) Breakdown of structures to HDF5 native arrays
These are my results for writing files ranging in total from 1GB to 100GB:
Writing opaque datatypes is always faster than writing a raw file, though this
does not account for the time spent serializing.
Writing HDF5 native structures, especially with variable-length arrays, is
always slower, by up to a factor of 2.
Rearranging the data into fixed-size arrays is usually about 20% slower.
On NFS things seem to be different, with HDF5 outperforming the raw file in
most cases.
Reading a dataset is usually faster with the raw file the first time, but
faster with HDF5 on subsequent reads!
Overwriting a dataset is always significantly faster with HDF5.
In all cases, writing opaque datatypes is the fastest option.
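For concreteness, here is a minimal sketch of what I mean by the opaque-datatype
approach (option 2 above), using the HDF5 C API; the file name, tag and buffer
are only illustrative, and the serialization itself happens elsewhere:

    /* Store an externally serialized buffer as a 1-D dataset of opaque bytes. */
    #include "hdf5.h"

    int write_opaque_blob(const void *buf, size_t nbytes)
    {
        hid_t file  = H5Fcreate("blob.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dtype = H5Tcreate(H5T_OPAQUE, 1);          /* one-byte opaque element */
        H5Tset_tag(dtype, "serialized-by-external-lib"); /* illustrative tag */

        hsize_t dims[1] = { (hsize_t)nbytes };
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/blob", dtype, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        herr_t status = H5Dwrite(dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Dclose(dset); H5Sclose(space); H5Tclose(dtype); H5Fclose(file);
        return status < 0 ? -1 : 0;
    }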
HTH
-- dimitris
2008/1/18, Dougherty, Matthew T. <matthewd@bcm.tmc.edu>:
>
> my preliminary performance analysis on writing 2D 64x64 32bit pixel
> images:
> 1) stdio and HDF are comparable up to 10k images
> 2) HDF performance diverges badly and non-linearly after 10k images, while
> stdio stays linear.
> 3) By 200k images HDF is 20x slower.
> 4) if you stack the 2D images as a single 3D 200k x 64 x 64 32-bit pixel stack,
> HDF is 20-40% faster than stdio (see the sketch below).
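>
> To make point 4 concrete, a minimal sketch of the stacked layout with the HDF5
> C API; the dataset name, the one-image-per-chunk choice, and the assumption of
> an already open file handle are illustrative only:
>
>     #include "hdf5.h"
>
>     /* Write nimages 64x64 32-bit images (contiguous in `pixels`) into one
>        3D dataset instead of one dataset per image. */
>     void write_stack(hid_t file, const int *pixels, hsize_t nimages)
>     {
>         hsize_t dims[3]  = { nimages, 64, 64 };
>         hsize_t chunk[3] = { 1, 64, 64 };              /* one image per chunk */
>         hid_t space = H5Screate_simple(3, dims, NULL);
>         hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
>         H5Pset_chunk(dcpl, 3, chunk);
>         hid_t dset  = H5Dcreate2(file, "/stack", H5T_NATIVE_INT32, space,
>                                  H5P_DEFAULT, dcpl, H5P_DEFAULT);
>
>         hsize_t count[3] = { 1, 64, 64 };
>         hid_t mem = H5Screate_simple(3, count, NULL);
>         for (hsize_t i = 0; i < nimages; i++) {        /* one image per write */
>             hsize_t start[3] = { i, 0, 0 };
>             H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
>             H5Dwrite(dset, H5T_NATIVE_INT32, mem, space, H5P_DEFAULT,
>                      pixels + i * 64 * 64);
>         }
>         H5Sclose(mem); H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
>     }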
>
>
>
> Matthew Dougherty
> 713-433-3849
> National Center for Macromolecular Imaging
> Baylor College of Medicine/Houston Texas USA
>
> -----Original Message-----
> From: Francesc Altet [mailto:faltet@carabos.com]
> Sent: Thu 1/17/2008 6:23 AM
> To: hdf-forum@hdfgroup.org
> Subject: Re: Opening datasets expensive?
>
> On Thursday, 17 January 2008, Jim Robinson wrote:
> > Hi, I am using HDF5 as the backend for a genomics visualizer. The
> > data is organized by experiment, chromosome, and resolution scale.
> > A typical file might have 300 or so experiments, 24 chromosomes, and
> > 8 resolution scales. My current design uses a separate dataset for each
> > experiment, chromosome, and resolution scale, or 57,600 datasets in all.
> >
> > First question: is that too many datasets? I could combine the
> > experiment and chromosome dimensions, with a corresponding reduction
> > in the number of datasets and an increase in each dataset's size. It
> > would complicate the application code but is doable.
> >
> > The application is a visualization tool and needs to access small portions
> > of each dataset very quickly. It is organized similarly to Google Maps:
> > as the user zooms and pans, small slices of datasets are accessed
> > and rendered. The number of datasets accessed at one time is equal
> > to the number of experiments. It is working fine with small numbers
> > of experiments, < 20, but panning and zooming is noticeably
> > sluggish with 300. I did some profiling and discovered that about
> > 70% of the time is spent just opening the datasets. Is this to be
> > expected? Is it good practice to have a few large datasets
> > rather than many smaller ones?
>
> My experience says that it is definitely better to have fewer large
> datasets than many smaller ones.
>
> Even if, as Elena says, there is a bug in the metadata cache that
> the THG people are fixing, if you want maximum speed when accessing parts of
> your data, my guess is that accessing different parts of one large
> dataset will always be faster than accessing many separate datasets. This
> is because each dataset has its own metadata that has to be retrieved from
> disk, while for accessing parts of a large dataset you only have to
> read the part of the B-tree needed to reach them (if not yet in memory) and
> the data itself, which is pretty fast.
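>
> To make the idea concrete, a minimal sketch of pulling a small slice out of
> one large dataset with the C API; the dataset path, element type, offset,
> length, and the already open file handle are only placeholders:
>
>     /* Read `length` floats starting at `offset` from one large 1-D dataset. */
>     hid_t   dset = H5Dopen2(file, "/experiments/all", H5P_DEFAULT);
>     hid_t   fspc = H5Dget_space(dset);
>     hsize_t start[1] = { offset }, count[1] = { length };
>     H5Sselect_hyperslab(fspc, H5S_SELECT_SET, start, NULL, count, NULL);
>     hid_t   mspc = H5Screate_simple(1, count, NULL);
>     float  *buf  = malloc(length * sizeof *buf);
>     H5Dread(dset, H5T_NATIVE_FLOAT, mspc, fspc, H5P_DEFAULT, buf);
>     /* ... hand `buf` to the renderer ... */
>     free(buf);
>     H5Sclose(mspc); H5Sclose(fspc); H5Dclose(dset);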
>
> My 2 cents,
>
> --
> >0,0< Francesc Altet http://www.carabos.com/
> V V Cárabos Coop. V. Enjoy Data
> "-"
>
> ----------------------------------------------------------------------
> This mailing list is for HDF software users discussion.
> To subscribe to this list, send a message to
> hdf-forum-subscribe@hdfgroup.org.
> To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.
>
--
What is the difference between mechanical engineers and civil engineers?
Mechanical engineers build weapons; civil engineers build targets.