Writing Performance

Hi,

I've been looking into HDF5 as a possible data store for an application.

I'd appreciate some assistance with understanding writing performance.

The data that is being written is a compound data type of 28 bytes (one
unsigned 8-byte int, four 4-byte floats, and one unsigned 4-byte int; 1 x
STD_U64LE, 4 x IEEE_F32LE, 1 x STD_U32LE).

The chunk size is set to 5500 elements, and the cache settings are at their defaults.

The writing is done one row at a time (appending to the end, using the dataset
API, not the packet table or table API), and the performance I get is
around 200,000 rows per second, which is below my expectations.
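
For reference, the setup and the per-row append loop look roughly like the
sketch below (illustrative names and values, no error checking; not my exact
code):

#include "hdf5.h"
#include <stdint.h>

typedef struct {                /* in-memory record; stored as a packed 28-byte compound */
    uint64_t id;                /* STD_U64LE in the file      */
    float    v[4];              /* 4 x IEEE_F32LE in the file */
    uint32_t flags;             /* STD_U32LE in the file      */
} record_t;

int main(void)
{
    /* Compound types: native layout in memory, little-endian 28-byte layout on disk. */
    hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(mtype, "id",    HOFFSET(record_t, id),    H5T_NATIVE_UINT64);
    H5Tinsert(mtype, "v0",    HOFFSET(record_t, v[0]),  H5T_NATIVE_FLOAT);
    H5Tinsert(mtype, "v1",    HOFFSET(record_t, v[1]),  H5T_NATIVE_FLOAT);
    H5Tinsert(mtype, "v2",    HOFFSET(record_t, v[2]),  H5T_NATIVE_FLOAT);
    H5Tinsert(mtype, "v3",    HOFFSET(record_t, v[3]),  H5T_NATIVE_FLOAT);
    H5Tinsert(mtype, "flags", HOFFSET(record_t, flags), H5T_NATIVE_UINT32);

    hid_t ftype = H5Tcreate(H5T_COMPOUND, 28);
    H5Tinsert(ftype, "id",    0,  H5T_STD_U64LE);
    H5Tinsert(ftype, "v0",    8,  H5T_IEEE_F32LE);
    H5Tinsert(ftype, "v1",    12, H5T_IEEE_F32LE);
    H5Tinsert(ftype, "v2",    16, H5T_IEEE_F32LE);
    H5Tinsert(ftype, "v3",    20, H5T_IEEE_F32LE);
    H5Tinsert(ftype, "flags", 24, H5T_STD_U32LE);

    /* Extendible 1-D dataset, chunked at 5500 elements, default cache settings. */
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 5500, one = 1;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);

    hid_t file = H5Fcreate("records.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "records", ftype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    hid_t mspace = H5Screate_simple(1, &one, NULL);   /* one row in memory */

    for (hsize_t row = 0; row < 1000000; row++) {     /* row count is illustrative */
        record_t rec = { row, {1.0f, 2.0f, 3.0f, 4.0f}, 0 };

        /* Extend by one row, select the new last element, write it. */
        hsize_t newsize = row + 1;
        H5Dset_extent(dset, &newsize);
        hid_t fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &row, NULL, &one, NULL);
        H5Dwrite(dset, mtype, mspace, fspace, H5P_DEFAULT, &rec);
        H5Sclose(fspace);
    }

    H5Sclose(mspace); H5Sclose(space); H5Pclose(dcpl);
    H5Tclose(mtype);  H5Tclose(ftype);
    H5Dclose(dset);   H5Fclose(file);
    return 0;
}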

1. Is this the expected performance or am I possibly doing something wrong?

2. If I understand correctly, even though I'm writing one row at a time, the
data isn't actually written to disk until the chunk is evicted from the
cache; only at that point is the entire chunk written to disk (until then,
writes go only to the chunk in the cache). If that is true, I would expect
the performance to be similar to writing blocks of 5500 x 28 bytes (chunk
size * compound data type size) = 154,000 bytes to the HDD, which I would
expect to perform at least 5x better.
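
Rough arithmetic behind that expectation:

    200,000 rows/s x 28 bytes/row = ~5.6 MB/s   (what I currently get)
    5500 rows x 28 bytes/row      = 154,000 bytes per chunk-sized write

and sequential writes of ~150 KB blocks should let an ordinary HDD sustain far
more than 5.6 MB/s.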

Is my understanding correct?

Does writing one record at a time cause overhead? If it does, where is the
overhead coming from?

···


> Hi,

Hi Charles,

> 1. Is this the expected performance or am I possibly doing something wrong?

Expected performance is tricky to answer, given how many variables are
involved. A good way to check the upper bound of the performance you can
expect from HDF5 on your machine is to use h5perf_serial.

> 2. If I understand correctly, even though I'm writing one row at a time, the
> data isn't actually written to disk until the chunk is evicted from the
> cache; only at that point is the entire chunk written to disk (until then,
> writes go only to the chunk in the cache). If that is true, I would expect
> the performance to be similar to writing blocks of 5500 x 28 bytes (chunk
> size * compound data type size) = 154,000 bytes to the HDD, which I would
> expect to perform at least 5x better.
>
> Is my understanding correct?
>
> Does writing one record at a time cause overhead? If it does, where is the
> overhead coming from?

All of the operations involved in writing data to a dataset have a
non-zero cost: resizing the dataset, allocating the read/write
dataspaces, and performing the write all take some time, time that
you've now brought into your innermost loop.

A general strategy for investigating possible optimizations, one that
should serve you well as you continue with HDF5, is to try several
configurations and compare their performance. That said, I would
definitely investigate writing more than one row to the dataset at a
time.
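
For example, a batched append might look roughly like this (an untested
sketch; "dset" is the extendible 1-D dataset and "mtype" the in-memory
compound type from your setup, "nrows" the number of rows already written):

#include "hdf5.h"

static herr_t append_batch(hid_t dset, hid_t mtype, hsize_t *nrows,
                           const void *records, hsize_t count)
{
    hsize_t start   = *nrows;
    hsize_t newsize = *nrows + count;

    /* One extend, one selection and one write for the whole batch. */
    if (H5Dset_extent(dset, &newsize) < 0)
        return -1;

    hid_t fspace = H5Dget_space(dset);
    hid_t mspace = H5Screate_simple(1, &count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);

    herr_t status = H5Dwrite(dset, mtype, mspace, fspace, H5P_DEFAULT, records);

    H5Sclose(mspace);
    H5Sclose(fspace);
    if (status >= 0)
        *nrows = newsize;
    return status;
}

Accumulating a chunk's worth of rows (5500) in a plain array and writing them
with one call like this keeps the extend/select/write cost out of the
innermost loop.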

···

On Fri, Aug 5, 2011 at 11:09 PM, Charles Darwin <trokabg@yahoo.com> wrote:

--
Mike Davis
mikedavis@uchicago.edu

Hi Charles,

> 2. If I understand correctly, even though I'm writing one row at a time, the
> data isn't actually written to disk until the chunk is evicted from the
> cache; only at that point is the entire chunk written to disk (until then,
> writes go only to the chunk in the cache). If that is true, I would expect
> the performance to be similar to writing blocks of 5500 x 28 bytes (chunk
> size * compound data type size) = 154,000 bytes to the HDD, which I would
> expect to perform at least 5x better.
>
> Is my understanding correct?

  Yes.

> Does writing one record at a time cause overhead? If it does, where is the
> overhead coming from?

  Take a look at the "add_records" code at this URL:

http://svn.hdfgroup.uiuc.edu/hdf5/branches/revise_chunks/test/swmr_writer.c

for an example of efficiently appending new elements to the end of a 1-D dataset. If you run that program and its performance is about the same as your application's, you are probably seeing a limitation of the storage system you are dealing with.
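
  As a side note, since default cache settings came up: the raw data chunk
cache for a dataset can be resized through its access property list (the 1.8.x
defaults are 1 MB and 521 slots). A rough sketch, with illustrative values and
dataset name:

#include "hdf5.h"

/* Open a dataset with a larger chunk cache than the default. */
hid_t open_with_bigger_chunk_cache(hid_t file)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);

    /* 2003 hash slots, 16 MB of cached chunks; w0 = 1.0 tells the cache to
     * evict fully written chunks first, which suits write-once appending. */
    H5Pset_chunk_cache(dapl, 2003, 16 * 1024 * 1024, 1.0);

    hid_t dset = H5Dopen2(file, "records", dapl);
    H5Pclose(dapl);
    return dset;
}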

  Quincey

···

On Aug 5, 2011, at 11:09 PM, Charles Darwin wrote:

Speaking of h5perf_serial, I ran it (v1.8.6) on my test machine (Windows 2008, 32-bit) and I'm seeing strange results for the HDF5 Read measurement. For a single run, I see 0.0 MB/s, 19 MB/s, or similar. When I run many iterations, I get something like Maximum Throughput at 0.0 MB/s, Average Throughput at 72 MB/s, and Minimum Throughput at 19 MB/s, i.e. the reported maximum is below the minimum. I'm guessing a couple of values aren't really being set for this case. The results for the other cases do look believable.

Scott

···

-----Original Message-----
From: hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-bounces@hdfgroup.org] On Behalf Of Mike Davis
Sent: Monday, August 15, 2011 8:47 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Writing Performance


