HDF5 read/write performance tuning

I've been experimenting to determine whether there is an optimal read/write size and, if so, what it is, but I've run into questions I haven't been able to find answers to... Hopefully someone here can provide them. :)

Our data will be stored in datasets consisting of anywhere from ~1K to ~10M records of a relatively small (~170-byte) compound type. My read tests suggest that the optimal number of records to read from the file at once is between 2048 and 4096. The data was compressed for these tests.

What drives that optimal read size? For these tests, only one dataset was written, so I assume that the data was all contiguous.
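
For reference, the reads I'm timing look roughly like the sketch below (assuming a simple 1-D dataset of a compound type; `record_t`, the function name, and the batch size are just placeholders, not our actual code):

```c
#include "hdf5.h"

/* Placeholder standing in for our ~170-byte compound record (assumption). */
typedef struct { double t; char payload[162]; } record_t;

/* Read 'count' records starting at 'start' from a 1-D compound dataset. */
static herr_t read_batch(hid_t dset, hid_t memtype,
                         hsize_t start, hsize_t count, record_t *buf)
{
    hid_t fspace = H5Dget_space(dset);                /* file dataspace   */
    hid_t mspace = H5Screate_simple(1, &count, NULL); /* memory dataspace */

    /* Select the batch [start, start+count) in the file. */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);

    herr_t status = H5Dread(dset, memtype, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    return status;
}
```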

Chunking puzzles me. At first I thought the chunk size was a number of bytes (I haven't found any documentation that explicitly says whether it's a number of bytes, a number of records, or something else), but now I'm not sure. Again, I ran some experiments and found a bit of extra overhead with a chunk size of 1, but not much difference between chunk sizes of 128, 512, and 2048 (in terms of writing speed, that is; there's definitely a difference in file size). However, when I tried the same test with a chunk size of 10240, it slowed down enough that I didn't bother letting it finish. After playing around a bit more, it appears the largest chunk size I can pick (in whatever units it happens to be in) and still have the test complete in a reasonable time frame is 6553; processing time increases by two orders of magnitude going from 6553 to 6554.

So what drives the optimal chunk size, if your concerns are 1) reading quickly and 2) writing quickly, in that order? Obviously the files are a lot smaller with the larger chunk sizes, but why does processing time suddenly skyrocket going from 6553 to 6554? And what units is the chunk size specified in?

Thanks for any answers!

Hi,

  The chunk size is specified in elements of the dataset's datatype, not in bytes.

If you use a chunk size of 1, the library will create as many chunks as there
are elements in the dataset, so you end up with more overhead than data.

Usually you don't even need chunked datasets, but they are required if you
want to compress the data, and they help when the entire dataset doesn't fit
into RAM or when you have an access pattern that benefits from reading and
writing in chunks.
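
For illustration, here is a minimal sketch of creating a chunked, deflate-compressed 1-D dataset; the names and the chunk/compression values are just examples, and `memtype` is assumed to be your compound datatype. Note that the chunk dimension is given in elements, not bytes:

```c
#include "hdf5.h"

/* 'file' and 'memtype' (the compound datatype) are assumed to exist already. */
hid_t make_chunked_dataset(hid_t file, hid_t memtype)
{
    hsize_t dims[1]    = {0};              /* start empty                    */
    hsize_t maxdims[1] = {H5S_UNLIMITED};  /* extendible along one dimension */
    hsize_t chunk[1]   = {2048};           /* chunk size: 2048 ELEMENTS      */

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

    H5Pset_chunk(dcpl, 1, chunk);          /* chunking: required for filters */
    H5Pset_deflate(dcpl, 6);               /* gzip compression, level 6      */

    hid_t dset = H5Dcreate2(file, "records", memtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```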

Why it degrades so badly going from 6553 to 6554 I don't know; it might depend
on the size of the entire dataset and on how chunk accesses are mapped onto
the full set. It might just be a very inefficient mapping.
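
One guess worth checking: the library's raw-data chunk cache defaults to 1 MiB, and a chunk that no longer fits in the cache can be evicted and recompressed on every partial access. If your record is closer to 160 bytes than 170, a chunk of 6553 elements lands just under that 1 MiB limit and 6554 just over it, which would match what you're seeing. If that's the cause, enlarging the cache should help; here is a sketch using the 1.8 dataset-access property list (the slot count and 16 MiB size are arbitrary example values, not recommendations):

```c
#include "hdf5.h"

/* Sketch: enlarge the per-dataset raw chunk cache before opening a dataset. */
hid_t open_with_bigger_cache(hid_t file, const char *name)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);

    /* nslots (ideally a prime), cache size in bytes, default preemption. */
    H5Pset_chunk_cache(dapl, 12421, 16 * 1024 * 1024,
                       H5D_CHUNK_CACHE_W0_DEFAULT);

    hid_t dset = H5Dopen2(file, name, dapl);
    H5Pclose(dapl);
    return dset;
}
```

The same dapl can also be passed as the last argument of H5Dcreate2 so the larger cache is in effect while writing.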

  Werner

