I've recently started using HDF5 to store ~2.5 GB datasets, with the hope of
moving to much larger datasets in the future. However, accessing the data is
slower than I had hoped, and I'm wondering whether I'm going about things the
wrong way.
The data is a 3D image set: ~3.4 billion voxels, with dimensions of
1760x1024x1878 (X x Y x Z). The data is accessed along all three dimensions
to create 2D images. For example, I might take the slice at Z = 325 to get a
2D image that is 1760x1024, or the slice at X = 325 to get an image that is
1024x1878.
I'm storing the data in the HDF5 file as a 4D array (the 4th dimension holding
the RGB values for each voxel), with a chunk size of 32x32x32x3, and the data
is gzip-compressed. The file was written with PyTables and is read back from
C++. The metadata output from HDFView looks like:
8-bit unsigned character, 1878 x 1024 x 1760 x 3
Number of attributes = 3
CLASS = CARRAY
VERSION =1.0
TITLE =
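In HDF5 C API terms, the chunking and compression settings correspond roughly
to a dataset created like this (just a sketch for illustration; the actual
file was written with PyTables, and the file name, dataset name, and gzip
level below are placeholders):

#include "hdf5.h"

int main()
{
    // Same layout as the real dataset: 1878 x 1024 x 1760 x 3,
    // 8-bit unsigned, 32x32x32x3 chunks, gzip compression.
    hsize_t dims[4]  = {1878, 1024, 1760, 3};
    hsize_t chunk[4] = {32, 32, 32, 3};

    hid_t file  = H5Fcreate("volume.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(4, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 4, chunk);   // 32x32x32x3 chunks
    H5Pset_deflate(dcpl, 6);        // gzip; level 6 is a placeholder

    hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_UCHAR, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}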
I'm accessing the data using hyperslabs, much like the example code. I can
provide code snippets if necessary.
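Roughly, the Z-slice read looks like the following sketch (simplified; the
file and dataset names are placeholders and error checking is omitted):

#include <vector>
#include "hdf5.h"

int main()
{
    hid_t file = H5Fopen("volume.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/data", H5P_DEFAULT);

    // Select one full Z slice (Z = 325) from the 1878 x 1024 x 1760 x 3 dataset.
    hsize_t start[4] = {325, 0, 0, 0};
    hsize_t count[4] = {1, 1024, 1760, 3};

    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    // Matching in-memory dataspace and a contiguous buffer for the slice.
    hid_t memspace = H5Screate_simple(4, count, NULL);
    std::vector<unsigned char> slice(1024 * 1760 * 3);

    H5Dread(dset, H5T_NATIVE_UCHAR, memspace, filespace, H5P_DEFAULT, slice.data());

    // ... turn 'slice' into the 1760x1024 RGB image ...

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}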
The problem is the time it takes to read a single slab in each direction; it
is quite a bit slower than I had hoped:
X slab: 2.17553 seconds
Y slab: 4.19333 seconds
Z slab: 3.09807 seconds
This is on OS X, with a 2.8 GHz i7 processor and 8 GB of memory. I saw
similar results when I ran tests on an EC2 Linux box.
I didn't find anything on the HDF5 website (or elsewhere) describing best
practices for laying out data like this. Is there a better way to structure
it? Is there some trick I'm missing? My biggest concern is moving to larger
datasets, and whether access will slow down even further as the data grows.
Thanks for any help!
Eric Reid