Hi Douglas,
On Tuesday 04 November 2008, Douglas Eck wrote:
> Hi all,
> I have an hdf file containing this array :
>
> DATASET "bfeat_mcmc_array" {
> DATATYPE H5T_IEEE_F64LE
> DATASPACE SIMPLE { ( 360, 6109666 ) / ( 360, H5S_UNLIMITED ) }
> ATTRIBUTE "CLASS" {
> DATATYPE H5T_STRING {
> STRSIZE 7;
> STRPAD H5T_STR_NULLTERM;
> CSET H5T_CSET_ASCII;
> CTYPE H5T_C_S1;
> }
> DATASPACE SCALAR
> }
>
>
> I need to access it both row-wise and column-wise. I would like to
> store a transposed version (size 6109666 x 360) to make it easier to
> read out a single vector of size 6109666 x 1
>
> What's the best way to do this?
I don't think you need to transpose the dataset: why not just copy it with
a sensible chunkshape in the destination? As you are using PyTables, in the
forthcoming 2.1 version [1] the .copy() method of leaves will support the
'chunkshape' argument, so this can be done quite easily:
# 'a' is the original 'dim1xdim2' dataset
newchunkshape = (1, a.chunkshape[0]*a.chunkshape[1])
b = a.copy(f.root, "b", chunkshape=newchunkshape)
# 'b' contains the dataset with an optimized chunkshape for reading rows
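With that copy in place, reading out one of your 6109666 x 1 vectors is just
a row access on the new dataset, e.g.:
vector = b[5]      # one full row of 'b', i.e. a 6109666-element array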
And that's all. As I was curious about the improvement of this approach, I
created a small benchmark (attached); here are the results for your dataset:
================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 279.564 sec (60.0 MB/s)
Time to read ten rows in original array: 945.315 sec (0.5 MB/s)
================================
Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 611.177 sec (27.5 MB/s)
Time to read with a row-wise chunkshape: 33.877 sec (13.8 MB/s)
================================
Speed-up with a row-wise chunkshape: 27.9
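(The benchmark script itself is in the attachment; just to give an idea of
what it measures, the row-reading part is basically something like the
snippet below. The file and node names here are made up, not the ones used
in the attached script.)
import time
import tables

f = tables.openFile("/tmp/test.h5", "r")
a = f.root.a                     # the original (360, 6109666) dataset
t0 = time.time()
for i in xrange(10):             # read ten rows, as in the figures above
    row = a[i]
print "Time to read ten rows: %.3f sec" % (time.time() - t0)
f.close()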
Mmh, it seems like I'm not getting the most out of my disk here (14 MB/s is
too low). Perhaps this is the effect of the large HDF5 hash table used for
accessing the actual data. To confirm this, I chose a chunkshape 10x larger,
up to 1.2 MB (which makes the HDF5 hash table smaller). Here is the new
result:
<snip>
Chunkshape for row-wise chunkshape array: (1, 162000)
Time to copy the original array: 379.388 sec (44.2 MB/s)
Time to read with a row-wise chunkshape: 8.469 sec (55.0 MB/s)
================================
Speed-up with a row-wise chunkshape: 111.6
Ok, now I'm getting decent performance for the new dataset. It is also worth
noting that the copy speed has been accelerated by about 60% (I'd say it is
pretty optimal now). I've also tried bigger chunksizes, but then the
performance drops quite a lot. Definitely, a chunkshape of (1, 162000),
allowing for a speed-up of more than 100x over the original setting, seems
good enough for this case.
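For completeness, getting that chunkshape with the .copy() approach above is
just a matter of passing it explicitly (same 'a' and 'f' as in the first
snippet):
# 162000 float64 elements per chunk is ~1.2 MB
b = a.copy(f.root, "b", chunkshape=(1, 162000))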
Incidentally, you could always do the copy manually:
# 'a' is the original 'dim1xdim2' dataset
b = f.createEArray(f.root, "b", tables.Float64Atom(),
                   shape=(0, dim2), chunkshape=(1, 162000))
for i in xrange(dim1):
    b.append([a[i]])
# 'b' contains the dataset with an optimized chunkshape for reading rows
but this method is much more expensive (perhaps more than 10x slower) than
using the .copy() method, because the I/O is optimized during copy
operations. However, I'd say that the read throughput would be similar.
> I'd love to see some magic utility such as
>
>
> hdtranspose --dataset bfeat_mcmc_array in.h5 out.h5
In the forthcoming PyTables 2.1 [1] you will be able to do:
$ ptrepack /tmp/test.h5:/a /tmp/test2.h5:/a
$ ptrepack --chunkshape='(1, 162000)' /tmp/test.h5:/a /tmp/test2.h5:/b
and then use the test2.h5 for your purposes.
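Reading one of your 6109666 x 1 vectors from the repacked file is then the
usual PyTables access (a minimal sketch; adjust the paths and node names to
your own files):
import tables

f = tables.openFile("/tmp/test2.h5", "r")
b = f.root.b            # the copy with the (1, 162000) chunkshape
vector = b[10]          # one full row, i.e. a 6109666-element vector
f.close()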
[1] http://www.pytables.org/download/preliminary/
Hope that helps,
--
Francesc Alted