Transposing large array

Hi all,
I have an HDF5 file containing this array:

DATASET "bfeat_mcmc_array" {
      DATATYPE H5T_IEEE_F64LE
      DATASPACE SIMPLE { ( 360, 6109666 ) / ( 360, H5S_UNLIMITED ) }
      ATTRIBUTE "CLASS" {
         DATATYPE H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
         DATASPACE SCALAR
      }

I need to access it both row-wise and column-wise. I would like to store a
transposed version (size 6109666 x 360) to make it easier to read out a
single vector of size 6109666 x 1.

What's the best way to do this? I'd love to see some magic utility such as:

hdtranspose --dataset bfeat_mcmc_array in.h5 out.h5

:-)

I'm using PyTables but could also do a quick C program if necessary.

Thanks,
Doug Eck

Dr. Douglas Eck, Associate Professor
Université de Montréal, Department of Computer Science / BRAMS
CP 6128, Succ. Centre-Ville Montréal, Québec H3C 3J7 CANADA
Office: 3253 Pavillion Andre-Aisenstadt
Phone: 1-514-343-6111 ext 3520 Fax: 1-514-343-5834
http://www.iro.umontreal.ca/~eckdoug
Research Areas: Machine Learning and Music Cognition

Do you mean quick as in runs fast because you have a bunch of those arrays,
or quick as in easy to code because you have one array and want to spend
less time coding than it will take to run the program for that one array?

There has been a lot of work on implementing transpose for arrays larger
than the working set (on parallel hardware). IDL and MATLAB use such
approaches, but I'm not sure how widespread they are in free tools. For
specific architectures you can find very low-level libraries (e.g. Intel TBB,
IBM HTA) that can be used to (slowly) build quick-running programs.
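
As a rough illustration of the brute-force alternative (not something proposed in this thread), a blocked transpose can also be written directly with PyTables; the file and node names, block size and output layout below are assumptions, and this is only a sketch:

import tables

BLOCK = 50000   # source columns per pass; about 360*50000*8 bytes = ~140 MB in memory at a time

fin = tables.openFile("in.h5", "r")
a = fin.root.bfeat_mcmc_array                 # shape (360, 6109666), float64
nrows, ncols = a.shape

fout = tables.openFile("out.h5", "w")
at = fout.createEArray(fout.root, "bfeat_mcmc_array_T", tables.Float64Atom(),
                       shape=(0, nrows), expectedrows=ncols)

for j in xrange(0, ncols, BLOCK):
    slab = a[:, j:j + BLOCK]                  # (360, <=BLOCK) slab of source columns
    at.append(slab.transpose())               # appended as (<=BLOCK, 360) rows of the transpose

fout.close()
fin.close()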

On Tue, Nov 4, 2008 at 4:25 PM, Douglas Eck <eckdoug@iro.umontreal.ca> wrote:

[snip]

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi Douglas,

On Tuesday, 4 November 2008, Douglas Eck wrote:

[snip]

I need to access it both row-wise and column-wise. I would like to
store a transposed version (size 6109666 x 360) to make it easier to
read out a single vector of size 6109666 x 1

What's the best way to do this.

I don't think you need to transpose the dataset: why not just copy it with a sensible chunkshape in the destination? As you are using PyTables, in the forthcoming 2.1 version [1] the .copy() method of leaves will support the 'chunkshape' argument, so this could be done quite easily:

# 'a' is the original 'dim1xdim2' dataset
newchunkshape = (1, a.chunkshape[0]*a.chunkshape[1])
b = a.copy(f.root, "b", chunkshape=newchunkshape)
# 'b' contains the dataset with an optimized chunkshape for reading rows

And that's all. As I was curious about the improvement from this approach, I created a small benchmark (attached); here are the results for your dataset:

bench-chunksize.py (1.6 KB)

================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 279.564 sec (60.0 MB/s)
Time to read ten rows in original array: 945.315 sec (0.5 MB/s)

Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 611.177 sec (27.5 MB/s)
Time to read with a row-wise chunkshape: 33.877 sec (13.8 MB/s)

Speed-up with a row-wise chunkshape: 27.9

Mmh, it seems like I'm not getting the most out of my disk here (14 MB/s is too low). Perhaps this is the effect of the large HDF5 hash table needed to access the actual data. To confirm this, I chose a chunkshape 10x larger, up to 1.2 MB (which makes the HDF5 hash table smaller). Here is the new result:

<snip>
Chunkshape for row-wise chunkshape array: (1, 162000)
Time to copy the original array: 379.388 sec (44.2 MB/s)
Time to read with a row-wise chunkshape: 8.469 sec (55.0 MB/s)

Speed-up with a row-wise chunkshape: 111.6

Ok, now I'm getting decent performance for the new dataset. It is also worth noting that the copy speed has been accelerated by 60% (I'd say it is pretty close to optimal now). I've also tried out bigger chunksizes, but the performance drops quite a lot. Definitely, a chunkshape of (1, 162000), allowing for more than a 100x speed-up over the original setting, seems good enough for this case.

Incidentally, you could always do the copy manually:

# 'a' is the original 'dim1xdim2' dataset
b = f.createEArray(f.root, "b", tables.Float64Atom(),
                   shape = (0, dim2), chunkshape=(1, 162000))
for i in xrange(dim1):
    b.append([a[i]])
# 'b' contains the dataset with an optimized chunkshape for reading rows

but this method is much more expensive (perhaps more than 10x) than using the .copy() method, because the I/O is optimized during copy operations. However, I'd say that the read throughput of the result would be similar.
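
For reference, the whole .copy() approach could look like this end to end; a minimal sketch assuming PyTables 2.1 and the file/node names from the original post (the destination node name is made up):

import tables

f = tables.openFile("bfeat_mcmc.h5", "a")     # file holding the (360, 6109666) EArray
a = f.root.bfeat_mcmc_array

# one chunk of the copy spans a whole "chunk-row" worth of elements
newchunkshape = (1, a.chunkshape[0] * a.chunkshape[1])
b = a.copy(f.root, "bfeat_mcmc_array_rowwise", chunkshape=newchunkshape)

# reading one full 6109666-element vector is now a mostly sequential read
vec = b[0, :]

f.close()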

I'd love to see some magic utility such as

hdtranspose --dataset bfeat_mcmc_array in.h5 out.h5

In the forthcoming PyTables 2.1 [1] you will be able to do:

$ ptrepack /tmp/test.h5:/a /tmp/test2.h5:/a
$ ptrepack --chunkshape='(1, 162000)' /tmp/test.h5:/a /tmp/test2.h5:/b

and then use the test2.h5 for your purposes.

[1] http://www.pytables.org/download/preliminary/

Hope that helps,

--
Francesc Alted

On Wednesday, 5 November 2008, Francesc Alted wrote:
[snip]

I was hooked on this and tried yet another chunkshape: (1, 324000), which doubles the size of the previous one, (1, 162000). Here are the results:

================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 275.9 sec (60.8 MB/s)
Time to read ten rows in original array: 945.315 sec (0.5 MB/s)

Chunkshape for row-wise chunkshape array: (1, 324000)
Time to copy the original array: 284.88 sec (58.9 MB/s)
Time to read with a row-wise chunkshape: 3.508 sec (132.9 MB/s)

Speed-up with a row-wise chunkshape: 269.5

So, by doubling the chunkshape, you can get a further improvement of
almost 3x in read speed. Also, the copy is very efficient now (and
very close to creating the dataset anew, which is a bit
counter-intuitive :-/).

Most probably you could find a better figure by playing with other
values. This is to say that, although PyTables provides its own
guesses for chunkshapes (based on the estimated sizes of datasets), in
general there is no replacement for running your own experiments in
order to determine the chunkshape that works best for you.
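
A minimal sketch of such an experiment, in the spirit of bench-chunksize.py (the file names and node path below are illustrative, not taken from the attached script):

import time
import tables

def time_row_reads(filename, nodepath, nrows=10):
    # time reading a few complete rows from a 2-D dataset
    f = tables.openFile(filename, "r")
    arr = f.getNode(nodepath)
    t0 = time.time()
    for i in xrange(nrows):
        row = arr[i, :]
    elapsed = time.time() - t0
    f.close()
    return elapsed

# one copy of the array per candidate chunkshape (hypothetical file names)
for fn in ("rowwise_162000.h5", "rowwise_324000.h5"):
    print fn, ":", time_row_reads(fn, "/bfeat_mcmc_array"), "sec"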

Cheers,

--
Francesc Alted


Hi Francesc, Forum

First, thank you very much for your help. I am now beginning to understand
chunking, thanks to you!
I am trying your recommendations and having some problems. I am trying to
copy bfeat_mcmc.h5 to bfeat_mcmc_ptrepack.h5 using your recommended
approach. The file bfeat_mcmc.h5 contains a table bfeat_mcmc_table and an
array bfeat_mcmc_array. The table is small and is not an issue here. The
array looks like this:
Existing array bfeat_mcmc_array is type <class 'tables.earray.EArray'> shape
(360, 6109666) chunkshape (360, 2)
I don't know why the chunkshape is (360, 2), but I now understand that this is
not a good chunkshape.
The attached file bfeat_mcmc.txt contains the output from h5dump -H
bfeat_mcmc.h5

I installed v2.1rc1
629 eckdoug@cerveau /part/02/sans-bkp/sitm/data/features>ipython
In [1]: import tables
In [2]: tables.__version__
Out[2]: '2.1rc1'

When I run ptrepack I get this error:

638 eckdoug@cerveau /part/02/sans-bkp/sitm/data/features>ptrepack --overwrite-nodes --chunkshape='(1,324000)' bfeat_mcmc.h5:/bfeat_mcmc_array bfeat_mcmc_ptrepack.h5:/bfeat_mcmc_array
Problems doing the copy from 'bfeat_mcmc.h5:/bfeat_mcmc_array' to 'bfeat_mcmc_ptrepack.h5:/bfeat_mcmc_array'
The error was --> <type 'exceptions.TypeError'>: _g_copyWithStats() got an unexpected keyword argument 'propindexes'
The destination file looks like:
bfeat_mcmc_ptrepack.h5 (File) ''
Last modif.: 'Thu Nov 6 10:05:57 2008'
Object Tree:
/ (RootGroup) ''

Traceback (most recent call last):
  File "/u/eckdoug/share/bin/ptrepack", line 3, in <module>
    main()
  File "/u/eckdoug/share/lib64/python2.5/site-packages/tables/scripts/ptrepack.py", line 483, in main
    upgradeflavors=upgradeflavors)
  File "/u/eckdoug/share/lib64/python2.5/site-packages/tables/scripts/ptrepack.py", line 140, in copyLeaf
    raise RuntimeError, "Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired."
RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired.

When I run the following code in Python, I let it run for ten hours and it still
hadn't written even 0.1% of the new array. I checked memory usage; it wasn't a
problem. So I suppose I was doing something wrong:
def rechunk_cache_file_h5(h5_in_fn, h5_out_fn, feattype, chunkshape=(1, 324000)):
    """Rechunks an existing file to enable fast column rather than row lookups."""
    print 'Working with tables', tables.__file__, 'version', tables.__version__
    h5in = tables.openFile(h5_in_fn, 'r')
    array_name = '%s_array' % feattype
    table_name = '%s_table' % feattype
    arr = h5in.getNode(h5in.root, array_name)
    tbl = h5in.getNode(h5in.root, table_name)
    print 'Opening', h5_out_fn
    h5out = tables.openFile(h5_out_fn, 'w')
    print 'Copying table', table_name
    newtbl = tbl.copy(h5out.root, table_name)
    print 'Existing array', array_name, 'is type', type(arr), 'shape', arr.shape, 'chunkshape', arr.chunkshape
    print 'Copying to', h5_out_fn, 'with chunkshape', chunkshape
    newarr = arr.copy(h5out.root, array_name, chunkshape=chunkshape)
    h5in.close()
    h5out.close()

I also tried recreating the original bfeat_mcmc.h5 using an appropriate
chunkshape. This was also slow, though I haven't checked closely how long it
will take. Here's the code I used for that. This code is called in a loop.
The h5 file is opened outside of the loop and the file handle is passed in as
"h5". Each call of this function should write a matrix of size
[fcount=360, fdim] to the array '%s_array' % feattype and also update a
table which stores indexes.
def write_cache_h5(h5, tid, dat, feattype='sfeat', chunkshape=(1, 324000), force=False):
    import tables
    (fcount, fdim) = shape(dat)
    array_name = '%s_array' % feattype
    table_name = '%s_table' % feattype
    if h5.root.__contains__(array_name):
        arr = h5.getNode(h5.root, array_name)
    else:
        filters = tables.Filters(complevel=1, complib='lzo')
        arr = h5.createEArray(h5.root, array_name, tables.FloatAtom(),
                              (fdim, 0), feattype, filters=filters, chunkshape=chunkshape)

    if h5.root.__contains__(table_name):
        tbl = h5.getNode(h5.root, table_name)
    else:
        tbl = h5.createTable('/', table_name, SFeat, table_name)

    idxs = tbl.getWhereList('tid == %i' % tid)
    if len(idxs) > 0:
        if force:
            # don't do anything here; we always look at the last record when
            # multiple are present for a tid
            pass
        else:
            # print 'TID', tid, 'is already present in', h5.filename, 'but force=False so not writing data'
            return
    rec = tbl.row
    rec['tid'] = tid
    rec['start'] = arr.shape[1]
    rec['fcount'] = fcount
    rec.append()
    for f in range(fcount):
        arr.append(dat[f, :].reshape([fdim, 1]))
    tbl.flush()
    arr.flush()

class SFeat(tables.IsDescription):
    tid = tables.Int32Col()     # track id
    start = tables.Int32Col()   # start idx
    fcount = tables.Int32Col()  # number of frames
    def __init__(self):
        self.tid.createIndex()

Thanks for any help!

bfeat_mcmc.txt (3.87 KB)

On Wed, Nov 5, 2008 at 9:02 AM, Francesc Alted <faltet@pytables.com> wrote:

[snip]

Ok I have some comparison times. If I create an array with chunkshape=(360,100):

fdim=360
arr = h5.createEArray(h5.root, array_name, tables.FloatAtom(), (fdim,0),
                      feattype, filters=filters, chunkshape=(360,100))

0 Processing /part/02/sans-bkp/sitm/data/features/1000/T1.mp3.h5
/u/eckdoug/share/lib64/python2.5/site-packages/tables/filters.py:258: FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead
  % (complib, default_complib), FiltersWarning )
File 1 of 100. Chunkshape (360, 100) processed 2 of size (52, 360) in 0.0122499465942
File 11 of 100. Chunkshape (360, 100) processed 12 of size (41, 360) in 0.0148320198059
File 21 of 100. Chunkshape (360, 100) processed 22 of size (62, 360) in 0.0179829597473
File 31 of 100. Chunkshape (360, 100) processed 32 of size (68, 360) in 0.016970872879
File 41 of 100. Chunkshape (360, 100) processed 42 of size (38, 360) in 0.00971698760986
File 51 of 100. Chunkshape (360, 100) processed 52 of size (59, 360) in 0.0128350257874
File 61 of 100. Chunkshape (360, 100) processed 62 of size (45, 360) in 0.00909495353699
File 71 of 100. Chunkshape (360, 100) processed 72 of size (40, 360) in 0.00970888137817
File 81 of 100. Chunkshape (360, 100) processed 82 of size (38, 360) in 0.00800800323486
File 91 of 100. Chunkshape (360, 100) processed 92 of size (33, 360) in 0.0119268894196
100 Processing /part/02/sans-bkp/sitm/data/features/1000/T101.mp3.h5
Breaking after 100
Total time 1.661028862

On the contrary, if I create a file with chunkshape=(1,324000) then the times are much worse:

653 eckdoug@cerveau /part/02/sans-bkp/sitm/data/features>cache_feat.py bfeat_mcmc testing4.h5
0 Processing /part/02/sans-bkp/sitm/data/features/1000/T1.mp3.h5
/u/eckdoug/share/lib64/python2.5/site-packages/tables/filters.py:258: FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead
  % (complib, default_complib), FiltersWarning )
File 1 of 100. Chunkshape (1, 324000) processed 2 of size (52, 360) in 448.759572029

As a baseline, here is the output from bench-chunksize.py:

596 eckdoug@cerveau /part/02/sans-bkp/sitm/data/features>python ~/test/bench-chunksize.py
Using tables from /u/eckdoug/test/tables/__init__.pyc version 2.1rc1

================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 323.5 sec (51.9 MB/s)
Time to read ten rows in original array: 1580.027 sec (0.3 MB/s)

Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 1253.463 sec (13.4 MB/s)
Time to read with a row-wise chunkshape: 15.353 sec (30.4 MB/s)

Speed-up with a row-wise chunkshape: 102.9

I seem to be doing something wrong but I can't see what....

Thanks!
Doug Eck

On Thursday, 6 November 2008, Douglas Eck wrote:

Hi Francesc, Forum

First, thank you very much for your help. [snip]
Existing array bfeat_mcmc_array is type <class 'tables.earray.EArray'> shape (360, 6109666) chunkshape (360, 2)
I don't know why the chunkshape is (360, 2), but I now understand that this is not a good chunkshape.

Maybe you forgot to pass the 'expectedrows' parameter in order to inform
PyTables about the expected size of the EArray (this is critical for a
decent automatic selection of the chunkshape). Look at bench-chunksize.py
and you will see that it is used there.
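
For illustration, a sketch of what that might look like (the output file name is hypothetical; the shape and expectedrows figures are the ones from this thread):

import tables

f = tables.openFile("bfeat_mcmc_new.h5", "w")
# shape (360, 0): extendable along the second axis, like the original array.
# expectedrows tells PyTables roughly how far that axis will grow, so its
# automatic chunkshape guess is based on the final size rather than a tiny one.
arr = f.createEArray(f.root, "bfeat_mcmc_array", tables.Float64Atom(),
                     shape=(360, 0), expectedrows=6109666)
print 'auto-chosen chunkshape:', arr.chunkshape
f.close()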

When I run ptrepack I get this error:
[snip]
The error was --> <type 'exceptions.TypeError'>: _g_copyWithStats() got an unexpected keyword argument 'propindexes'
[snip]

Yeah, I ran into this too. This is a bug that I solved yesterday:

http://www.pytables.org/trac/ticket/195

You can either download the trunk version, or apply the patch in:

http://www.pytables.org/trac/changeset/3893

When I run the following code in python I let it run for ten hours
and still hadn't written even 0.1% of the new array. I checked
memory usage. It wasn't a problem. So I was doing something wrong I
suppose: def
rechunk_cache_file_h5(h5_in_fn,h5_out_fn,feattype,chunkshape=(1,
324000)) :
    """rechunks an existing file to enable fast column rather than
row lookups"""
    print 'Working with
tables',tables.__file__,'version',tables.__version__ h5in =
tables.openFile(h5_in_fn,'r')
    array_name ='%s_array' % feattype
    table_name = '%s_table' % feattype
    arr = h5in.getNode(h5in.root,array_name)
    tbl = h5in.getNode(h5in.root,table_name)
    print 'Opening',h5_out_fn
    h5out = tables.openFile(h5_out_fn,'w')
    print 'Copying table',table_name
    newtbl = tbl.copy(h5out.root,table_name)
    print 'Existing array',array_name,'is
type',type(arr),'shape',arr.shape,'chunkshape',arr.chunkshape
    print 'Copying to',h5_out_fn,'with chunkshape',chunkshape
    newarr = arr.copy(h5out.root,array_name,chunkshape=chunkshape)
    h5in.close()
    h5out.close()

I don't see why this has to be slow. Are you perhaps using compression?
Also, which version of the HDF5 library are you using?
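
(One way to check both from Python, assuming PyTables 2.x:)

import tables

print 'PyTables:', tables.__version__
print 'HDF5:', tables.whichLibVersion('hdf5')   # version info for the underlying HDF5 library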

I also tried recreating the original bfeat_mcmc.h5 using an appropriate chunkshape. This was also slow. [snip]

Again, I don't see why this would be slow. Could you check whether you
can run my bench-chunksize.py at decent speeds on your system?

Cheers,

--
Francesc Alted


Maybe you forgot to pass the 'expectedrows' parameter in order to inform
PyTables about the expected size of the EArray (this is critical for a
decent automatic selection of the chunkshape). Look at bench-chunksize.py
and you will see that it is used there.

I tried with expectedrows and got similar results.

Yeah, I ran into this too. This is a bug that I solved yesterday:

http://www.pytables.org/trac/ticket/195

Great! I'll grab source from trunk.

I don't see why this has to be slow. Are you perhaps using compression?
Also, which version of the HDF5 library are you using?

I am using compression. I thought that was a good thing (?) Learning more
and more....

Again, I don't see why this would be slow. Could you check whether you
can run my bench-chunksize.py at decent speeds on your system?

596 eckdoug@cerveau /part/02/sans-bkp/sitm/data/features>python ~/test/bench-chunksize.py
Using tables from /u/eckdoug/test/tables/__init__.pyc version 2.1rc1

================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 323.5 sec (51.9 MB/s)
Time to read ten rows in original array: 1580.027 sec (0.3 MB/s)

Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 1253.463 sec (13.4 MB/s)
Time to read with a row-wise chunkshape: 15.353 sec (30.4 MB/s)

Speed-up with a row-wise chunkshape: 102.9

Turning off compression helps immensely. I will generate the full file with
recommended chunkshape and get back later with some details for the list.

Thanks!
Doug

Dr. Douglas Eck, Associate Professor
Université de Montréal, Department of Computer Science / BRAMS
CP 6128, Succ. Centre-Ville Montréal, Québec H3C 3J7 CANADA
Office: 3253 Pavillion Andre-Aisenstadt
Phone: 1-514-343-6111 ext 3520 Fax: 1-514-343-5834
http://www.iro.umontreal.ca/~eckdoug
Research Areas: Machine Learning and Music Cognition


On Thursday, 6 November 2008, Douglas Eck wrote:

Ok I have some comparison times. If I create an array with chunkshape=(360,100):
[snip]
FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead

Aha. So you were using compression.

File 1 of 100. Chunkshape (360, 100) processed 2 of size (52, 360) in 0.0122499465942
[snip]
On the contrary, if I create a file with chunkshape=(1,324000) then the times are much worse:
[snip]
File 1 of 100. Chunkshape (1, 324000) processed 2 of size (52, 360) in 448.759572029

Well, I'd say that the problem is the compression (all my previous
benchmarks were made without compression). More specifically, if
compression is used in PyTables, the 'shuffle' filter is activated, and
this filter is *extremely expensive* when you have very large
chunkshapes (as in your case). Try deactivating it with:

Filters(complevel=1, complib='zlib', shuffle=False)

and re-run your tests. If you still notice that 'zlib' is too slow, you
may want to use the 'lzo' compressor (but first you should install it
on your system so that PyTables can use it), which is far faster,
especially during the compression phase (decompression is also faster
than with 'zlib', but the difference is not as large, especially on
modern processors).

If the speed is still not good enough, then you should try suppressing
compression completely (Filters(complevel=0), which is the default).
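
A sketch of how those filter settings could be passed when creating the array (the output file name is made up; the shape and chunkshape are the ones discussed in this thread):

import tables

# zlib without shuffle -- it is the shuffle filter that gets expensive with huge chunks
filters = tables.Filters(complevel=1, complib='zlib', shuffle=False)
# with LZO installed, compression should be considerably faster:
# filters = tables.Filters(complevel=1, complib='lzo', shuffle=False)
# or no compression at all (the default):
# filters = tables.Filters(complevel=0)

f = tables.openFile("bfeat_mcmc_noshuffle.h5", "w")
arr = f.createEArray(f.root, "bfeat_mcmc_array", tables.Float64Atom(),
                     shape=(360, 0), filters=filters, chunkshape=(1, 324000))
# ... append data as before, then f.close()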

As a baseline, here is the output from bench-chunksize.py:
[snip]
Speed-up with a row-wise chunkshape: 102.9

Yeah, seems good (the read speed in the case of the row-wise chunkshape
can be increased if you set a larger chunkshape, as I pointed out in a
previous message, but your system seems ok).

Cheers,

--
Francesc Alted


On Thursday, 6 November 2008, Douglas Eck wrote:

Turning off compression helps immensely.

I don't think you need to turn off compression completely, just the
shuffle filter. In particular, the lzo compressor should still help you
get better performance (unless its performance degrades severely with
chunksizes as huge as the ones you are using).

I will generate the full
file with recommended chunkshape and get back later with some details
for the list.

Great.


--
Francesc Alted
