Writing to a dataset with 'wrong' chunksize

Hi,

Some time ago, a PyTables user complained that the following simple
operation was hogging gigantic amounts of memory:

import tables, numpy
N = 600
f = tables.openFile('foo.h5', 'w')
f.createCArray(f.root, 'huge_array',
               tables.Float64Atom(),
               shape = (2,2,N,N,50,50))
for i in xrange(50):
    for j in xrange(50):
        f.root.huge_array[:,:,:,:,j,i] = \
            numpy.array([[1,0],[0,1]])[:,:,None,None]

and I think that the problem could be on the HDF5 side.

The point is that, for the six-dimensional 'huge_array' dataset,
PyTables computed an 'optimal' chunkshape of (1, 1, 1, 6, 50, 50).
The user then wanted to update the array iterating over the trailing
dimensions (instead of over the leading ones, which is the recommended
practice for C-ordered arrays). This results in PyTables asking HDF5
to do the update using the traditional procedure:

/* Create a simple memory data space */
if ( (mem_space_id = H5Screate_simple( rank, count, NULL )) < 0 )
   return -3;

/* Get the file data space */
if ( (space_id = H5Dget_space( dataset_id )) < 0 )
  return -4;

/* Define a hyperslab in the dataset */
if ( rank != 0 && H5Sselect_hyperslab( space_id, H5S_SELECT_SET, start,
          step, count, NULL) < 0 )
  return -5;

if ( H5Dwrite( dataset_id, type_id, mem_space_id, space_id,
               H5P_DEFAULT, data ) < 0 )
   return -6;

While I understand that this approach is suboptimal (2*2*600*100 = 240,000
chunks have to be 'updated' for each update operation in the loop), I
don't completely understand why the user reports that the script consumes
so much memory (the script crashes, perhaps because it is asking for
several GB). My guess is that HDF5 is trying to load all the affected
chunks into memory before updating them, but I thought it best to report
this here in case this is a bug or, if not, in case the huge memory
demand can be somewhat alleviated.
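
For comparison, here is a minimal sketch (not taken from PyTables;
dataset_id, type_id, data and the loop indices i, j, k are assumed to be
set up as in the snippet above) of a chunk-aligned update that iterates
over the leading dimensions instead. Each such hyperslab maps to only
600/6 = 100 of the (1, 1, 1, 6, 50, 50) chunks, rather than 240,000:

/* Sketch only: update huge_array[i][j][k][:][:][:], which is aligned
   with the (1, 1, 1, 6, 50, 50) chunkshape along the last dimensions */
hsize_t start[6] = { i, j, k, 0, 0, 0 };
hsize_t count[6] = { 1, 1, 1, 600, 50, 50 };

if ( (mem_space_id = H5Screate_simple( 6, count, NULL )) < 0 )
   return -3;

if ( (space_id = H5Dget_space( dataset_id )) < 0 )
   return -4;

/* NULL stride and block default to 1, i.e. a contiguous selection */
if ( H5Sselect_hyperslab( space_id, H5S_SELECT_SET, start,
                          NULL, count, NULL ) < 0 )
   return -5;

if ( H5Dwrite( dataset_id, type_id, mem_space_id, space_id,
               H5P_DEFAULT, data ) < 0 )
   return -6;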

In case you need more information, you can find the details of the
discussion in the following thread:

http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg00722.html

Thanks!


--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi Francesc,

On Nov 23, 2007, at 2:06 PM, Francesc Altet wrote:

[snip]

  Is this with the 1.6.x library code? If so, it would be worthwhile checking with the 1.8.0 code, which is designed to do all the I/O on each chunk at once and then proceed to the next chunk. However, it does build information about the selection on each chunk to update, and if the I/O operation will update 240,000 chunks, that could be a large amount of memory...

  Quincey


So what should I choose to get the same default behavior as saying "gzip
file.dat" on the command line? I'm looking for a reasonable default for my
small library that users can fine-tune later.

Thanks,
Dominik
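
For what it's worth, a minimal sketch, assuming the question is about the
HDF5 deflate filter: the command-line gzip tool defaults to compression
level 6, so the corresponding setup on a dataset creation property list
(rank and chunk_dims are placeholders here) would be:

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);

/* The deflate (zlib) filter requires a chunked layout */
if ( H5Pset_chunk(dcpl, rank, chunk_dims) < 0 )
    return -1;

/* Level 6 matches the default of the command-line gzip tool */
if ( H5Pset_deflate(dcpl, 6) < 0 )
    return -2;

/* ... then pass dcpl as the dataset creation property list to H5Dcreate ... */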

On Tuesday 27 November 2007 13:39:51, Quincey Koziol wrote:

[snip]

--
Dominik Szczerba, Ph.D.
Computer Vision Lab CH-8092 Zurich
http://www.vision.ee.ethz.ch/~domi


Hi Quincey,

On Tuesday, 27 November 2007, Quincey Koziol wrote:
[snip]

Yes, this was using the 1.6.x library. I've directed the user to compile
PyTables with the latest 1.8.0 (beta5) library (with
the "--with-default-api-version=v16" flag), but he is reporting
problems. Here is the relevant excerpt from the traceback:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1210186064 (LWP 12304)]
0xb7b578b1 in H5S_close (ds=0xbfb11178) at H5S.c:464
464 H5S_SELECT_RELEASE(ds);
(gdb) bt
#0 0xb7b578b1 in H5S_close (ds=0xbfb11178) at H5S.c:464
#1 0xb7a0ab4e in H5D_destroy_chunk_map (fm=0xbfb0fff8) at H5Dio.c:2651
#2 0xb7a0b04c in H5D_create_chunk_map (fm=0xbfb0fff8,
    io_info=<value optimized out>, nelmts=1440000, file_space=0x84bd140,
    mem_space=0x84b40f0, mem_type=0x8363000) at H5Dio.c:2556
#3 0xb7a0cd1a in H5D_chunk_write (io_info=0xbfb13c24, nelmts=1440000,
    mem_type=0x8363000, mem_space=0x84b40f0, file_space=0x84bd140,
    tpath=0x8363e30, src_id=50331970, dst_id=50331966, buf=0xb57b8008)
    at H5Dio.c:1765
#4 0xb7a106f9 in H5D_write (dataset=0x840a418, mem_type_id=50331970,
    mem_space=0x84b40f0, file_space=0x84bd140, dxpl_id=167772168,
    buf=0xb57b8008) at H5Dio.c:732
#5 0xb7a117aa in H5Dwrite (dset_id=83886080, mem_type_id=50331970,
    mem_space_id=67108874, file_space_id=67108875, plist_id=167772168,
    buf=0xb57b8008) at H5Dio.c:434

We don't have time right now to look into it, but it could be a problem
with the PyTables code (although, if the "--with-default-api-version=v16"
flag is working properly, this should not be the case). It is strange,
because PyTables used to work perfectly with HDF5 up to 1.8.0 beta3 (i.e.
all tests passed).

If we make more progress on this issue, I'll let you know.

Thanks!


--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"


Hi Francesc,

On Nov 28, 2007, at 10:44 AM, Francesc Altet wrote:

[snip]

> We don't have time right now to look into it, but it could be a problem
> with the PyTables code (although, if the "--with-default-api-version=v16"
> flag is working properly, this should not be the case). It is strange,
> because PyTables used to work perfectly with HDF5 up to 1.8.0 beta3 (i.e.
> all tests passed).

  Hmm, I have been working on that section of code a lot, so it's certainly possible that I've introduced a bug. :-/

> If we make more progress on this issue, I'll let you know.

  If you can characterize it in a standalone program, that would be really great!

  Thanks,
    Quincey


Quincey,

On Thursday, 29 November 2007, Quincey Koziol wrote:

>   Hmm, I have been working on that section of code a lot, so it's
> certainly possible that I've introduced a bug. :-/
>
>   If you can characterize it in a standalone program, that would be
> really great!

I've done this in the attached program. It works as it is, but set N to
600 and you will get the segfault using 1.8.0 beta5 (sorry, I'm in a
hurry and don't have time to check other HDF5 versions).

Cheers,

write-bug.c (2.31 KB)
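
The attachment itself is not reproduced in the archive; the listing below
is only a rough sketch of a reproducer along the lines described above
(identifiers, the output file name and constants are assumptions, and
error checking is omitted), not the actual write-bug.c:

#include "hdf5.h"
#include <stdlib.h>

#define N 600            /* the segfault was reported with N = 600 */

int main(void)
{
    hsize_t dims[6]   = { 2, 2, N, N, 50, 50 };
    hsize_t chunks[6] = { 1, 1, 1, 6, 50, 50 };
    hsize_t start[6]  = { 0, 0, 0, 0, 0, 0 };
    hsize_t count[6]  = { 2, 2, N, N, 1, 1 };   /* huge_array[:,:,:,:,0,0] */

    hid_t file = H5Fcreate("write-bug.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 6, chunks);

    hid_t file_space = H5Screate_simple(6, dims, NULL);
    hid_t dset = H5Dcreate(file, "huge_array", H5T_NATIVE_DOUBLE,
                           file_space, dcpl);      /* 1.6-style API */

    /* One iteration of the PyTables loop: 2*2*N*N doubles, matching the
       nelmts=1440000 seen in the backtrace above */
    double *data = (double *) calloc((size_t)2 * 2 * N * N, sizeof(double));
    hid_t mem_space = H5Screate_simple(6, count, NULL);

    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mem_space, file_space,
             H5P_DEFAULT, data);

    free(data);
    H5Sclose(mem_space);
    H5Sclose(file_space);
    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Fclose(file);
    return 0;
}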


--

0,0< Francesc Altet http://www.carabos.com/

V V Cárabos Coop. V. Enjoy Data
"-"

Hi Francesc,

On Dec 3, 2007, at 11:21 AM, Francesc Altet wrote:

> On Monday, 03 December 2007, Francesc Altet wrote:
>
> > Oops, I ended up with a similar program and sent it to the
> > hdf-forum@hdfgroup.org list last Saturday. I'm attaching my own
> > version (which is pretty similar to yours). Sorry for not sending
> > you a copy of my previous message; it could have saved you some
> > work :-/
>
> Well, as Ivan pointed out, a couple of glitches slipped into my
> program. I'm attaching the corrected version, but the result is the
> same: when N = 600, I'm getting a segfault both under HDF5 1.6.5 and
> 1.8.0 beta5.
>
> Cheers,
>
> <write-bug2.c>

  OK, I'll try to duplicate the bug here and fix it today.

    Thanks,
      Quincey
