Reading across multiple chunks is very slow

Hi,

A PyTables user has reported a performance problem when reading a
dataset in some cases. I've tracked down the problem to the HDF5
library, as the output of the attached script reveals:

Time for creating dataset with dims {3, 1978, 1556, 288} --> 0.000000
Time for writing hyperslice {2, 1978, 1556, 2} --> 12.010000
Time for reading hyperslice {2, 1978, 1556, 1} --> 0.020000
Time for reading hyperslice {1, 1978, 1556, 2} --> 2.490000

[This dataset has a chunksize of: {1, 1978, 1556, 1}]

The problem is: why does it take 100x longer to read a hyperslice with
a count of {1, 1978, 1556, 2} than one with a count of {2, 1978, 1556, 1}?

I've been trying to figure out what's happening, but as I can't come
up with a clear explanation, I think this may be a bug in HDF5. I've
tried with HDF5 1.6.5, 1.8.2 and 1.8.2-post8, all with similar results.
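
For reference, here is a minimal sketch of the two reads being
compared, using the 1.8 API (the attached script is not reproduced
here, so the file name, dataset name, and element type are
illustrative assumptions):

    #include "hdf5.h"
    #include <stdlib.h>

    int main(void)
    {
        hid_t file = H5Fopen("test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/array", H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[4] = {0, 0, 0, 0};
        hsize_t fast[4] = {2, 1978, 1556, 1};  /* spans 2 chunks along dim 0 */
        hsize_t slow[4] = {1, 1978, 1556, 2};  /* spans 2 chunks along dim 3 */
        int *buf = malloc(2 * 1978 * 1556 * sizeof(int));

        /* fast case: the two chunks land back to back in buf */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, fast, NULL);
        hid_t mspace = H5Screate_simple(4, fast, NULL);
        H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace);

        /* slow case: same element count, but the two chunks interleave in buf */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, slow, NULL);
        mspace = H5Screate_simple(4, slow, NULL);
        H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace);

        free(buf);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }

Timing each H5Dread separately would correspond to the two read
timings above.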

Thanks,

read-performance-problem.c (3.82 KB)


--
Francesc Alted

Hi Francesc,

The only thing I can think of is that reading [1,1978,1556,2] requires much more data shuffling. Effectively, the two tiles have to be interleaved; in the other case the two tiles just have to be concatenated. But I doubt that costs so much more time.
Do you have the amount of user and system time it took?
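
For example, running the test under the shell's time builtin would
show both (binary name illustrative):

    $ time ./read-performance-problem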

Cheers,
Ger


Hi Francesc,

Noted. We will take a look. Thanks for reporting!

Elena


On Friday 20 March 2009, Ger van Diepen wrote:

Hi Francesc,

The only thing I can think of is that reading [1,1978,1556,2]
requires much more data shuffling. Effectively, the two tiles have to
be interleaved; in the other case the two tiles just have to be
concatenated. But I doubt that costs so much more time. Do you
have the amount of user and system time it took?

Here they are:

time for two leading indices [2,1978,1556,1] --> 0.0165
real 0m0.128s
user 0m0.076s
sys 0m0.052s

time for two trailing indices [1,1978,1556,2] --> 2.529
real 0m2.642s
user 0m0.868s
sys 0m1.764s

So, yeah, it seems that the second selection takes much more user time
and, especially, much more system time. However, reading trailing
indices one by one is much faster:

time for one trailing index (offset=0, count=1) --> 0.008329
real 0m0.122s
user 0m0.076s
sys 0m0.048s

time for one trailing index (offset=1, count=1) --> 0.00888
real 0m0.123s
user 0m0.092s
sys 0m0.028s

So, it definitely seems that something is suboptimal in HDF5 for this
use case.

Thanks,


--
Francesc Alted


On Sunday 22 March 2009, Elena Pourmal wrote:

Hi Francesc,

Noted. We will take a look. Thanks for reporting!

Thanks, Elena. By the way, this observation that Andrew Collette sent
to me privately (by mistake, I suppose) is also pertinent.

---------- Forwarded message ----------


Subject: Re: [hdf-forum] Reading across multiple chunks is very slow
Date: Thursday 19 March 2009
From: Andrew Collette <andrew.collette@gmail.com>
To: Francesc Alted <faltet@pytables.org>

This also seems pathologically slow; 12 seconds to write < 50MB? For
another data point, I tried changing the dimensions to:

{3, 288, 1978, 1556}

(and corresponding changes to the chunk sizes), and the whole program
now takes less than a second to run.
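
Presumably (Andrew's exact edit isn't shown, so this is an assumption)
the change looks like:

    hsize_t dims[4]  = {3, 288, 1978, 1556};   /* chunked axes moved last */
    hsize_t chunk[4] = {1, 1, 1978, 1556};     /* chunk shape reordered the same way */

With this layout each chunk covers the two fastest-varying dimensions,
so slicing along the second axis reads whole chunks one after another.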

Andrew Collette

--
Francesc Alted


You could use strace to see which system calls are being made.
I used NumPy to interleave 2 arrays in the way HDF5 has to do it, but
that is more or less instantaneous.
So I agree that HDF5 must be doing something bad.
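
For example (binary name illustrative):

    $ strace -c ./read-performance-problem                   # per-syscall counts/times
    $ strace -e trace=read,lseek ./read-performance-problem  # log each read and seek

A long run of tiny read/lseek pairs in the slow case would confirm
that HDF5 is issuing many small reads.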

Ger


Francesc,

This is happening because (in the "time for two trailing indices" case) the individual chunks are not contiguous in memory, as Ger pointed out. Also, because the chunks are larger than the chunk cache size (default = 1 MB), the library makes a best effort to avoid allocating enough memory to hold an entire chunk, and therefore reads directly from the file into the supplied read buffer. Because the selection in the read buffer (for each chunk) is a series of small non-contiguous blocks, the library must make a large number of small reads.

To improve performance, you can increase the chunk cache size with H5Pset_cache (or the new H5Pset_chunk_cache function if you're using the latest snapshot). The test runs in about 0.7 seconds with this change on my laptop, down from ~30 seconds. This is still more time than the contiguous case takes, because the library must allocate the extra space and scatter each element individually from the cache to the read buffer, but it now calls read only once for each chunk.
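
For reference, a minimal sketch of the first option (file name and
slot count illustrative; H5Pset_cache's mdc_nelmts argument is ignored
in 1.8.x):

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* args: mdc_nelmts (unused in 1.8), rdcc_nslots, rdcc_nbytes, rdcc_w0 */
    H5Pset_cache(fapl, 0, 521, (size_t)12 * 1024 * 1024, 0.75);
    hid_t file = H5Fopen("test.h5", H5F_ACC_RDONLY, fapl);

The cache only helps if rdcc_nbytes is at least the size of one chunk
(~11.8 MB here).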

I hope this is helpful,
-Neil


Hi Neil,

On Tuesday 31 March 2009, Neil Fortner wrote:

This is happening because (in the "time for two trailing indices"
case) the individual chunks are not contiguous in memory, as Ger
pointed out. Also, because the chunks are larger than the chunk cache
size (default = 1 MB), the library makes a best effort to avoid
allocating enough memory to hold an entire chunk, and therefore reads
directly from the file into the supplied read buffer. Because the
selection in the read buffer (for each chunk) is a series of small
non-contiguous blocks, the library must make a large number of small
reads.

I see. That makes sense.

To improve performance, you can increase the chunk cache size with
H5Pset_cache (or the new H5Pset_chunk_cache function if you're using
the latest snapshot). The test runs in about 0.7 seconds with this
change on my laptop, down from ~30 seconds. This is still more time
than the contiguous case takes, because the library must allocate the
extra space and scatter each element individually from the cache to
the read buffer, but it now calls read only once for each chunk.

Yes, it works a lot better now! However, after setting the chunk cache
size to 12 MB (a bit larger than my chunk size, which is 11.8 MB), the
performance is still a long way from optimal, IMHO. Look at these
numbers:

For a default chunk cache size (1 MB):

time for [0:2,:,:,0] --> 0.15
time for [0,:,:,0:2] --> 7.615

With an increased chunk cache size (12 MB):

time for [0:2,:,:,0] --> 0.165
time for [0,:,:,0:2] --> 1.312

So, although the new time is around 6x better, it is still almost 8x
slower than the contiguous case. To simulate the time it might take
to scatter each element from the cache to the read buffer, I've
measured a similar process with NumPy:

In [28]: a = np.arange(1978*1556*2, dtype="int32")

In [29]: b = np.empty(1978*1556*2, dtype="int32")

In [30]: timeit b[:b.size/2] = a[::2]
10 loops, best of 3: 40.7 ms per loop

In [31]: timeit b[b.size/2:] = a[1::2]
10 loops, best of 3: 40.1 ms per loop

In [32]: a
Out[32]: array([      0,       1,       2, ..., 6155533, 6155534, 6155535])

In [33]: b
Out[33]: array([      0,       2,       4, ..., 6155531, 6155533, 6155535])

As can be seen, the scatter process completes in around 40 + 40 = 80 ms
for my chunk size. Hence, I'd expect the second read case to complete
in around 0.17 + 0.08 = 0.25 seconds, while the actual time (1.3 s) is
still 5x slower than that. So, unless I'm missing something, my guess
is that the scatter code in the HDF5 library could be made a lot
faster.

Thanks,


--
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the fightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dykstra
   "On the cruelty of really teaching computer science"


Francesc,

Thanks for the investigation. The routines currently used for partial I/O are very general, and there is a substantial amount of overhead when the sequences are very short (4 bytes in this case). I have filed an enhancement request to add optimized routines to handle this case, and hopefully we will get to it soon.

The optimized code would be used whenever:
- There is at most one hyperslab selection in the source and destination (and no points), and:
- Either:
    - The selections in the source and destination have the same dimensions (but not necessarily offsets), or:
    - The selection in either the source or destination is contiguous (in the serialized form)
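
For instance, a whole-buffer read like the following sketch
(identifiers as in the earlier sketch, and purely illustrative) would
fall under the second condition, since the destination selection is
contiguous:

    hsize_t start[4] = {0, 0, 0, 0};
    hsize_t count[4] = {1, 1978, 1556, 2};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(4, count, NULL);  /* fully selected: contiguous */
    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);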

For chunked datasets this would of course be applied on a per-chunk basis. Does this seem like it would fit most of your use cases?

Thanks,
-Neil


On Wednesday 01 April 2009, Neil Fortner wrote:

[...]

For chunked datasets this would of course be applied on a per-chunk
basis. Does this seem like it would fit most of your use cases?

I think so, yes. These requirements would cover a very important use
case. In my experience, when one asks for performance, accessing data
within the above restrictions is very reasonable and common practice.

Thanks!


--
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the fightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dykstra
   "On the cruelty of really teaching computer science"
