Suggestion for filters and efficiency

Francesc_Alted2 · April 30, 2009, 4:13pm

Hi,

I'm experimenting with very simple but fast compression algorithms,
with the aim of writing a third-party filter for HDF5. During my
implementation, it seems that the additional copy that HDF5 requires in
order to transfer the data from the filter output buffer to the final
destination data area is going to have some impact on the final
performance (for decompression, mainly).

I'm wondering if it would be feasible to accept a new filter signature
in which the output buffer passed to the filter is already the final
data destination (i.e. the data area to be returned to the user).
However, I presume this is going to be difficult for two reasons:

1) Adding the possibility of the new signature could be complex and may
require an API change in HDF5.

2) I suppose that it is very unlikely that the destination area for the
filter would coincide with the final data area because the HDF5 data
type conversion machinery could still be in the middle.

Apart from 1), I suppose that if my guess in 2) is correct, perhaps the
type conversion could still be inhibited in the case that a conversion
would not be necessary (i.e. when disk types and memory types match).

Thoughts?

--
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dijkstra
"On the cruelty of really teaching computer science"

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

andrew.collette · April 30, 2009, 5:33pm

Hi Francesc,

> I'm experimenting with very simple but fast compression algorithms,
> with the aim of writing a third-party filter for HDF5. During my
> implementation, it seems that the additional copy that HDF5 requires in
> order to transfer the data from the filter output buffer to the final
> destination data area is going to have some impact on the final
> performance (for decompression, mainly).

Is this LZO, or do you have something even faster in the works?

> 2) I suppose that it is very unlikely that the destination area for the
> filter would coincide with the final data area because the HDF5 data
> type conversion machinery could still be in the middle.

If you're talking about the copy HDF5 has to do to write to the user's
data area, it seems like there's an even bigger obstacle; the
destination dataspace isn't guaranteed to be contiguous. You'd have
to assume responsibility for scattering each chunk to its destination,
or restrict yourself to the case of contiguous memory for read/write.

> Apart from 1), I suppose that if my guess in 2) is correct, perhaps the
> type conversion could still be inhibited in the case that a conversion
> would not be necessary (i.e. when disk types and memory types match).

I know type conversion for e.g. H5Tconvert (and for any custom
conversion callbacks you write) is performed in-place. This might
help avoid extra copying.

Andrew Collette

Quincey_Koziol · May 1, 2009, 1:32am

Hi Francesc,

On Apr 30, 2009, at 11:13 AM, Francesc Alted wrote:

> Hi,
>
> I'm experimenting with very simple but fast compression algorithms,
> with the aim of writing a third-party filter for HDF5. During my
> implementation, it seems that the additional copy that HDF5 requires in
> order to transfer the data from the filter output buffer to the final
> destination data area is going to have some impact on the final
> performance (for decompression, mainly).
>
> I'm wondering if it would be feasible to accept a new filter signature
> in which the output buffer passed to the filter is already the final
> data destination (i.e. the data area to be returned to the user).

No, this isn't going to work. At least, not with the current architecture of the library. The filters are applied to chunks before any of the higher-level code sees the data elements.

> However, I presume this is going to be difficult for two reasons:
>
> 1) Adding the possibility of the new signature could be complex and may
> require an API change in HDF5.

More than an API change, certainly...

> 2) I suppose that it is very unlikely that the destination area for the
> filter would coincide with the final data area because the HDF5 data
> type conversion machinery could still be in the middle.

It's definitely in the middle, along with the gather/scatter code for handling dataspace selections.

> Apart from 1), I suppose that if my guess in 2) is correct, perhaps the
> type conversion could still be inhibited in the case that a conversion
> would not be necessary (i.e. when disk types and memory types match).

It would be possible to optimize for a situation where there was no datatype conversion and the dataspace's gather/scatter was equivalent to a memcpy(), but I think that's a small percentage of the use cases and probably would argue against the effort of making the internal architecture changes necessary to support it.

Quincey

Francesc_Alted2 · April 30, 2009, 6:46pm

Hi Andrew,

On Thursday 30 April 2009, you wrote:

> Hi Francesc,
>
> > I'm experimenting with very simple but fast compression algorithms,
> > with the aim of writing a third-party filter for HDF5. During my
> > implementation, it seems that the additional copy that HDF5 requires
> > in order to transfer the data from the filter output buffer to the
> > final destination data area is going to have some impact on the
> > final performance (for decompression, mainly).
>
> Is this LZO, or do you have something even faster in the works?

Well, I can tell you that what I'm cooking is a sort of mix of a fast
compressor/decompressor (based on the excellent FastLZ) and shuffling
capabilities in an integrated package. This combination is pretty fast
for many datasets, reaching decompression speeds that are similar to a
plain memcpy (especially for 4-byte and 8-byte wide data types).
This is why I am trying to avoid as many copies as I can.
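
To make the shuffling idea concrete, here is a minimal Python sketch (not the actual filter code). It assumes the usual byte-shuffle layout, in which byte 0 of every element is stored first, then byte 1, and so on. The names `shuffle` and `unshuffle` and the sample data are illustrative only, and zlib merely stands in for a FastLZ-style compressor because it ships with Python.

```python
import struct
import zlib


def shuffle(data: bytes, typesize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(data) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data: bytes, typesize: int) -> bytes:
    """Inverse of shuffle(): interleave the byte planes back into elements."""
    if len(data) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    n = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * n:(i + 1) * n]
    return bytes(out)


# Hypothetical sample: 10,000 slowly varying 4-byte little-endian integers.
raw = struct.pack("<10000i", *range(10000))
shuffled = shuffle(raw, 4)
assert unshuffle(shuffled, 4) == raw  # the transform is lossless

# Compare how well the stand-in compressor does on each layout.
print("plain   :", len(zlib.compress(raw)), "bytes")
print("shuffled:", len(zlib.compress(shuffled)), "bytes")
```

On slowly varying data like this, the shuffled buffer usually compresses better, because the high-order bytes of neighbouring elements end up in long, repetitive runs.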

> > 2) I suppose that it is very unlikely that the destination area for
> > the filter would coincide with the final data area because the HDF5
> > data type conversion machinery could still be in the middle.
>
> If you're talking about the copy HDF5 has to do to write to the
> user's data area, it seems like there's an even bigger obstacle; the
> destination dataspace isn't guaranteed to be contiguous. You'd have
> to assume responsibility for scattering each chunk to its
> destination, or restrict yourself to the case of contiguous memory
> for read/write.

Yeah, I supposed that this would be difficult to do for every case. I
was thinking of situations where the user wants a contiguous part of
the underlying disk dataset, and is providing a contiguous (and
properly aligned) buffer as the memory container too. I don't know
whether this scenario can be communicated to HDF5 or not.

> > Apart from 1), I suppose that if my guess in 2) is correct, perhaps
> > the type conversion could still be inhibited in the case that a
> > conversion would not be necessary (i.e. when disk types and memory
> > types match).
>
> I know type conversion for e.g. H5Tconvert (and for any custom
> conversion callbacks you write) is performed in-place. This might
> help avoid extra copying.

Aha, could be.

Finally, I've been thinking that, provided that the contiguous user
buffer can be used, and in order to avoid an API change, HDF5 could use
*that* buffer to hold the compressed data and then let the filter do an
in-place decompression. Of course, as in-place decompression is
difficult, the filter would have to make an internal copy of the
compressed data first, but this could be much faster than copying the
decompressed data itself (i.e. there is less data to be copied). But I
suppose this is going to be really difficult, as it would mean a
major change in the HDF5 data workflow :-\

--
Francesc Alted

Quincey_Koziol · May 1, 2009, 1:34am

Hi Francesc,

On Apr 30, 2009, at 1:46 PM, Francesc Alted wrote:

> Yeah, I supposed that this would be difficult to do for every case. I
> was thinking of situations where the user wants a contiguous part of
> the underlying disk dataset, and is providing a contiguous (and
> properly aligned) buffer as the memory container too. I don't know
> whether this scenario can be communicated to HDF5 or not.

As I mentioned in my last note, it is communicated to the internals of the raw data I/O routines, but I think it's going to require quite a bit more work to carry down further, to the filter code for chunked dataset I/O.

> Finally, I've been thinking that, provided that the contiguous user
> buffer can be used, and in order to avoid an API change, HDF5 could use
> *that* buffer to hold the compressed data and then let the filter do an
> in-place decompression. Of course, as in-place decompression is
> difficult, the filter would have to make an internal copy of the
> compressed data first, but this could be much faster than copying the
> decompressed data itself (i.e. there is less data to be copied). But I
> suppose this is going to be really difficult, as it would mean a
> major change in the HDF5 data workflow :-\

Yup.

Quincey

Francesc_Alted2 · May 1, 2009, 9:11am

Hi Quincey,

On Friday 01 May 2009, Quincey Koziol wrote:

> > 1) Adding the possibility of the new signature could be complex and
> > may require an API change in HDF5.
>
> More than an API change, certainly...
>
> > 2) I suppose that it is very unlikely that the destination area for
> > the filter would coincide with the final data area because the HDF5
> > data type conversion machinery could still be in the middle.
>
> It's definitely in the middle, along with the gather/scatter code
> for handling dataspace selections.
>
> > Apart from 1), I suppose that if my guess in 2) is correct, perhaps
> > the type conversion could still be inhibited in the case that a
> > conversion would not be necessary (i.e. when disk types and memory
> > types match).
>
> It would be possible to optimize for a situation where there was no
> datatype conversion and the dataspace's gather/scatter was equivalent
> to a memcpy(), but I think that's a small percentage of the use cases
> and probably would argue against the effort of making the internal
> architecture changes necessary to support it.

Ok. So to recap, with the current HDF5 data workflow and for a
decompression filter, there is a need to do at least 3 internal copies
during the compressed read pipeline, namely:

1) The first one performed internally by the decompression filter

2) Another one for the type conversion layer

3) Finally, another for the gather/scatter layer

In addition, if I understand correctly, even 2) and 3) risk
performing the copy in a non-optimized way (i.e. the trivial conversion
cases might not be translated into a straight system memcpy() call).

Unfortunately, that's worse than I feared, and it would definitely
render a decompressor similar in speed to memcpy() almost useless for
my purposes. My idea was to use it to operate on highly compressible
on-disk datasets almost as fast as if they were in memory. However,
this definitely seems not to be possible (with the current HDF5 at
least). Having said that, operating at 1/3 of memcpy() speed still
seems very compelling.

Thanks for the feedback anyway,

--
Francesc Alted

Quincey_Koziol · May 1, 2009, 11:34am

Hi Francesc,

On May 1, 2009, at 4:11 AM, Francesc Alted wrote:

> Ok. So to recap, with the current HDF5 data workflow and for a
> decompression filter, there is a need to do at least 3 internal copies
> during the compressed read pipeline, namely:
>
> 1) The first one performed internally by the decompression filter
>
> 2) Another one for the type conversion layer
>
> 3) Finally, another for the gather/scatter layer

Actually, there's only 2 internal copies - there's no extra internal buffer for 3), there's specialized code for performing a simultaneous gather/scatter directly from the source buffer to the destination when there's no type conversion. It should be close to the speed of a memcpy()...

> In addition, if I understand correctly, even 2) and 3) risk
> performing the copy in a non-optimized way (i.e. the trivial conversion
> cases might not be translated into a straight system memcpy() call).
>
> Unfortunately, that's worse than I feared, and it would definitely
> render a decompressor similar in speed to memcpy() almost useless for
> my purposes. My idea was to use it to operate on highly compressible
> on-disk datasets almost as fast as if they were in memory. However,
> this definitely seems not to be possible (with the current HDF5 at
> least). Having said that, operating at 1/3 of memcpy() speed still
> seems very compelling.

I would spend some time profiling the code for your use case(s) and see if the compression time was the bottleneck, before going forward/giving up.

Quincey

Francesc_Alted2 · May 1, 2009, 8:41pm

Quincey,

On Friday 01 May 2009, Quincey Koziol wrote:

> > Ok. So to recap, with the current HDF5 data workflow and for a
> > decompression filter, there is a need to do at least 3 internal
> > copies during the compressed read pipeline, namely:
> >
> > 1) The first one performed internally by the decompression filter
> >
> > 2) Another one for the type conversion layer
> >
> > 3) Finally, another for the gather/scatter layer
>
> Actually, there's only 2 internal copies - there's no extra internal
> buffer for 3), there's specialized code for performing a simultaneous
> gather/scatter directly from the source buffer to the destination
> when there's no type conversion. It should be close to the speed of
> a memcpy()...

Uh, you lost me. If there is a source buffer and a destination one,
then there must necessarily be a copy, right? Or do you mean that, for
the special case not needing type conversion, 2) and 3) would collapse
into a single copy? In that case, this would be really great news.
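
As a rough illustration of what such a collapsed step could look like, here is a minimal Python sketch in which the decompressed chunk is copied exactly once into the destination, run by run, with no intermediate buffer. This is not HDF5's internal code: `scatter_chunk` and the description of the selection as a list of byte offsets are hypothetical simplifications.

```python
def scatter_chunk(chunk: bytes, dest: bytearray, offsets, run_len: int) -> int:
    """Copy consecutive runs of `chunk` into `dest` at the given byte offsets.

    Each byte is moved once, straight from the source buffer to its final
    position. Returns the number of bytes copied.
    """
    src = memoryview(chunk)
    dst = memoryview(dest)
    pos = 0
    for off in offsets:
        dst[off:off + run_len] = src[pos:pos + run_len]
        pos += run_len
    return pos


# Strided destination (hypothetical selection): 2-byte runs every 4 bytes.
user_buf = bytearray(12)
scatter_chunk(b"abcdef", user_buf, offsets=[0, 4, 8], run_len=2)
print(bytes(user_buf))

# Contiguous destination: the scatter degenerates into one plain copy.
flat_buf = bytearray(6)
scatter_chunk(b"abcdef", flat_buf, offsets=[0], run_len=6)
print(bytes(flat_buf))
```

In the contiguous case the loop runs once, which is why this path can be expected to cost about as much as a single memcpy().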

> > In addition, if I understand correctly, even 2) and 3) risk
> > performing the copy in a non-optimized way (i.e. the trivial
> > conversion cases might not be translated into a straight system
> > memcpy() call).
> >
> > Unfortunately, that's worse than I feared, and it would definitely
> > render a decompressor similar in speed to memcpy() almost useless
> > for my purposes. My idea was to use it to operate on highly
> > compressible on-disk datasets almost as fast as if they were in
> > memory. However, this definitely seems not to be possible (with
> > the current HDF5 at least). Having said that, operating at 1/3 of
> > memcpy() speed still seems very compelling.
>
> I would spend some time profiling the code for your use case(s) and
> see if the compression time was the bottleneck, before going forward/
> giving up.

Yeah, definitely. With the 2-copy scenario I'm a bit more optimistic,
but some measurements should be done so as to rule out other possible
bottlenecks. I'll report my findings here.

Thanks!

--
Francesc Alted

Quincey_Koziol · May 4, 2009, 8:56pm

Hi Francesc,

On May 1, 2009, at 3:41 PM, Francesc Alted wrote:

> Uh, you lost me. If there is a source buffer and a destination one,
> then there must necessarily be a copy, right? Or do you mean that, for
> the special case not needing type conversion, 2) and 3) would collapse
> into a single copy? In that case, this would be really great news.

Yes, that's what I meant.

Quincey

Francesc_Alted2 · May 5, 2009, 5:04pm

Hi Quincey,

On Monday 04 May 2009, Quincey Koziol wrote:

>>>> 1) The first one performed internally by the decompression filter
>>>>
>>>> 2) Another one for the type conversion layer
>>>>
>>>> 3) Finally, another for the gather/scatter layer
>>>
>>> Actually, there's only 2 internal copies - there's no extra
>>> internal buffer for 3), there's specialized code for performing a
>>> simultaneous gather/scatter directly from the source buffer to the
>>> destination when there's no type conversion. It should be close
>>> to the speed of a memcpy()...
>>
>> Uh, you lost me. If there is a source buffer and a destination
>> one, then there must necessarily be a copy, right? Or do you mean
>> that, for the special case not needing type conversion, 2) and 3)
>> would collapse into a single copy? In that case, this would be
>> really great news.
>
> Yes, that's what I meant.

Ok. I've done some preliminary benchmarking and profiling, and the
results seem to confirm your predictions: for the trivial case where
there is neither a scatter/gather operation nor a type conversion, HDF5
apparently needs just 2 additional memcpy() operations per chunk.
And, as the chunksize normally fits comfortably in the L2 cache of
modern CPUs, the additional memcpy() calls over the chunks are pretty
fast.

Here are some figures that I'm getting with my compression filter.
I'm creating a 1 GB file of (very compressible) floats and reading it
back afterwards. The output shows the speeds of the read/write
operations for several chunksizes.

Chunksize of 8 KB:
Time for writing file of 1024 MB: 3 s (341.3 MB/s)
Time for reading file of 1024 MB: 1.51 s (678.1 MB/s)

Chunksize of 32 KB:
Time for writing file of 1024 MB: 1.57 s (652.2 MB/s)
Time for reading file of 1024 MB: 0.76 s (1347.4 MB/s)

Chunksize of 128 KB:
Time for writing file of 1024 MB: 1.09 s (939.4 MB/s)
Time for reading file of 1024 MB: 0.4 s (2560.0 MB/s)

Chunksize of 512 KB:
Time for writing file of 1024 MB: 1.25 s (819.2 MB/s)
Time for reading file of 1024 MB: 0.59 s (1735.6 MB/s)
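
The throughput figures above follow directly from the file size and the elapsed times. A small Python sketch that recomputes them from the reported timings (the variable and function names are just for illustration):

```python
FILE_SIZE_MB = 1024

# chunk size in KB -> (write seconds, read seconds), as reported above
timings = {
    8: (3.0, 1.51),
    32: (1.57, 0.76),
    128: (1.09, 0.4),
    512: (1.25, 0.59),
}


def throughput(seconds: float, size_mb: float = FILE_SIZE_MB) -> float:
    """Throughput in MB/s for moving `size_mb` megabytes in `seconds`."""
    return size_mb / seconds


rates = {kb: (throughput(w), throughput(r)) for kb, (w, r) in timings.items()}

for kb, (write_rate, read_rate) in rates.items():
    print(f"{kb:>4} KB  write {write_rate:7.1f} MB/s  read {read_rate:7.1f} MB/s")

# Chunk size with the highest read (decompression) throughput.
best_read = max(rates, key=lambda kb: rates[kb][1])
print("fastest read at chunk size:", best_read, "KB")
```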

Initially I thought that a small chunksize (8 KB) would be better, as it
would fit in the L1 cache of my processor (which is much faster than
its L2 counterpart). However, a look at the decompression profiles
(done with cachegrind, attached) seems to indicate that the overhead of
doing more calls, and probably a bigger HDF5 B-tree, makes small
chunksizes rather slower. For large chunksizes (512 KB) it seems that
the number of reads/writes grows significantly during the decompression
process. I'm not certain about this latter effect, but it is there.

The optimal chunksize for this case seems to be 128 KB, where a 2.5 GB/s
decompression speed can be attained. My initial benchmarks without the
HDF5 layer showed that the top speed with such a chunksize was around
4.6 GB/s. So, it seems clear that the 2 additional memcpy() operations
are mainly responsible for the slowdown.

Of course, all of this is for very compressible data, but I wanted to
know the effect of HDF5 in this 'worst' case. For more 'real'
(i.e. less compressible) data, the effect of the HDF5 layer on the
decompression filter speed should be even less noticeable. So, it
seems that the HDF5 layers are not too intrusive for today's
processors (although the effect is *already* noticeable). However, I'd
say that this will eventually become a serious problem as future
compressors/processors become much more effective at
compressing/decompressing binary data, so it would be nice to keep this
in mind for future HDF5 versions.

Cheers,

blosc_8k.cg (13.7 KB)

blosc_512k.cg (1.9 KB)

blosc_32k.cg (10.3 KB)

blosc_128k.cg (7.04 KB)

--
Francesc Alted

Quincey_Koziol · May 5, 2009, 11:20pm

Hi Francesc,

On May 5, 2009, at 12:04 PM, Francesc Alted wrote:

> Of course, all of this is for very compressible data, but I wanted to
> know the effect of HDF5 in this 'worst' case. For more 'real'
> (i.e. less compressible) data, the effect of the HDF5 layer on the
> decompression filter speed should be even less noticeable. So, it
> seems that the HDF5 layers are not too intrusive for today's
> processors (although the effect is *already* noticeable). However, I'd
> say that this will eventually become a serious problem as future
> compressors/processors become much more effective at
> compressing/decompressing binary data, so it would be nice to keep this
> in mind for future HDF5 versions.

Interesting results... Thanks for looking at this more. Sometime we might have enough funding/motivation to optimize all the way down to decompressing directly into the application buffer, but I'm not yet convinced that there aren't other low-hanging fruit to gather first.

Quincey

Francesc_Alted2 · May 6, 2009, 10:30am

Quincey,

On Wednesday 06 May 2009, Quincey Koziol wrote:
[clip]

> > Of course, all of this is for very compressible data, but I wanted
> > to know the effect of HDF5 in this 'worst' case. For more 'real'
> > (i.e. less compressible) data, the effect of the HDF5 layer on the
> > decompression filter speed should be even less noticeable. So, it
> > seems that the HDF5 layers are not too intrusive for today's
> > processors (although the effect is *already* noticeable). However,
> > I'd say that this will eventually become a serious problem as
> > future compressors/processors become much more effective at
> > compressing/decompressing binary data, so it would be nice to keep
> > this in mind for future HDF5 versions.
>
> Interesting results... Thanks for looking at this more. Sometime
> we might have enough funding/motivation to optimize all the
> way down to decompressing directly into the application buffer, but
> I'm not yet convinced that there aren't other low-hanging fruit to
> gather first.

And your reluctance to believe is probably well founded. In fact,
I've miscounted the number of additional memcpy() operations.
Initially I assumed an implicit 'copy' between the compressed chunk
buffer and the decompressed buffer, but this is unavoidable for every
compressor/decompressor, so it should not be counted as an additional
copy. So, for the simplest case, HDF5 only makes *1* additional
memcpy() (and this number is compatible with my measurements), which,
albeit not optimal, is close.
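
A back-of-the-envelope way to look at this accounting, using the figures reported earlier in the thread (about 2.5 GB/s through HDF5 versus about 4.6 GB/s for the decompressor alone): the Python sketch below assumes a simple serial model in which the time per GB is the decompression time plus the time of the extra copies, and in which all of the HDF5 overhead is copying. Both assumptions, and the function name, are illustrative rather than measured facts.

```python
def implied_copy_bandwidth(total_gbps: float, decompress_gbps: float,
                           n_copies: int = 1) -> float:
    """GB/s each extra copy must sustain under a serial, additive cost model."""
    # Seconds per GB spent outside the decompressor itself.
    overhead = 1.0 / total_gbps - 1.0 / decompress_gbps
    if overhead <= 0:
        raise ValueError("total throughput must be below decompressor throughput")
    return n_copies / overhead


for n in (1, 2):
    bw = implied_copy_bandwidth(2.5, 4.6, n_copies=n)
    print(f"{n} extra copy/copies -> about {bw:.2f} GB/s per copy")
```

Under this model, one extra copy corresponds to a copy bandwidth of roughly 5.5 GB/s, while two extra copies would each have to run at roughly 11 GB/s; which of the two is plausible depends on the memcpy() speed of the machine used.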

Besides, as the only way of improving this is to decompress directly
into the application buffer, and as this implies significant changes in
HDF5, I no longer think this is that urgent either.

Thanks for all the feedback,

--
Francesc Alted