On Monday 04 May 2009, Quincey Koziol wrote:
>>> 1) The first one performed internally by the decompression filter
>>> 2) Another one for the type conversion layer
>>> 3) Finally, another for the gather/scatter layer
>> Actually, there's only 2 internal copies - there's no extra
>> internal buffer for 3), there's specialized code for performing a
>> simultaneous gather/scatter directly from the source buffer to the
>> destination when there's no type conversion. It should be close
>> to the speed of a memcpy()...
> Uh, you lost me. If there is a source buffer and a destination
> one, then there should necessarily be a copy, right? Or you mean
> that for the special case not needing type conversion 2) and 3)
> would collapse into a single copy? In that case, this would be
> really great news.
Yes, that's what I meant.
OK. I've run some preliminary benchmarks and profiles, and they seem
to confirm your predictions: for the trivial case where there is
neither a scatter/gather operation nor a type conversion, HDF5
apparently needs just 2 additional memcpy() operations per chunk.
And, as the chunksize normally fits comfortably in the level 2 cache of
modern CPUs, the additional memcpy() calls over the chunks are pretty fast.
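To make the "2 additional copies per chunk" idea concrete, here is a rough model in Python. It is only a sketch and not HDF5's actual code path: zlib stands in for the compression filter, and the function name read_dataset and the buffer names are hypothetical. It just counts the two extra buffer copies that happen per chunk on top of the decompression itself.

```python
import zlib

def read_dataset(compressed_chunks, chunk_size):
    """Decompress chunks into one destination buffer, counting extra copies.

    Hypothetical model: zlib plays the role of the compression filter.
    """
    dest = bytearray(chunk_size * len(compressed_chunks))
    extra_copies = 0
    for i, blob in enumerate(compressed_chunks):
        scratch = zlib.decompress(blob)   # the decompression work itself
        chunk_buf = bytearray(scratch)    # copy 1: filter output handed to the library
        extra_copies += 1
        start = i * chunk_size
        dest[start:start + chunk_size] = chunk_buf  # copy 2: chunk buffer -> user buffer
        extra_copies += 1
    return bytes(dest), extra_copies

chunk_size = 128 * 1024
data = bytes(range(256)) * (4 * chunk_size // 256)  # exactly 4 chunks of data
chunks = [zlib.compress(data[i:i + chunk_size])
          for i in range(0, len(data), chunk_size)]

result, extra_copies = read_dataset(chunks, chunk_size)
print("extra copies per chunk:", extra_copies // len(chunks))
```

Each copy touches only one chunk-sized buffer at a time, which is why a chunk that fits in the L2 cache keeps these copies cheap.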
Here are some figures that I'm getting with my compression filter.
I'm creating a 1 GB file of (very compressible) floats and reading it
back afterwards. The output shows the speeds of the read/write
operations for several chunksizes.
Chunksize of 8 KB:
Time for writing file of 1024 MB: 3 s (341.3 MB/s)
Time for reading file of 1024 MB: 1.51 s (678.1 MB/s)
Chunksize of 32 KB:
Time for writing file of 1024 MB: 1.57 s (652.2 MB/s)
Time for reading file of 1024 MB: 0.76 s (1347.4 MB/s)
Chunksize of 128 KB:
Time for writing file of 1024 MB: 1.09 s (939.4 MB/s)
Time for reading file of 1024 MB: 0.4 s (2560.0 MB/s)
Chunksize of 512 KB:
Time for writing file of 1024 MB: 1.25 s (819.2 MB/s)
Time for reading file of 1024 MB: 0.59 s (1735.6 MB/s)
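The MB/s figures above are simply the file size divided by the elapsed time. A small Python check, using only the timings listed above (the names timings and throughput are just for this sketch):

```python
FILE_SIZE_MB = 1024

# chunksize in KB -> (write seconds, read seconds), taken from the figures above
timings = {
    8: (3.0, 1.51),
    32: (1.57, 0.76),
    128: (1.09, 0.4),
    512: (1.25, 0.59),
}

def throughput(seconds, size_mb=FILE_SIZE_MB):
    """Return the speed in MB/s for moving size_mb megabytes in `seconds`."""
    return size_mb / seconds

for kb, (write_s, read_s) in timings.items():
    print(f"{kb:>4} KB  write {throughput(write_s):7.1f} MB/s"
          f"  read {throughput(read_s):7.1f} MB/s")

# Chunksize with the fastest read among those measured
best_read_kb = max(timings, key=lambda kb: throughput(timings[kb][1]))
print("fastest read at", best_read_kb, "KB")
```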
Initially I thought that a small chunksize (8 KB) would be better, as it
would fit in the level 1 cache of my processor (which is much faster than
its L2 counterpart). However, a look at the decompression profiles
(done with cachegrind, attached) seems to indicate that the overhead of
making more calls, and probably of a bigger HDF5 B-tree, makes small
chunksizes rather slower. For large chunksizes (512 KB), the number of
reads/writes seems to grow significantly during the decompression
process. I'm not certain about the cause of this latter effect, but it is there.
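To give a feel for the "more calls, bigger B-tree" argument, this small Python sketch counts how many chunks the 1 GB file is split into at each chunksize. It assumes the file holds exactly 1024 MB of uncompressed data and that each chunk costs one filter call and one B-tree entry; that is a simplification, not a statement about HDF5 internals.

```python
FILE_SIZE_BYTES = 1024 * 1024 * 1024  # 1 GB of uncompressed data (assumed exact)

def chunk_count(chunk_kb, file_size=FILE_SIZE_BYTES):
    """Number of chunks needed to hold file_size bytes at the given chunksize."""
    return file_size // (chunk_kb * 1024)

for kb in (8, 32, 128, 512):
    print(f"{kb:>4} KB chunks -> {chunk_count(kb):>7} chunks")
```

Going from 128 KB down to 8 KB multiplies the number of chunks (and hence per-chunk calls) by 16.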
The optimal chunksize for this case seems to be 128 KB, where a 2.5 GB/s
decompression speed can be attained. My initial benchmarks without the
HDF5 layer showed that the top speed with such a chunksize was around
4.6 GB/s. So, it seems clear that the 2 additional memcpy() operations
are mainly responsible for the slowdown.
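As a back-of-the-envelope check of that conclusion, the Python sketch below works out what copy speed the two memcpy() passes would need to have if they accounted for the whole gap between 4.6 GB/s and 2.5 GB/s. The assumptions are mine: 1 GB = 1024 MB (consistent with 2560 MB/s being quoted as 2.5 GB/s), and all of the extra time is spent in two full passes over the data.

```python
FILE_SIZE_MB = 1024
raw_speed_mb_s = 4.6 * 1024   # 4.6 GB/s without the HDF5 layer, 1 GB = 1024 MB
hdf5_read_time_s = 0.4        # measured read time with 128 KB chunks

# Time the bare decompressor would need for the same 1024 MB
raw_time_s = FILE_SIZE_MB / raw_speed_mb_s

# Extra time spent once HDF5 is in the path
overhead_s = hdf5_read_time_s - raw_time_s

# Assumption: the overhead is exactly two memcpy() passes over all the data
implied_copy_speed_gb_s = 2 * FILE_SIZE_MB / overhead_s / 1024

print(f"raw decompression time: {raw_time_s:.3f} s")
print(f"HDF5 overhead:          {overhead_s:.3f} s")
print(f"implied memcpy speed:   {implied_copy_speed_gb_s:.1f} GB/s")
```

If the implied copy speed looks plausible for in-cache copies on the machine in question, the two-memcpy explanation holds together; if not, something else is contributing too.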
Of course, all of this is for very compressible data, but I wanted to
know the effect of HDF5 for this 'worst' case. For more 'real' data
(i.e. less compressible), the effect of the HDF5 layer on the
decompression filter speed should be even less noticeable. So, it
seems that the HDF5 layers are not too intrusive for today's
processors (although the overhead is *already* noticeable). However, I'd
say that this will eventually become a serious problem when future
compressors/processors become much more efficient at
compressing/decompressing binary data, so it would be nice to keep this
in mind for future HDF5 versions.
blosc_8k.cg (13.7 KB)
blosc_512k.cg (1.9 KB)
blosc_32k.cg (10.3 KB)
blosc_128k.cg (7.04 KB)
"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."
-- Edsger W. Dijkstra
"On the cruelty of really teaching computer science"