On Monday 26 November 2007, Dominik Szczerba wrote:
[snip]
> And I would also be interested to know what can go wrong if I use
> the size of the dataset as the chunk size (so one big chunk). Will it
> possibly lead to memory problems, and if so, how will HDF5 deal with
> it?
Well, it depends on the extent of your dataset, but I'd say that using
a single chunk for the complete dataset is in general a very bad idea.
The thing is, if you are using compression and try to randomly access
your dataset, the whole chunk containing the slice you want has to be
decompressed before the slice can be retrieved; with a single chunk that
means decompressing the complete dataset, which can be overkill. If your
data access pattern is purely sequential, this effect is not as
important, but there is still significant overhead coming from the fact
that the compressor has to deal with larger chunks. In general, my
experience says that it is best to use moderately sized (contained)
chunks.
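To make that concrete, here is a minimal PyTables sketch (the file name,
array name and sizes are made up for illustration, and this is not the
attached script) of asking for an explicit, contained chunkshape instead
of letting a single chunk cover the whole dataset:

    import numpy as np
    import tables

    # zlib + shuffle, the same filter combination used in the experiment below.
    filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)

    N = 25_000_000        # ~200 MB of float64 (illustrative, smaller than 2 GB)
    CHUNKLEN = 8 * 1024   # 8192 float64 values * 8 bytes = 64 KB per chunk

    with tables.open_file('chunked.h5', mode='w') as f:
        # Contained chunkshape: a random slice only forces decompression of
        # the one or two 64 KB chunks it actually touches.
        arr = f.create_carray(f.root, 'data',
                              atom=tables.Float64Atom(), shape=(N,),
                              chunkshape=(CHUNKLEN,), filters=filters)
        # chunkshape=(N,) instead would turn the array into a single chunk,
        # so reading arr[42:50] would have to decompress all of it first.
        for start in range(0, N, CHUNKLEN):
            stop = min(start + CHUNKLEN, N)
            arr[start:stop] = np.random.rand(stop - start)

    with tables.open_file('chunked.h5', mode='r') as f:
        piece = f.root.data[1_000_000:1_000_100]   # touches a single chunk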
I've done a small experiment in order to determine 'optimal' chunk sizes.
I used PyTables for this, but the results should be fairly applicable to
HDF5 in general. I'm attaching the script, its output and some plots of
the output. With a 2 GB dataset, it is apparent that using a chunk size
between 32 KB and 128 KB gives the best results. Use 32 KB if you want
to optimize creation and random access, while 128 KB might be more
appropriate if sequential reads are the most important thing for you.
Also, the compression ratio can be positively affected by the larger
chunk size (128 KB). In any case, it doesn't seem useful to use chunk
sizes larger than 128 KB (unless you have *really* large datasets).
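If you want to repeat the measurement for your own data and access
pattern, the gist of it is a loop like the following rough sketch
(hypothetical file and array names; this is not the attached
optimal-chunksize.py, and random data is only a stand-in, so the file
sizes it reports say little about real compression ratios):

    import os
    import time
    import numpy as np
    import tables

    filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
    N = 25_000_000                        # scaled-down dataset for a quick test
    data = np.random.rand(N)              # stand-in: use your own data here

    for chunk_kb in (16, 32, 64, 128, 256, 512):
        chunklen = chunk_kb * 1024 // 8   # float64 elements per chunk

        # Creation (write) time.
        t0 = time.time()
        with tables.open_file('bench.h5', mode='w') as f:
            arr = f.create_carray(f.root, 'data', atom=tables.Float64Atom(),
                                  shape=(N,), chunkshape=(chunklen,),
                                  filters=filters)
            arr[:] = data
        t_create = time.time() - t0
        size_mb = os.path.getsize('bench.h5') / 2**20

        with tables.open_file('bench.h5', mode='r') as f:
            arr = f.root.data

            # Sequential read: one pass over the whole array.
            t0 = time.time()
            _ = arr[:]
            t_seq = time.time() - t0

            # Random access: many small slices scattered over the array.
            idx = np.random.randint(0, N - 100, size=1000)
            t0 = time.time()
            for i in idx:
                _ = arr[i:i + 100]
            t_rand = time.time() - t0

        print(f'{chunk_kb:4d} KB  create={t_create:.2f}s  seq={t_seq:.2f}s  '
              f'rand={t_rand:.2f}s  file={size_mb:.1f} MB')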
Hope that helps,
Attachments: chunksize-zlib-shuffle.out (10.8 KB), optimal-chunksize.py (3.15 KB)
--
0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"
Thanks a lot for the input. I know about sequential access, but my data are to
be read/written into/from memory all at once, at the highest possible speed.
OK, the 128 KB (for a 2 GB dataset) figure based on your experiments is a very
useful piece of knowledge. Thanks a lot!
-- Dominik
--
Dominik Szczerba, Ph.D.
Computer Vision Lab CH-8092 Zurich
http://www.vision.ee.ethz.ch/~domi