On Monday 12 April 2010 18:40:33, Stamminger, Johannes wrote:
> Fine to have some more feedback! 
>
>
> Did the attached spreadsheet make its way to the forum? Or was it
> filtered?
Yes, it made it into the list.
Great, and good to know - this makes things much easier than having to
provide attachments through some different channel!
It is a small file and OpenOffice can open it easily, so I suppose that it is
fine if you send more of these (although if you could come up with a PDF file
that would be better).
... as it is created using OO - remember I'm running Linux.
In principle you are absolutely correct. But the sheet contains calculations,
and in a PDF no one could verify them any longer (I have already found wrong
cell references inside them several times - until now only *before* having
posted it). This way anyone wondering can check them ... so I personally
would prefer to stay with the OO .xls for best compatibility ... ?
> > Don't know about this one, but this dramatic loss in performance when
> > passing from a 64 KB to a 128 KB chunksize is certainly strange. It would
> > be nice if you could build a small benchmark showing this performance
> > problem and send it to the HDF Group for further analysis.
>
> I may be able to extract this test with some small effort. But it is Java
> then, wrapping the native shared libraries. And it is *not* hdf-java, as
> that does not support H5PT; it uses JNA for that purpose.
>
> Still interested?
I suppose so, but you should ask the THG helpdesk just to be sure 
I will contact them about this then, too.
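For what it's worth, a standalone C version of such a benchmark can stay
quite small. A minimal sketch, assuming a fixed 16384-byte array per packet
and taking chunk size and deflate level from the command line (file and
dataset names are just placeholders, error checking omitted):

/* Minimal H5PT benchmark sketch: append NRECORDS fixed-size packets for a
   given chunk size / deflate level and report the elapsed time.
   Link against the HDF5 high-level library (-lhdf5_hl -lhdf5). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "hdf5.h"
#include "hdf5_hl.h"

#define RECORD_BYTES 16384   /* fixed packet payload, as in the tests above */
#define NRECORDS     50000   /* total packets to append */
#define BATCH        1000    /* packets handed to each H5PTappend call */

int main(int argc, char **argv)
{
    hsize_t chunk = (argc > 1) ? (hsize_t)atoll(argv[1]) : 32;  /* in packets */
    int     level = (argc > 2) ? atoi(argv[2]) : 4;  /* -1 = none, 0-9 = deflate */

    hid_t fid = H5Fcreate("pt_bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* one packet = a fixed array of RECORD_BYTES bytes */
    hsize_t dims[1] = { RECORD_BYTES };
    hid_t ptype = H5Tarray_create2(H5T_NATIVE_UCHAR, 1, dims);

    hid_t table = H5PTcreate_fl(fid, "packets", ptype, chunk, level);

    /* zeroed dummy payload; swap in real data to mimic real compressibility */
    unsigned char *buf = calloc(BATCH, RECORD_BYTES);
    time_t t0 = time(NULL);
    for (int i = 0; i < NRECORDS / BATCH; i++)
        H5PTappend(table, BATCH, buf);
    H5PTclose(table);                       /* flush before taking the time */
    printf("chunk=%lld level=%d -> %.0f s\n",
           (long long)chunk, level, difftime(time(NULL), t0));

    free(buf);
    H5Tclose(ptype);
    H5Fclose(fid);
    return 0;
}

Timing the same appends for a couple of chunk sizes on either side of the
problematic value should show whether the drop comes from the library itself
rather than from the JNA layer.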
> I'm still measuring - but I was surprised again by the findings. E.g. with
> arrays of size 16384 it seems best to use chunksize 32, compression level 4
> and to write as many arrays as possible (maybe there is an upper limit that
> I did not reach yet) with a single call to H5PTappend. With that I get the
> data written in 217 s to a file of size 160 MB.
>
> The data is the same as I used for writing the strings, but now without the
> conversion to hex strings: 468 MB in sum. With the overhead of the fixed
> length arrays the total data written to the file is 16.2 GB in size (the
> overhead bytes are zeroed). With the latter in mind the resulting file size
> of 160 MB is quite understandable. But compared with writing the same data
> to a zip with on-the-fly compression it is not, as that leads to 50 MB in
> 65 s (with no performance tuning like writing data in blocks etc.) ...
Well, 160 MB wrt 468 MB is quite fine. Indeed, zip compresses better here for
a number of reasons. The first is that zip is probably using larger block
sizes. In addition, HDF5 is designed to be able to access each chunk
directly, not sequentially, so it has to add some overhead (in the form of a
B-tree) to quickly locate chunks; you cannot (as far as I know) do the same
with zip. Finally, keep in mind that you are actually compressing 16.2 GB
instead of 468 MB. And although most of the 16.2 GB are zeros, the compressor
still has to walk, chew and code them. So you can never expect to get the
same speed/compression ratio as zip in this scenario.
You are absolutely correct, and I'm aware of the benefits of having the
improved data access.
Unfortunately the app user does not see this and will complain if the
application runs 10 times slower and creates 10 times bigger files :-(.
Therefore I *must* reach comparable times and sizes. The files may grow
"a bit", but not by whole factors.
And as I already wrote later in this thread, it does seem realistic: by using
multiple packet tables with different fixed array sizes in parallel (to
reduce the zeros overhead and to save the time spent compressing and writing
the zeros). But then I will additionally have to maintain a dataset with
references to those arrays in the correct order. Hopefully this does not
increase the file size that much ...
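One possible shape for that ordering dataset - just a sketch of the idea, not
necessarily the best one: a small compound record per blob, kept in its own
compressed packet table (the names and field widths below are made up):

#include "hdf5.h"
#include "hdf5_hl.h"

/* One record of the ordering table: which size-class packet table a blob
   went to, and at which row inside that table. */
typedef struct {
    unsigned char      table;   /* 0, 1, 2, ... = index of the size-class table */
    unsigned long long row;     /* row of the blob inside that table */
} order_rec_t;

int main(void)
{
    hid_t fid = H5Fcreate("order_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hid_t rec = H5Tcreate(H5T_COMPOUND, sizeof(order_rec_t));
    H5Tinsert(rec, "table", HOFFSET(order_rec_t, table), H5T_NATIVE_UCHAR);
    H5Tinsert(rec, "row",   HOFFSET(order_rec_t, row),   H5T_NATIVE_ULLONG);

    /* its own compressed packet table; chunk size is again counted in packets */
    hid_t order = H5PTcreate_fl(fid, "blob_order", rec, 4096, 4);

    /* append one record per blob, in the original blob order */
    order_rec_t r = { 1, 0 };       /* e.g. blob 0 landed in table 1, row 0 */
    H5PTappend(order, 1, &r);

    H5PTclose(order);
    H5Tclose(rec);
    H5Fclose(fid);
    return 0;
}

At a few bytes per blob (versus a mean blob size of about 465 bytes) the
bookkeeping itself should stay small; the main cost is one more chunked
dataset with its own B-tree.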
But I'm curious when you say that you were converting data to hex strings.
Why were you doing so?
This was just a workaround for my problems with writing variable-length
binary data. In fact I have a series of variable-length binary blobs (max
possible size 2^16 bytes; in my test data the mean size is 465 bytes and the
max size 1435 bytes). As I failed to write them as variable-length binary
data (see the thread "Varying length binary data"), I used the hex strings as
a workaround - just for the performance measurements.
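Just for reference, the plain C-level pattern for such blobs would be a
dataset whose element type is a vlen of uint8 - whether that maps cleanly
through JNA is of course a separate question (names below are placeholders,
error checking omitted):

/* Sketch: a chunked, extendible 1-D dataset whose elements are
   variable-length byte blobs (a vlen of uint8).  Note that, as far as I
   know, dataset filters never see the blob bytes themselves (vlen payloads
   live in the global heap), so deflate would only compress the small
   per-element descriptors and is left out here. */
#include "hdf5.h"

int main(void)
{
    hid_t fid = H5Fcreate("vlen_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hid_t blob_t = H5Tvlen_create(H5T_NATIVE_UCHAR);   /* variable-length bytes */

    hsize_t dims[1]    = { 2 };
    hsize_t maxdims[1] = { H5S_UNLIMITED };
    hsize_t chunk[1]   = { 1024 };
    hid_t space = H5Screate_simple(1, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);                 /* required for extendible data */

    hid_t dset = H5Dcreate2(fid, "blobs", blob_t, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* two blobs of different length, written in one call */
    unsigned char a[3] = { 1, 2, 3 }, b[5] = { 9, 8, 7, 6, 5 };
    hvl_t buf[2] = { { 3, a }, { 5, b } };
    H5Dwrite(dset, blob_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Tclose(blob_t); H5Fclose(fid);
    return 0;
}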
If your data are typically ints or floats, you may want to use the shuffle
filter in combination with zlib. In many circumstances, shuffle may buy you a
significant additional compression ratio. This is something that zip cannot
do (it can only compress streams of bytes, as it has no notion of
ints/floats).
Good to know. Do you know whether this is shown in an example?
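The pattern itself is small - a minimal sketch, with shuffle set on the
dataset creation property list before deflate so the filters run in that
order (the names and sizes below are placeholders):

/* Sketch: chunked dataset of ints with shuffle + deflate.  The shuffle
   filter regroups the bytes of the fixed-size elements before zlib sees
   them, which often improves the compression ratio for ints/floats. */
#include "hdf5.h"

int main(void)
{
    hid_t fid = H5Fcreate("shuffle_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1]  = { 16384 };
    hsize_t chunk[1] = { 4096 };
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_shuffle(dcpl);          /* byte-shuffle each chunk ... */
    H5Pset_deflate(dcpl, 4);       /* ... then compress it with zlib level 4 */

    hid_t dset = H5Dcreate2(fid, "ints", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    int buf[16384];
    for (int i = 0; i < 16384; i++) buf[i] = i % 100;   /* some dummy data */
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(fid);
    return 0;
}

Note that shuffle helps for fixed-size multi-byte elements like ints and
floats; for raw byte blobs there is no element structure to regroup, so it
will likely not buy much there.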
> With a big chunksize both performance and file size degrade by a large
> factor. The worst example was 819 KB of data leading to a file of 513 MB
> (50 arrays of 16384 bytes each, compression 0, chunk size 32K).
Uh, you lost me. What is 819 KB, the chunksize?
Oh no, that is the size of the data written to the HDF5 file: 50 arrays of
16384 bytes each (50 x 16384 bytes = 819 KB), written with compression
level 0 and chunk size 32K.
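If I read the H5PT API correctly, the chunk_size argument of H5PTcreate_fl is
counted in packets, not bytes. With 16384-byte packets a "32K" chunk would
then be 32768 x 16384 bytes = 512 MiB per chunk, and with no effective
compression the single chunk touched by the 50 appends is allocated in full,
which would roughly match the 513 MB file. A sketch of the two settings side
by side (names are placeholders; note that running it writes a ~512 MB file):

/* Sketch: why chunk "32K" blows the file up when the packets are 16384-byte
   arrays.  The chunk_size argument of H5PTcreate_fl counts packets, so a
   32768-packet chunk of 16 KiB packets is a 512 MiB chunk on disk; with no
   compression the one chunk touched by 50 appends is allocated in full. */
#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hid_t fid = H5Fcreate("chunk_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { 16384 };
    hid_t ptype = H5Tarray_create2(H5T_NATIVE_UCHAR, 1, dims);

    /* 32 packets/chunk = 512 KiB chunks ... */
    hid_t small = H5PTcreate_fl(fid, "chunk_32",  ptype, 32,    -1);
    /* ... versus 32768 packets/chunk = 512 MiB chunks (-1 = no compression) */
    hid_t big   = H5PTcreate_fl(fid, "chunk_32k", ptype, 32768, -1);

    unsigned char rec[16384] = { 0 };
    for (int i = 0; i < 50; i++) {        /* 50 arrays, ~819 KB of payload */
        H5PTappend(small, 1, rec);
        H5PTappend(big,   1, rec);
    }

    H5PTclose(small); H5PTclose(big);
    H5Tclose(ptype);  H5Fclose(fid);
    return 0;
}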
Thanks for all the hints,
Johannes Stamminger
···
On Mon, 2010-04-12 at 20:46 +0200, Francesc Alted wrote: