Hi Peter,
> Kim, how are you? Out of curiosity, I would like to clarify something.
I am French and work in IT. I am working on a parallel implementation of CRF
(http://en.wikipedia.org/wiki/Conditional_random_field) learning, just for fun.
> My naïve impression was that, if your keys are French words, the dataset
> can't be very large.
> Let's say that a highly educated English speaker has about 300,000 words in
> his/her vocabulary.
> Let's give a French speaker 500,000 words. Add two integers (8 bytes per
> entry) as payload.
In this hash table I have one entry per feature in my corpus, and my corpus
is huge: an extract of the French Wikipedia. The example in my last email was
a bad one.
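For illustration: CRF features are typically generated from templates over
words and labels, so the number of distinct keys is far larger than the
vocabulary itself. A minimal sketch of such keys (the template names and key
format below are made up for this example, not taken from my code):

    // Illustrative only: how CRF feature keys are commonly built, which is why
    // a 500,000-word vocabulary can still yield tens of millions of entries.
    import java.util.ArrayList;
    import java.util.List;

    public class FeatureKeys {
        // Feature strings for one token position from a few simple templates.
        // Template names (U00, U01, B) are hypothetical placeholders.
        static List<String> features(String prevWord, String word,
                                     String label, String prevLabel) {
            List<String> keys = new ArrayList<>();
            keys.add("U00:" + word + "|" + label);       // current word + label
            keys.add("U01:" + prevWord + "|" + label);   // previous word + label
            keys.add("B:" + prevLabel + ">" + label);    // label transition
            return keys;
        }

        public static void main(String[] args) {
            System.out.println(features("la", "maison", "NOUN", "DET"));
            // Every distinct combination becomes its own hash-table entry.
        }
    }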
> in-memory hashtable/dictionary lookup (even with a generic hash function
> that doesn't speak French).
> Why can't you hold that in memory?
My original implementation used an in-memory hash table, but this hash table
needed more than my 12 GB of memory. My second implementation used a cached
hash table with Ehcache; with this cache I can control the memory usage, but
the read performance was very poor, and the learning takes more than 10 days
with two cluster nodes.
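For a rough sense of scale (the entry count below is an assumed figure, not
one measured from my corpus), the per-entry overhead of a plain
java.util.HashMap<String, long[]> adds up quickly on a 64-bit JVM:

    // Back-of-the-envelope footprint for HashMap<String, long[2]>.
    // All per-object sizes are approximations; the entry count is assumed.
    public class HashMapFootprint {
        public static void main(String[] args) {
            long entries    = 100_000_000L; // assumed number of CRF features
            long entryNode  = 32;           // HashMap.Node: header, hash, three refs
            long keyString  = 80;           // String + backing char[] for a ~15-char key
            long valueArray = 32;           // long[2]: array header + two 8-byte longs
            long tableSlot  = 8;            // reference held in the bucket array
            long perEntry   = entryNode + keyString + valueArray + tableSlot;
            double totalGiB = entries * perEntry / (1024.0 * 1024 * 1024);
            System.out.printf("~%d bytes/entry, ~%.1f GiB total%n", perEntry, totalGiB);
        }
    }

At anywhere near that many features, the table outgrows a 12 GB heap.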
> Is your question that you don't want to regenerate that hashtable all the
> time, and that's why you'd like to store it on disk in HDF5?
> (Again, HDF5 has no DBMS like query engine and I don't see why you'd need
> that.)
My issue is not serialization; it is reading this hash table when it is too
large to fit in memory. I tried a DBMS, but the performance was even worse
than the cached hash table; an index on the word is not discriminating enough.
I hope that a binary file will help me. I want a Java object with an Add and a
GetValueByKey method. The most important thing for me is read performance,
because this structure will be read over a very long run (more than 10 days
even with the in-memory hash table).
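A minimal sketch of the interface I have in mind, assuming the two integer
counters mentioned earlier as the payload; the storage behind it (HDF5, a
flat binary file, or something else) is deliberately left abstract:

    // Minimal sketch of the read-optimized lookup structure described above.
    // Only the two method names come from this email; the rest is an assumption.
    public interface FeatureTable {
        // "Add": store the two counters for a feature key.
        void add(String key, long a, long b);

        // "GetValueByKey": return the two counters, or null if the key is absent.
        long[] getValueByKey(String key);
    }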
Thanks for your help, and excuse my bad English.
Regards
2011/7/19 Gerd Heber <gheber@hdfgroup.org>:
> Kim, how are you? Out of curiosity, I would like to clarify something.
> My naïve impression was that, if your keys are French words, the dataset
> can't be very large.
> Let's say that a highly educated English speaker has about 300,000 words in
> his/her vocabulary.
> Let's give a French speaker 500,000 words. Add two integers (8 bytes per
> entry) as payload.
> Now, I don't know what the histogram for French word length looks like.
> Most words are probably less than 16-20 characters long (with a long flat
> tail).
> So even at 100 bytes per entry, we'd be looking at a ~100 MB hash table.
> (Even on a smart phone that's not a terrible size.)
> No matter how you organize the dataset(s) in HDF5 on disk, you are not going
> to beat an in-memory hashtable/dictionary lookup (even with a generic hash
> function that doesn't speak French).
> Why can't you hold that in memory?
> Is your question that you don't want to regenerate that hashtable all the
> time, and that's why you'd like to store it on disk in HDF5?
> (Again, HDF5 has no DBMS like query engine and I don't see why you'd need
> that.)
> Best, G.