Hi Peter,
> Kim, how are you? Out of curiosity, I would like to clarify something.
I am French and work in IT. I am working on a parallel implementation of CRF
(http://en.wikipedia.org/wiki/Conditional_random_field) learning, just for fun.
> My naïve impression was that, if your keys are French words, the dataset
> can't be very large.
> Let's say that a highly educated English speaker has about 300,000 words in
> his/her vocabulary.
> Let's give a French speaker 500,000 words. Add two integers (8 bytes per
> entry) as payload.
In this hash table I have one entry per feature in my corpus, and my corpus
is huge: an extract of the French Wikipedia. The example in my last email was
a bad one.
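For illustration: CRF features are typically generated from templates over
words and labels, so the number of distinct keys is far larger than the
vocabulary itself. A minimal sketch of such keys (the template names and key
format below are made up for this example, not taken from my code):

    // Illustrative only: how CRF feature keys are commonly built, which is why
    // a 500,000-word vocabulary can still yield tens of millions of entries.
    import java.util.ArrayList;
    import java.util.List;

    public class FeatureKeys {
        // Feature strings for one token position from a few simple templates.
        // Template names (U00, U01, B) are hypothetical placeholders.
        static List<String> features(String prevWord, String word,
                                     String label, String prevLabel) {
            List<String> keys = new ArrayList<>();
            keys.add("U00:" + word + "|" + label);       // current word + label
            keys.add("U01:" + prevWord + "|" + label);   // previous word + label
            keys.add("B:" + prevLabel + ">" + label);    // label transition
            return keys;
        }

        public static void main(String[] args) {
            System.out.println(features("la", "maison", "NOUN", "DET"));
            // Every distinct combination becomes its own hash-table entry.
        }
    }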
> in-memory hashtable/dictionary lookup (even with a generic hash function
> that doesn't speak French).
> Why can't you hold that in memory?
My original implementation used an in-memory hash table, but this hash table
needed more than my 12 GB of memory. My second implementation used a cached
hash table with Ehcache; with this cache I can control the memory usage, but
the read performance was very poor, and the learning takes more than 10 days
with two cluster nodes.
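For a rough sense of scale (the entry count below is an assumed figure, not
one measured from my corpus), the per-entry overhead of a plain
java.util.HashMap<String, long[]> adds up quickly on a 64-bit JVM:

    // Back-of-the-envelope footprint for HashMap<String, long[2]>.
    // All per-object sizes are approximations; the entry count is assumed.
    public class HashMapFootprint {
        public static void main(String[] args) {
            long entries    = 100_000_000L; // assumed number of CRF features
            long entryNode  = 32;           // HashMap.Node: header, hash, three refs
            long keyString  = 80;           // String + backing char[] for a ~15-char key
            long valueArray = 32;           // long[2]: array header + two 8-byte longs
            long tableSlot  = 8;            // reference held in the bucket array
            long perEntry   = entryNode + keyString + valueArray + tableSlot;
            double totalGiB = entries * perEntry / (1024.0 * 1024 * 1024);
            System.out.printf("~%d bytes/entry, ~%.1f GiB total%n", perEntry, totalGiB);
        }
    }

At anywhere near that many features, the table outgrows a 12 GB heap.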
> Is your question that you don't want to regenerate that hashtable all the
> time, and that's why you'd like to store it on disk in HDF5?
> (Again, HDF5 has no DBMS like query engine and I don't see why you'd need
> that.)
My issue is not serialization; it is reading this hash table when it is too
large to fit in memory. I tried a DBMS, but the performance was even worse
than the cached hash table; an index on the word is not discriminating enough.
I hope that a binary file will help me. I want a Java object with an Add and a
GetValueByKey method. The most important thing for me is read performance,
because this structure will be read over a very long run (more than 10 days
even with the in-memory hash table).
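A minimal sketch of the interface I have in mind, assuming the two integer
counters mentioned earlier as the payload; the storage behind it (HDF5, a
flat binary file, or something else) is deliberately left abstract:

    // Minimal sketch of the read-optimized lookup structure described above.
    // Only the two method names come from this email; the rest is an assumption.
    public interface FeatureTable {
        // "Add": store the two counters for a feature key.
        void add(String key, long a, long b);

        // "GetValueByKey": return the two counters, or null if the key is absent.
        long[] getValueByKey(String key);
    }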
Thanks for your help, and excuse my bad English.
Regards
2011/7/19 Gerd Heber <gheber@hdfgroup.org>:
> Kim, how are you? Out of curiosity, I would like to clarify something.
> My naïve impression was that, if your keys are French words, the dataset
> can't be very large.
> Let's say that a highly educated English speaker has about 300,000 words in
> his/her vocabulary.
> Let's give a French speaker 500,000 words. Add two integers (8 bytes per
> entry) as payload.
> Now, I don't know what the histogram for French word length looks like.
> Most words are probably less than 16-20 characters long (with a long flat
> tail).
> So even at 100 bytes per entry, we'd be looking at a ~100 MB hash table.
> (Even on a smart phone that's not a terrible size.)
> No matter how you organize the dataset(s) in HDF5 on disk, you are not going
> to beat an in-memory hashtable/dictionary lookup (even with a generic hash
> function that doesn't speak French).
> Why can't you hold that in memory?
> Is your question that you don't want to regenerate that hashtable all the
> time, and that's why you'd like to store it on disk in HDF5?
> (Again, HDF5 has no DBMS like query engine and I don't see why you'd need
> that.)
> Best, G.