Using H5TB for large tables

Hi!

The concrete questions are posted below. But first, some details.

I am building a simple 2D table database (where the column data types can
differ) with the high-level HDF5 table API (H5TB). The table only has to
support simple operations such as creating/reading/updating/deleting rows
and columns (mainly single row/column-based operations).

Since the column data types are not fixed, I am basically constructing the
appropriate struct-like object by hand (manually computing the offsets,
sizes, etc. of the fields) and passing that to the H5TB API.
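
For illustration, a minimal sketch of this kind of runtime construction
(the field names and types are made up, error checking is omitted, so this
is only an example of the pattern, not my actual code):

#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    /* field layout decided at runtime (names/types here are just examples) */
    const hsize_t nfields  = 3;
    const char   *names[3] = { "id", "value", "flag" };
    hid_t         types[3] = { H5T_NATIVE_INT, H5T_NATIVE_DOUBLE, H5T_NATIVE_CHAR };

    /* compute packed field offsets and the total record size by hand */
    size_t offsets[3];
    size_t record_size = 0;
    for (hsize_t i = 0; i < nfields; i++) {
        offsets[i]   = record_size;
        record_size += H5Tget_size(types[i]);
    }

    hid_t file = H5Fcreate("table.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* create an empty table; rows are added later with H5TBappend_records */
    H5TBmake_table("my table", file, "dset", nfields, 0, record_size,
                   names, offsets, types,
                   100 /* chunk size, in records */, NULL, 0, NULL);

    H5Fclose(file);
    return 0;
}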

After testing my implementation, I found that some of the single row/column
operations are very slow. I think this is mainly due to two reasons:
1. The H5TB API has to open and close the dataset on every operation.
2. It has to construct the record (row) memory data type on every operation.

I think the second point severely affects the performance as the number of
columns grows.
For example, for reading single rows:
With 100 columns (compound type with 100 members) I get a speed of ~700
rows/sec.
With 1000 columns, the speed is ~15 rows/sec.
With 10000 columns, I am unable to create the table and get error messages
like: "H5D__update_oh_info(): unable to update datatype header message" and
"H5O_alloc(): object header message is too large".

So, finally, here are my questions:
1. Is my design fundamentally flawed (was the H5TB API not intended for this
purpose?), or am I just doing something wrong?
2. Would the performance problems go away if I did not close the dataset and
did not reconstruct the record data type on every operation (e.g. by writing
an optimized version of the H5TB API, something like what the PyTables
library does)? See the sketch after this list.
3. Is there an alternative to constructing a compound data type, so that I
can create tables with a million columns?
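
To make question 2 concrete, here is a rough sketch of the kind of optimized
read path I have in mind (and, as far as I understand, roughly what PyTables
does): the dataset handle and the in-memory compound type are built once when
the table is opened, and single rows are then read with a hyperslab
selection. The function and the handles it takes are of course hypothetical:

#include "hdf5.h"

/* Read record number `row` from an already-open table dataset into `buf`.
 * `dset_id` and the in-memory compound type `mem_type_id` are created once
 * when the table is opened and reused for every call. */
static herr_t read_one_row(hid_t dset_id, hid_t mem_type_id,
                           hsize_t row, void *buf)
{
    hsize_t count      = 1;
    hid_t   file_space = H5Dget_space(dset_id);  /* 1-D, one element per record */
    hid_t   mem_space  = H5Screate_simple(1, &count, NULL);

    /* select exactly one record in the file and read it into the buffer */
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, &row, NULL, &count, NULL);
    herr_t status = H5Dread(dset_id, mem_type_id, mem_space, file_space,
                            H5P_DEFAULT, buf);

    H5Sclose(mem_space);
    H5Sclose(file_space);
    return status;
}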

Regards,
Reimo Rebane


Dear Reimo,

To answer your first question: I use the H5PT interface quite a lot, also on
big tables (>> 100,000 rows), without any problems or speed penalty. You have
to make sure the chunk_size is not too small, but remember that the
chunk_size is expressed in records, not in bytes.

What worries me is that you write "the column data types are not fixed". What
do you mean by that? You are working with tables, so all columns have the
same number of rows. You can store arrays or compound data in a column, but
the record size must be the same for every row. I therefore construct an
array of structures and write it in one single operation; this is also the
fastest way to read the data back.
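
For illustration, a minimal sketch of that pattern with an assumed two-field
record (not taken from your code): the chunk size is given in records, and a
whole array of structures is appended in one call:

#include "hdf5.h"
#include "hdf5_hl.h"

typedef struct {
    int    id;
    double value;
} record_t;

int main(void)
{
    hid_t file = H5Fcreate("packets.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* build the compound type matching the C struct */
    hid_t rec_type = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(rec_type, "id",    HOFFSET(record_t, id),    H5T_NATIVE_INT);
    H5Tinsert(rec_type, "value", HOFFSET(record_t, value), H5T_NATIVE_DOUBLE);

    /* chunk size is 4096 *records*, not bytes; -1 = no compression */
    hid_t table = H5PTcreate_fl(file, "packets", rec_type, 4096, -1);

    /* write many records in a single operation */
    record_t buf[1000];
    for (int i = 0; i < 1000; i++) { buf[i].id = i; buf[i].value = i * 0.5; }
    H5PTappend(table, 1000, buf);

    H5PTclose(table);
    H5Tclose(rec_type);
    H5Fclose(file);
    return 0;
}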

Greetings, Richard


Hi,

Thanks for the answer. I wasn't aware that the chunk_size is expressed in
records rather than bytes.

A few clarifications. What I mean by "the column data types are not fixed" is
that the fields (what I referred to as "columns") of the compound type are
not known at compile time (there is no predefined C struct); they are only
fixed at runtime, when the table is created.

The problem I have is that when I create a large compound type with many
fields, the performance degrades quite substantially. Beyond a certain size,
it refuses to build the compound type at all.
