Why is the chunk_size argument of H5TBmake_table() a scalar type (hsize_t)?

Hello,

My program that produces the data is written in C++, and I should be able to link against the high-level C API library H5TB.
The table has 'fields' (columns) of mixed types (e.g. uint64, float, etc.). It can have up to 80 columns and 10,000 rows, but it should be able to scale to larger dimensions. The dimensions of the table are fixed for each production run.

On the consumer side, I'm considering PyPandas or PyTables. The consumer-side program needs to apply numeric functions along each column, so storing the columns rather than the rows in contiguous space is much more efficient for it. Performance is more critical on the consumer side, as it aggregates output from multiple producer programs.

I'm considering the H5TB high-level API, along with the block-write approach (i.e., writing a fixed number of rows at a time) proposed by Darryl on this forum: http://hdf-forum.184993.n3.nabble.com/hdf-forum-Efficient-Way-to-Write-Compound-Data-td193448.html#a193447.

On a write to the file, I would like each column, rather than each row, of a block to be stored contiguously in the HDF5 file, assuming this will help performance when PyTables on the consumer side accesses the table by column (not by row).

The example code on chunking at http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/ shows that chunk_dims[2] has 2 elements. For example, if the block has 1000 rows, I would use chunk_dims[2] = {1000, 1} so that the 1000 rows of each column are stored in one contiguous piece of memory.

Does H5TBmake_table() support such chunking dimensions, and if so, what syntax would I use?
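For reference, a minimal call of the kind I have in mind looks roughly like the sketch below (the two-field record layout is just an illustration of my mixed-type case); I don't see where a 2-element chunk_dims could go, since chunk_size is a single hsize_t:

#include "hdf5.h"
#include "hdf5_hl.h"
#include <stddef.h>
#include <stdint.h>

/* Illustrative record layout: one uint64 column and one float column. */
typedef struct {
    uint64_t id;
    float    value;
} record_t;

int main(void)
{
    size_t      dst_size       = sizeof(record_t);
    size_t      dst_offset[2]  = { HOFFSET(record_t, id), HOFFSET(record_t, value) };
    hid_t       field_types[2] = { H5T_NATIVE_UINT64, H5T_NATIVE_FLOAT };
    const char *field_names[2] = { "id", "value" };
    hsize_t     chunk_size     = 1000;   /* scalar: number of records per chunk */

    hid_t file_id = H5Fcreate("table.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* chunk_size is an hsize_t, i.e. a chunk is always some number of whole
       rows; there is no 2-element chunk_dims argument in this interface. */
    H5TBmake_table("Producer table", file_id, "table", 2, 0, dst_size,
                   field_names, dst_offset, field_types,
                   chunk_size, NULL, 0, NULL);

    H5Fclose(file_id);
    return 0;
}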

Thanks!

Hi Joe,

You are out of luck here….

The HDF5 TB interface treats a table as a 1-dimensional array of elements that have a compound datatype (i.e., the table fields are the fields of the HDF5 compound datatype). You cannot have a chunk that is a column of the table; a chunk always contains some number of whole rows. If you want to use PyTables, you may want to look at its array object instead of the table and store each column independently.
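For example, on the producer side you could create one extendible, chunked 1-D dataset per column with the core HDF5 C API instead of H5TB. A rough sketch, where the chunk size is only a placeholder:

#include "hdf5.h"

/* Sketch: store one column as its own chunked, extendible 1-D dataset,
   so the consumer can read that column without touching the others. */
hid_t make_column(hid_t file_id, const char *name, hid_t type, hsize_t chunk_rows)
{
    hsize_t dims[1]    = { 0 };             /* start empty                   */
    hsize_t maxdims[1] = { H5S_UNLIMITED }; /* grow as rows are appended     */
    hsize_t chunk[1]   = { chunk_rows };    /* e.g. 1000 rows per chunk      */

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate2(file_id, name, type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}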

Do you have any data suggesting that reading a column from the table is really slow? If so, you should experiment with the chunk size and chunk cache size to tune the performance.
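For example, the per-dataset chunk cache can be enlarged through a dataset access property list when the table is opened from C; the numbers below are only a starting point for experiments:

#include "hdf5.h"

/* Sketch: open a dataset with a larger chunk cache so column-wise reads
   do not keep evicting and re-reading the same chunks. Sizes are examples. */
hid_t open_with_cache(hid_t file_id, const char *dset_name)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);

    /* 12421 hash slots (a prime), 64 MiB of raw chunk cache,
       w0 = 1.0 to favor evicting fully read chunks first. */
    H5Pset_chunk_cache(dapl, 12421, 64 * 1024 * 1024, 1.0);

    hid_t dset = H5Dopen2(file_id, dset_name, dapl);
    H5Pclose(dapl);
    return dset;
}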

Elena


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
