I am writing a Python script to write a table to an HDF5 file. Based on some quick googling, the pandas library seemed like an easy way to accomplish this. The code is as follows. The function is called in a loop, sending the data to it in sections since all of the data cannot be held in memory at the same time (hence the 'first_time' flag):
def write_to_hdf(data, filename, first_time):
    from pandas import DataFrame
    data_frame = DataFrame.from_dict(data)
    # save to hdf5
    if first_time:
        data_frame.to_hdf(filename, 'data', mode='w', format='table', append=True)
    else:
        data_frame.to_hdf(filename, 'data', append=True)
    # allow data frame to be garbage collected
    del data_frame
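For context, the loop that drives this function looks roughly like the sketch below; read_chunks here is just a stand-in for my real chunked data source, so the names and values are placeholders:

def read_chunks():
    # Stand-in for the real data source: yields one manageable dict of
    # column -> values at a time so the full dataset never sits in memory.
    yield {'TIME': [0.0, 0.1], '$EP': [1, 2], '$SYSID': [10, 10]}
    yield {'TIME': [0.2, 0.3], '$EP': [3, 4], '$SYSID': [10, 10]}

first_time = True
for chunk in read_chunks():
    write_to_hdf(chunk, r'C:\Data\hdf-export.h5', first_time)
    first_time = False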
This seems to work fine. However, upon inspecting the HDF5 file, I saw some things I didn't expect. Having never worked with HDF5 tables before, I expected to see a dataset named 'data' with a compound type containing one member for each field in my data frame. My example table has 13,403 rows and three columns: TIME, $EP, and $SYSID. The HDF5 file looks like this when viewed with h5disp from Matlab:
h5disp('C:\Data\hdf-export.h5')
HDF5 hdf-export.h5
Group '/'
Attributes:
'TITLE': ''
'CLASS': 'GROUP'
'VERSION': '1.0'
'PYTABLES_FORMAT_VERSION': '2.1'
Group '/data'
Attributes:
'TITLE': ''
'CLASS': 'GROUP'
'VERSION': '1.0'
'pandas_type': 'frame_table'
'pandas_version': '0.10.1'
'table_type': 'appendable_frame'
'index_cols': '(lp1
(I0
S'index'
p2
tp3
a.'
'values_cols': '(lp1
S'values_block_0'
p2
aS'values_block_1'
p3
a.'
'non_index_axes': '(lp1
(I1
(lp2
S'$EP'
p3
aS'$SYSID'
p4
aS'TIME'
p5
atp6
a.'
'data_columns': '(lp1
.'
'nan_rep': 'nan'
'encoding': 'N.'
'levels': 1
'info': '(dp1
I1
(dp2
S'type'
p3
S'Index'
p4
sS'names'
p5
(lp6
NassS'index'
p7
(dp8
s.'
Dataset 'table'
Size: 13403
MaxSize: Inf
Datatype: H5T_COMPOUND
Member 'index': H5T_STD_I64LE (int64)
Member 'values_block_0': H5T_ARRAY
Size: 1
Base Type: H5T_IEEE_F64LE (double)
Member 'values_block_1': H5T_ARRAY
Size: 2
Base Type: H5T_STD_I64LE (int64)
ChunkSize: 2048
Filters: none
Attributes:
'CLASS': 'TABLE'
'VERSION': '2.7'
'TITLE': ''
'FIELD_0_NAME': 'index'
'FIELD_1_NAME': 'values_block_0'
'FIELD_2_NAME': 'values_block_1'
'FIELD_0_FILL': 0
'FIELD_1_FILL': 0.000000
'FIELD_2_FILL': 0
'index_kind': 'integer'
'values_block_0_kind': '(lp1
S'TIME'
p2
a.'
'values_block_0_dtype': 'float64'
'values_block_1_kind': '(lp1
S'$EP'
p2
aS'$SYSID'
p3
a.'
'values_block_1_dtype': 'int64'
'NROWS': 13403
Group '/data/_i_table'
Attributes:
'TITLE': 'Indexes container for table /data/table'
'CLASS': 'TINDEX'
'VERSION': '1.0'
Group '/data/_i_table/index'
Attributes:
'TITLE': 'Index for index column'
'CLASS': 'INDEX'
'VERSION': '2.1'
'FILTERS': 65793
'superblocksize': 262144
'blocksize': 131072
'slicesize': 131072
'chunksize': 1024
'optlevel': 6
'reduction': 1
'DIRTY': 0
Dataset 'abounds'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'Start bounds'
'EXTDIM': 0
Dataset 'bounds'
Size: 127x0
MaxSize: 127xInf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 127x1
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'CACHEARRAY'
'VERSION': '1.1'
'TITLE': 'Boundary Values'
'EXTDIM': 0
Dataset 'indices'
Size: 131072x0
MaxSize: 131072xInf
Datatype: H5T_STD_U32LE (uint32)
ChunkSize: 1024x1
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'INDEXARRAY'
'VERSION': '1.1'
'TITLE': 'Number of chunk in table'
'EXTDIM': 0
Dataset 'indicesLR'
Size: 131072
MaxSize: 131072
Datatype: H5T_STD_U32LE (uint32)
ChunkSize: 1024
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'LASTROWARRAY'
'VERSION': '1.1'
'TITLE': 'Last Row indices'
'nelements': 13403
Dataset 'mbounds'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'Median bounds'
'EXTDIM': 0
Dataset 'mranges'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'Median ranges'
'EXTDIM': 0
Dataset 'ranges'
Size: 2x0
MaxSize: 2xInf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 2x4096
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'CACHEARRAY'
'VERSION': '1.1'
'TITLE': 'Range Values'
'EXTDIM': 0
Dataset 'sorted'
Size: 131072x0
MaxSize: 131072xInf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 1024x1
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'INDEXARRAY'
'VERSION': '1.1'
'TITLE': 'Sorted Values'
'EXTDIM': 0
Dataset 'sortedLR'
Size: 131201
MaxSize: 131201
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 1024
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'LASTROWARRAY'
'VERSION': '1.1'
'TITLE': 'Last Row sorted values + bounds'
'nelements': 13403
Dataset 'zbounds'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'End bounds'
'EXTDIM': 0
I see that the /data/table dataset has two array members, values_block_0 and values_block_1, that hold my data values. However, they are not named after the fields in my data frame. I need to be able to read the resulting HDF5 file from Matlab, and I also need to read it with the HDF5 Java object API for a separate application that I maintain. I don't see a straightforward way to even figure out what the field names in my original dataset are; they appear embedded inside larger serialized strings in some of the attributes, but nothing I can easily parse. In the HDF5 C API there are H5TB functions like H5TBread_fields_name that look like they would do this, but I don't see an equivalent in the Java API, and I don't see anything in Matlab's documentation either. (I'm using Matlab R2012b.)
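One direction I'm considering, based on my reading of the pandas to_hdf/HDFStore docs, is passing data_columns=True when writing; if I understand it correctly, that should make each data frame column its own named member of the compound type rather than packing same-dtype columns into values_block_N arrays (though I'm not sure how PyTables handles the '$' in my column names). A rough sketch of that variant:

import pandas as pd

def write_to_hdf(data, filename, first_time):
    data_frame = pd.DataFrame.from_dict(data)
    # data_columns=True asks pandas/PyTables to store each column as its own
    # named, queryable field in the table's compound type, instead of packing
    # same-dtype columns into anonymous values_block_N arrays.
    data_frame.to_hdf(filename, 'data',
                      mode='w' if first_time else 'a',
                      format='table', append=True, data_columns=True)
    del data_frame

If that works, reading the table from Matlab with h5read, or by field name through the Java object API, presumably becomes much simpler, but I'd still like to know how to correctly read files already written in the values_block layout.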
Any help with reading this table correctly from the HDF5 file in Matlab and/or through the Java object API is appreciated.
Thank you.