I am writing a Python script to write a table to an HDF5 file. Based on some quick googling, the pandas library seemed like an easy way to accomplish this. The code is as follows. The function is called in a loop, sending the data to it in sections since all of the data cannot be held in memory at the same time (hence the 'first_time' flag):
def write_to_hdf(data, filename, first_time):
    from pandas import DataFrame
    data_frame = DataFrame.from_dict(data)
    # save to hdf5
    if first_time:
        data_frame.to_hdf(filename, 'data', mode='w', format='table', append=True)
    else:
        data_frame.to_hdf(filename, 'data', append=True)
    # allow data frame to be garbage collected
    del data_frame
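For context, the loop that drives this function looks roughly like the sketch below; read_chunks here is just a stand-in for my real chunked data source, so the names and values are placeholders:

def read_chunks():
    # Stand-in for the real data source: yields one manageable dict of
    # column -> values at a time so the full dataset never sits in memory.
    yield {'TIME': [0.0, 0.1], '$EP': [1, 2], '$SYSID': [10, 10]}
    yield {'TIME': [0.2, 0.3], '$EP': [3, 4], '$SYSID': [10, 10]}

first_time = True
for chunk in read_chunks():
    write_to_hdf(chunk, r'C:\Data\hdf-export.h5', first_time)
    first_time = False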
This seems to work fine. However, upon inspecting the HDF5 file, I saw some things I didn't expect. Having never worked with HDF5 tables before, I expected to see a dataset named 'data' with a compound type containing one member for each field in my data frame. My example table has 13,403 rows and three columns: TIME, $EP, and $SYSID. The HDF5 file looks like this when viewed with h5disp from Matlab:
h5disp('C:\Data\hdf-export.h5')
HDF5 hdf-export.h5
Group '/'
Attributes:
'TITLE': ''
'CLASS': 'GROUP'
'VERSION': '1.0'
'PYTABLES_FORMAT_VERSION': '2.1'
Group '/data'
Attributes:
'TITLE': ''
'CLASS': 'GROUP'
'VERSION': '1.0'
'pandas_type': 'frame_table'
'pandas_version': '0.10.1'
'table_type': 'appendable_frame'
'index_cols': '(lp1
(I0
S'index'
p2
tp3
a.'
'values_cols': '(lp1
S'values_block_0'
p2
aS'values_block_1'
p3
a.'
'non_index_axes': '(lp1
(I1
(lp2
S'$EP'
p3
aS'$SYSID'
p4
aS'TIME'
p5
atp6
a.'
'data_columns': '(lp1
.'
'nan_rep': 'nan'
'encoding': 'N.'
'levels': 1
'info': '(dp1
I1
(dp2
S'type'
p3
S'Index'
p4
sS'names'
p5
(lp6
NassS'index'
p7
(dp8
s.'
Dataset 'table'
Size: 13403
MaxSize: Inf
Datatype: H5T_COMPOUND
Member 'index': H5T_STD_I64LE (int64)
Member 'values_block_0': H5T_ARRAY
Size: 1
Base Type: H5T_IEEE_F64LE (double)
Member 'values_block_1': H5T_ARRAY
Size: 2
Base Type: H5T_STD_I64LE (int64)
ChunkSize: 2048
Filters: none
Attributes:
'CLASS': 'TABLE'
'VERSION': '2.7'
'TITLE': ''
'FIELD_0_NAME': 'index'
'FIELD_1_NAME': 'values_block_0'
'FIELD_2_NAME': 'values_block_1'
'FIELD_0_FILL': 0
'FIELD_1_FILL': 0.000000
'FIELD_2_FILL': 0
'index_kind': 'integer'
'values_block_0_kind': '(lp1
S'TIME'
p2
a.'
'values_block_0_dtype': 'float64'
'values_block_1_kind': '(lp1
S'$EP'
p2
aS'$SYSID'
p3
a.'
'values_block_1_dtype': 'int64'
'NROWS': 13403
Group '/data/_i_table'
Attributes:
'TITLE': 'Indexes container for table /data/table'
'CLASS': 'TINDEX'
'VERSION': '1.0'
Group '/data/_i_table/index'
Attributes:
'TITLE': 'Index for index column'
'CLASS': 'INDEX'
'VERSION': '2.1'
'FILTERS': 65793
'superblocksize': 262144
'blocksize': 131072
'slicesize': 131072
'chunksize': 1024
'optlevel': 6
'reduction': 1
'DIRTY': 0
Dataset 'abounds'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'Start bounds'
'EXTDIM': 0
Dataset 'bounds'
Size: 127x0
MaxSize: 127xInf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 127x1
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'CACHEARRAY'
'VERSION': '1.1'
'TITLE': 'Boundary Values'
'EXTDIM': 0
Dataset 'indices'
Size: 131072x0
MaxSize: 131072xInf
Datatype: H5T_STD_U32LE (uint32)
ChunkSize: 1024x1
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'INDEXARRAY'
'VERSION': '1.1'
'TITLE': 'Number of chunk in table'
'EXTDIM': 0
Dataset 'indicesLR'
Size: 131072
MaxSize: 131072
Datatype: H5T_STD_U32LE (uint32)
ChunkSize: 1024
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'LASTROWARRAY'
'VERSION': '1.1'
'TITLE': 'Last Row indices'
'nelements': 13403
Dataset 'mbounds'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'Median bounds'
'EXTDIM': 0
Dataset 'mranges'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'Median ranges'
'EXTDIM': 0
Dataset 'ranges'
Size: 2x0
MaxSize: 2xInf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 2x4096
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'CACHEARRAY'
'VERSION': '1.1'
'TITLE': 'Range Values'
'EXTDIM': 0
Dataset 'sorted'
Size: 131072x0
MaxSize: 131072xInf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 1024x1
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'INDEXARRAY'
'VERSION': '1.1'
'TITLE': 'Sorted Values'
'EXTDIM': 0
Dataset 'sortedLR'
Size: 131201
MaxSize: 131201
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 1024
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'LASTROWARRAY'
'VERSION': '1.1'
'TITLE': 'Last Row sorted values + bounds'
'nelements': 13403
Dataset 'zbounds'
Size: 0
MaxSize: Inf
Datatype: H5T_STD_I64LE (int64)
ChunkSize: 8192
Filters: shuffle, deflate(1)
Attributes:
'CLASS': 'EARRAY'
'VERSION': '1.1'
'TITLE': 'End bounds'
'EXTDIM': 0
I see that the /data/table dataset has two array members, values_block_0 and values_block_1, that hold my data values. However, they are not named after the fields in my data frame. I need to be able to read the resulting HDF5 file from Matlab, and I also need to read it with the HDF5 Java object API for a separate application that I maintain. I don't see a straightforward way to even figure out what the field names in my original dataset are; they appear embedded inside larger serialized strings in some of the attributes, but nothing I can easily parse. In the HDF5 C API there are H5TB functions like H5TBread_fields_name that look like they would do this, but I don't see an equivalent in the Java API, and I don't see anything in Matlab's documentation either. (I'm using Matlab R2012b.)
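One direction I'm considering, based on my reading of the pandas to_hdf/HDFStore docs, is passing data_columns=True when writing; if I understand it correctly, that should make each data frame column its own named member of the compound type rather than packing same-dtype columns into values_block_N arrays (though I'm not sure how PyTables handles the '$' in my column names). A rough sketch of that variant:

import pandas as pd

def write_to_hdf(data, filename, first_time):
    data_frame = pd.DataFrame.from_dict(data)
    # data_columns=True asks pandas/PyTables to store each column as its own
    # named, queryable field in the table's compound type, instead of packing
    # same-dtype columns into anonymous values_block_N arrays.
    data_frame.to_hdf(filename, 'data',
                      mode='w' if first_time else 'a',
                      format='table', append=True, data_columns=True)
    del data_frame

If that works, reading the table from Matlab with h5read, or by field name through the Java object API, presumably becomes much simpler, but I'd still like to know how to correctly read files already written in the values_block layout.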
Any help with reading this table correctly from the HDF5 file in Matlab and/or through the Java object API is appreciated.
Thank you.