pytables csv conversion problems

sarah.s.jaworski · December 22, 2015, 8:47pm

Hi,
I have a Python script that essentially converts a CSV file to a PyTables (HDF5) file, but I have a few problems that I haven't been able to solve.

1. Column Order - I don't know the columns that I need to write until runtime, so creating an extension of the IsDescription class was a non-starter. Therefore, to define my columns, I am passing a dictionary that maps each column name to its column class into the create_table method:

      h5_file = tables.open_file(filename, mode='w', title='Test File')
      group = h5_file.create_group('/', 'data', 'Data Group')
      column_dict = OrderedDict()
      for key in column_names:
          column_dict[key] = create_col(key)

      table = h5_file.create_table(group, 'table', column_dict, 'Table')

create_col is simply a method that returns Int32Col(), Float64Col(), etc., depending on some information about the column. That is working fine. However, the columns in the table that are created are not in the order that I want. I used OrderedDict to ensure that the columns are in the dictionary in insertion order, but the table doesn't reflect this. Any ideas on how to control the column order if I can't extend IsDescription to create my data type?

2. Variable length strings - Strings work fine when I give them a maximum size. This was fine to get something up and running, but the strings really need to be variable length. Is there a way to have VLString columns within a table? I see examples of VLStringAtom being passed as a type to h5file.create_array, but I don't see similar examples for table columns and there isn't a Col class for this type. Any help is appreciated.

3. "Blanks" in my csv file - The csv files I'm converting contain null or blank values. If you imagine loading the file in Excel or a similar program, some cells will be blank. So, even if column X is an Int32Col, there may be blanks. How would I handle this using PyTables? I suppose I can substitute some value for blank cells, but I would like to avoid that if possible.

Help on any of these items is greatly appreciated. I know that using h5py (would I have to use the low-level API?) instead of pytables would probably solve these problems, but am trying to avoid that since pytables has otherwise been so easy to use.

Thanks in advance,
Sarah

faltet · December 22, 2015, 10:56pm

Hi,

> I have a Python script that essentially converts a CSV file to a PyTables
> (HDF5) file, but I have a few problems that I haven’t been able to solve.
>
> 1. Column Order – I don’t know the columns that I need to write
> until runtime, so creating an extension of the IsDescription class was a
> non-starter. Therefore, to define my columns, I am passing a dictionary
> that maps each column name to its column class into the create_table
> method:
>
>       h5_file = tables.open_file(filename, mode='w', title='Test File')
>       group = h5_file.create_group('/', 'data', 'Data Group')
>       column_dict = OrderedDict()
>       for key in column_names:
>           column_dict[key] = create_col(key)
>
>       table = h5_file.create_table(group, 'table', column_dict, 'Table')
>
> create_col is simply a method that returns Int32Col(), Float64Col(), etc.,
> depending on some information about the column. That is working fine.
> However, the columns in the table that are created are not in the order
> that I want. I used OrderedDict to ensure that the columns are in the
> dictionary in insertion order, but the table doesn’t reflect this. Any
> ideas on how to control the column order if I can’t extend IsDescription to
> create my data type?

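Yes. The `pos` parameter of the Col classes (Int32Col(pos=0),
Float64Col(pos=1), etc.) is your friend here: it sets the position of each
column in the table. Columns without an explicit position are not
guaranteed to follow the order of the dictionary. See an example here: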

http://www.pytables.org/usersguide/libref/structured_storage.html#table-methods-writing
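
As a rough sketch of the `pos` approach applied to a dictionary description built at runtime: the `create_col` below is only a hypothetical stand-in for the poster's helper (the name-suffix rules and the column names are made up for illustration), extended to accept the position and pass it on to the Col constructor.

```python
import os
import tempfile

import tables


def create_col(name, pos):
    """Hypothetical stand-in for the poster's helper: pick a column type
    from the name and pin its position in the table with `pos`."""
    if name.endswith("_id"):
        return tables.Int32Col(pos=pos)
    if name.endswith("_name"):
        return tables.StringCol(16, pos=pos)
    return tables.Float64Col(pos=pos)


# Column names known only at runtime, deliberately not in alphabetical order.
column_names = ["zeta_id", "alpha_name", "mid_value"]

# A plain dict is enough: the order comes from `pos`, not from the dict.
description = {
    name: create_col(name, pos) for pos, name in enumerate(column_names)
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ordered.h5")
    with tables.open_file(path, mode="w", title="Test File") as h5_file:
        group = h5_file.create_group("/", "data", "Data Group")
        table = h5_file.create_table(group, "table", description, "Table")
        stored_order = list(table.colnames)

print(stored_order)
```

With every column given an explicit `pos`, `table.colnames` should come back in the same order as `column_names`.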

> 2. Variable length strings – Strings work fine when I give them a
> maximum size. This was fine to get something up and running, but the
> strings really need to be variable length. Is there a way to have VLString
> columns within a table? I see examples of VLStringAtom being passed as a
> type to h5file.create_array, but I don’t see similar examples for table
> columns and there isn’t a Col class for this type. Any help is appreciated.

No, PyTables has no provision for variable-length strings in Table
instances (datasets with compound types, in HDF5 parlance), mainly because
of the additional performance overhead that handling variable-length
fields would require. For the cases where you absolutely need them, the
general advice is to keep a Table plus one or more separate VLArray
instances whose rows are in the same order as the table's. Then it is just
a matter of retrieving items from the VLArray instances as needed.
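
A minimal sketch of the Table-plus-VLArray layout described above, assuming one variable-length text field per row. The node names (`records`, `comments`), the `Record` description and the sample data are invented for illustration, and VLUnicodeAtom is used here so that ordinary Python strings can be appended directly.

```python
import os
import tempfile

import tables


class Record(tables.IsDescription):
    # Fixed-size fields live in the table.
    record_id = tables.Int32Col(pos=0)
    score = tables.Float64Col(pos=1)


# Illustrative data: the last field is the variable-length string.
rows = [
    (1, 0.5, "short"),
    (2, 1.5, "a considerably longer comment"),
    (3, 2.5, "mid-sized text"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vl_strings.h5")

    with tables.open_file(path, mode="w") as h5_file:
        table = h5_file.create_table("/", "records", Record)
        comments = h5_file.create_vlarray(
            "/", "comments", tables.VLUnicodeAtom()
        )
        row = table.row
        for record_id, score, comment in rows:
            row["record_id"] = record_id
            row["score"] = score
            row.append()
            # Same append order as the table, so row i <-> comments[i].
            comments.append(comment)
        table.flush()

    with tables.open_file(path, mode="r") as h5_file:
        index = 1  # look up the second row in both nodes
        second_id = int(h5_file.root.records[index]["record_id"])
        second_comment = h5_file.root.comments[index]

print(second_id, second_comment)
```

The link between the two nodes is purely positional, so any code that deletes or reorders table rows would have to do the same to the VLArray.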

> 3. “Blanks” in my csv file – The csv files I’m converting contain
> null or blank values. If you imagine loading the file in Excel or a
> similar program, some cells will be blank. So, even if column X is an
> Int32Col, there may be blanks. How would I handle this using PyTables? I
> suppose I can substitute some value for blank cells, but I would like to
> avoid that if possible.

There are different approaches for this. For example, you can use the
IEEE NaN (Not a Number) representation, but that requires the column to be
a float. Another approach is to reserve a special Int32 value that will
never match any of your input values (something like -2**31) to represent
a 'NaN'. Handling these special values would be the responsibility of your
code, as neither PyTables nor (I think) HDF5 does anything special with
them.
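
Both approaches can be sketched at the CSV-parsing stage, before anything reaches PyTables. The column names, the sample data and the helper names below are made up for illustration, and the sentinel is only safe if -2**31 never occurs as a real value in the input.

```python
import csv
import io
import math

# Sentinel standing in for "blank" in Int32 columns (assumed unused by real data).
INT32_MISSING = -2**31


def parse_int(cell):
    """Return the integer in the cell, or the sentinel for a blank cell."""
    cell = cell.strip()
    return int(cell) if cell else INT32_MISSING


def parse_float(cell):
    """Return the float in the cell, or NaN for a blank cell."""
    cell = cell.strip()
    return float(cell) if cell else math.nan


# Illustrative CSV with one blank in each column.
sample = "count,weight\n3,1.5\n,2.0\n7,\n"

parsed = [
    (parse_int(record["count"]), parse_float(record["weight"]))
    for record in csv.DictReader(io.StringIO(sample))
]

# When reading back, the calling code has to filter the markers itself.
valid_counts = [count for count, _ in parsed if count != INT32_MISSING]
valid_weights = [weight for _, weight in parsed if not math.isnan(weight)]

print(parsed)
```

The parsed tuples can then be written to Int32Col and Float64Col columns as usual; the stored file carries no record of which values are markers, so the convention needs to be documented alongside the data (for example in the table title).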


--
Francesc Alted
