CSV data into HDF5 data structure and files

nitinchandra1 · January 29, 2017, 4:47pm

Hi All,

I have been reading the HDF documentation and samples for a long time
(whenever possible), but I guess that until I get my feet wet, I will
not gain even the basic experience :).

I am on 64-bit Linux Mint, with Python and the HDF libraries already
installed. I need help with coding in both Python and C++, separately,
from the command line.

Background: The first data set is a road alignment centre line (File1,
irregular intervals); the second is a grade file (File2, also irregular
intervals).

Objective: File1 - File2 = TempFile1

TempFile1 should contain all Km's in meters (both the intermediate and
the regular intervals) and their respective heights.

I then want to plot TempFile1, overlaid on the data of File1 and File2.
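
One possible reading of this objective, sketched in Python with NumPy and pandas: interpolate both profiles onto a common set of stations (a regular grid plus the original irregular points) and subtract the heights. The column names `chainage_m` and `height_m`, the 10 m step, the sample values, and the use of linear interpolation are all assumptions, since the real files are not shown in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real File1/File2 contents are not shown in the thread.
align = pd.DataFrame({"chainage_m": [0.0, 35.0, 80.0, 150.0],
                      "height_m": [100.0, 101.5, 103.0, 102.0]})
grade = pd.DataFrame({"chainage_m": [0.0, 150.0],
                      "height_m": [99.0, 102.0]})


def difference_on_grid(align, grade, step_m=10.0):
    """Interpolate both profiles onto shared stations and subtract the heights."""
    a_x = align["chainage_m"].to_numpy()
    g_x = grade["chainage_m"].to_numpy()
    # Only the overlapping range is covered by both profiles.
    start = max(a_x.min(), g_x.min())
    stop = min(a_x.max(), g_x.max())
    regular = np.arange(start, stop + step_m / 2, step_m)
    # Regular stations plus every original (irregular) station, sorted and unique.
    stations = np.union1d(np.union1d(regular, a_x), g_x)
    stations = stations[(stations >= start) & (stations <= stop)]
    # np.interp assumes the chainages are sorted in increasing order.
    h_align = np.interp(stations, a_x, align["height_m"].to_numpy())
    h_grade = np.interp(stations, g_x, grade["height_m"].to_numpy())
    return pd.DataFrame({"chainage_m": stations,
                         "align_m": h_align,
                         "grade_m": h_grade,
                         "diff_m": h_align - h_grade})


temp = difference_on_grid(align, grade)
print(temp)
```

The resulting frame holds both input profiles and their difference at the same stations, so it could serve as TempFile1 and be plotted against File1 and File2 directly.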

Q) Will my working directory be the "/" directory?

Q) Where, and how, do I create File1, File2 and TempFile?

Q) What parameters do I need to set to write data into each file, and
then to read it back?

Q) How do I edit data in a file directly, e.g. add or remove columns
in each file?

I do understand that the calculation itself will be done at the code
(Python or C++) level.

I have sample data; please let me know when to post it.

Thank you

Nitin Chandra

Francesc_Alted · January 30, 2017, 11:44am

Hi Nitin,

I think that before getting into details, you need to look into how to efficiently read data from CSV files and write it into HDF5 in Python. For this, pandas is a great library to use. My advice is to have a look at the excellent documentation on the pandas website:

http://pandas.pydata.org/pandas-docs/stable/io.html

In particular, you want to use `pandas.read_csv()`, which is one of the fastest ways to read CSV files that I am aware of. Also, for storing the data in HDF5, `pandas.HDFStore()` comes in handy because it can generate HDF5 files out of pandas DataFrames. In addition, to avoid loading all the data into an in-memory DataFrame at once, you want to use the `chunksize` keyword, which lets you read the CSV files in chunks before storing them.
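
A minimal sketch of this workflow, for illustration only (it is not the attached csv_demo.py). The file names, the key `data`, the column names and the chunk size are assumptions; `pandas.HDFStore` also needs the PyTables package (`tables`) installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, key="data", chunksize=10_000):
    """Stream a CSV file into one HDF5 table, chunk by chunk; return rows written."""
    rows = 0
    with pd.HDFStore(h5_path, mode="w") as store:
        # With chunksize set, read_csv yields DataFrames instead of one big frame.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            # append() creates the table on the first call and extends it afterwards.
            store.append(key, chunk, index=False)
            rows += len(chunk)
    return rows


# Build a throwaway CSV with random numbers so the sketch runs on its own.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "demo.csv")
h5_path = os.path.join(workdir, "demo.h5")
columns = [f"c{i}" for i in range(10)]
pd.DataFrame(np.random.rand(25_000, 10), columns=columns).to_csv(csv_path, index=False)

n_rows = csv_to_hdf5(csv_path, h5_path)
df = pd.read_hdf(h5_path, "data")
print(n_rows, df.shape)
```

Only one chunk is held in memory at a time while writing, which is the point of `chunksize`.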

I have prepared an example for you (attached) so that you can have a look at how to use all of this (it is simpler than it may seem). Here is the output on my machine:

$ python csv_demo.py
CSV creation time: 1.491 (67.092 Krow/s)
CSV reading time: 0.134 (748.360 Krow/s)
HDF5 store time: 0.322 (310.228 Krow/s)
HDF5 read time: 0.006 (15622.990 Krow/s)

so, once the data is stored in HDF5, the read times will be much faster than using CSV (as expected).

HTH,

Francesc

csv_demo.py (1.59 KB)

nitinchandra1 · January 30, 2017, 5:31pm

Thank you Francesc,

Please give me 2-3 days to try your example, do some reading, and run
some tests based on the link you mentioned.

I shall repost soon.

Thank you

Nitin

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

nitinchandra1 · February 1, 2017, 7:04pm

Hi Francesc,

I tried your example as is; I have not had time to modify it or try
anything new.

I ran

$ python csv_demo.py

It created a CSV file with 10 columns, populated with random numbers.

demo.h5 was also created, and I used HDFView 2.9 to look at the
contents of the demo.h5 file.

Two things were created: a directory, table, and a data table, also
named table.

The data table has 2 columns:

index | value_block_0

empty | no value
no data | but 10 commas

So that your guidance can be related to my actual problem, I have
attached 2 sample files.
Also, note the first row in each attached CSV: it was created to
initialise the start point of the data sequence. Would it be good
practice to keep that row in the h5 tables as well? The last column
holds string values, and I need them.

ALIGN data goes into File1 and GRADE data into File2, so I am looking
for a write function to write into the respective tables and a read
function to read from them.
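
A hedged sketch of such a pair of functions, built on `pandas.HDFStore`. The helper names `write_table` and `read_table`, the keys `align` and `grade`, the column names and the sample rows are all hypothetical, because the attached CSV files are not reproduced in this thread. Both tables are placed in one HDF5 file here; two separate files would work the same way.

```python
import os
import tempfile

import pandas as pd


def write_table(h5_path, key, frame):
    """Store a DataFrame as a table under `key`, replacing any earlier copy."""
    with pd.HDFStore(h5_path, mode="a") as store:
        # format="table" keeps the data appendable and queryable later on;
        # data_columns=True stores every column as its own field.
        store.put(key, frame, format="table", data_columns=True)


def read_table(h5_path, key):
    """Load the whole table stored under `key` back into a DataFrame."""
    with pd.HDFStore(h5_path, mode="r") as store:
        return store[key]


# Hypothetical rows; the last ALIGN column holds strings, as in the question.
align = pd.DataFrame({"chainage_m": [0.0, 35.0, 80.0],
                      "height_m": [100.0, 101.5, 103.0],
                      "remark": ["START", "BRIDGE", "END"]})
grade = pd.DataFrame({"chainage_m": [0.0, 150.0],
                      "height_m": [99.0, 102.0]})

h5_path = os.path.join(tempfile.mkdtemp(), "road.h5")
write_table(h5_path, "align", align)
write_table(h5_path, "grade", grade)

align_back = read_table(h5_path, "align")
grade_back = read_table(h5_path, "grade")
print(align_back)
print(grade_back)
```

String columns are stored with a fixed width taken from the data first written; if longer strings may be appended later, passing `min_itemsize` when the table is created is the usual way to reserve room.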

After the data is in the H5 file, can I insert or append a new row,
either between other rows or at the end of the file? Which editor or
method should I use to do that?

Thank you,

Nitin

ALIGN_NewfmtH5.csv (3.28 KB)

GRAD_newfmtH5.csv (143 Bytes)

nitinchandra1 · February 5, 2017, 5:10pm

Hi All,

Any solution would be helpful.

Thank you,

Nitin

Francesc_Alted · February 6, 2017, 8:58am

Hi Nitin,

Yes, you can easily append more rows to HDF5 files generated by pandas using the `HDFStore.append()` method (as shown in the documentation and in my examples).
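
A small sketch of appending, under assumed names (the key `align`, the column names and the values are hypothetical). `HDFStore.append()` adds rows at the end of the table; to get a row "between" others, one option is to read the table, sort it in memory, and write it back, as done at the end of this sketch.

```python
import os
import tempfile

import pandas as pd

h5_path = os.path.join(tempfile.mkdtemp(), "align.h5")

first = pd.DataFrame({"chainage_m": [0.0, 50.0, 100.0],
                      "height_m": [100.0, 101.0, 102.5]})
late = pd.DataFrame({"chainage_m": [25.0], "height_m": [100.4]})

with pd.HDFStore(h5_path, mode="w") as store:
    store.append("align", first, index=False)  # first call creates the table
    store.append("align", late, index=False)   # later calls add rows at the end
    on_disk = store["align"]

    # Reorder in memory and replace the stored table with the sorted version.
    ordered = on_disk.sort_values("chainage_m").reset_index(drop=True)
    store.put("align", ordered, format="table")
    final = store["align"]

print(on_disk["chainage_m"].tolist())
print(final["chainage_m"].tolist())
```

Appended frames need the same columns and compatible dtypes as the table that already exists.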

Regarding visualization, pandas uses its own layout on top of HDF5 to store DataFrames, which is why a standard HDF5 viewer (like HDFView) does not show the table (i.e. compound type) that you might expect. For this, it is better to use pandas itself to read the HDF5 dataset (or parts of it) and then visualize the resulting DataFrame with one of the many existing tools that interact well with pandas:

http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#visualization
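
As a rough illustration of reading the file back with pandas instead of HDFView, the sketch below lists the stored keys and loads only parts of a table. The key `align`, the column names and the values are assumptions. Querying with `where` relies on the table format and on the queried column being declared in `data_columns`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

h5_path = os.path.join(tempfile.mkdtemp(), "align.h5")

# Hypothetical profile: 100 stations, 10 m apart.
frame = pd.DataFrame({"chainage_m": np.arange(0.0, 1000.0, 10.0),
                      "height_m": np.linspace(100.0, 110.0, 100)})

with pd.HDFStore(h5_path, mode="w") as store:
    store.put("align", frame, format="table", data_columns=["chainage_m"])

with pd.HDFStore(h5_path, mode="r") as store:
    keys = store.keys()                             # names of the stored objects
    head = store.select("align", start=0, stop=5)   # first five rows only
    section = store.select(
        "align", where="(chainage_m >= 200) & (chainage_m < 300)"
    )                                               # rows matching a condition

print(keys)
print(head)
print(section)
```

The selected frames are ordinary DataFrames, so they can be passed to whichever plotting tool is chosen.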

Take your time to decide which tool works best for your case. Meanwhile, you can have a glance at the kind of plots that plotly can produce from HDF5 files written by pandas:

https://plot.ly/python/pytables

In general, if you want to proceed down the pandas path, you may want to ask on the pandas mailing list, where far more people will be ready to help you.

Francesc Alted
