Andrew:
I am having a terrible time loading a CSV file into an array.
For example:
zcat file.csv.gz
red,blue,1,2,3
green,orange,3,2,1
blue,black,2,1,3
I would like to put this into a 3x5 matrix with these types: "string,
string, int4, int4, float4".
Any ideas? I looked through the examples, but there aren't any that cover
mixed data types or reading a file line by line.
Here is what I have so far:
mydtype = {'names': ('color0', 'color1', 'num1', 'num2', 'num3'),
           'formats': ('S8', 'S8', 'i4', 'i4', 'f4')}
ds = f.create_dataset("Foo", (5,), compression=1, dtype=mydtype)
for s, row in enumerate(reader):
    ds[s] = row  # Does not work
f.close()
Any ideas on how I can place this file into ds? In addition, I would
like to use your optimization.
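For reference, here is a minimal sketch of one way the row-by-row approach could work, assuming h5py plus the standard csv and gzip modules; the file and dataset names follow the example above, and the key step is converting each parsed row (a list of strings) to a typed tuple before assignment:

import csv
import gzip
import numpy as np
import h5py

# Compound dtype for the five CSV columns: two strings, two ints, one float.
mydtype = np.dtype({'names':   ('color0', 'color1', 'num1', 'num2', 'num3'),
                    'formats': ('S8', 'S8', 'i4', 'i4', 'f4')})

with h5py.File("foo.h5", "w") as f:                    # output file name is illustrative
    ds = f.create_dataset("Foo", (3,), dtype=mydtype, compression=1)   # 3 rows in the example
    with gzip.open("file.csv.gz", "rt") as csvfile:
        reader = csv.reader(csvfile)
        for s, row in enumerate(reader):
            # Convert the list of strings to a tuple matching the compound dtype.
            ds[s] = (row[0], row[1], int(row[2]), int(row[3]), float(row[4]))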
On Wed, Jun 24, 2009 at 7:19 PM, Mag Gam<magawake@gmail.com> wrote:
Great. Thank you all for the advice.
I will try this out.
On Wed, Jun 24, 2009 at 2:49 PM, Andrew Collette<andrew.collette@gmail.com> wrote:
Hi,
Here is some code I have,
There are a couple of ways you can speed this up. First, resizing a
dataset in HDF5 can be expensive, especially since you're doing it for
each line you read. You will have more success if you create a
dataset with enough "rows" to begin with and then adjust the size as
necessary:
ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression='gzip')
Second, you can try using a lower compression level, like
"compression=1", and see if that helps. You may even be able to avoid
using compression altogether, since the HDF5 format is more efficient
than CSV.
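For instance (the dataset name, NROWS, and mydtype are placeholders carried over from above), the three options differ only in the compression arguments passed to create_dataset:

# gzip at its default level, gzip at the lowest level 1, or no compression at all
ds_gzip = myfile.create_dataset("ds_gzip", (NROWS,), mydtype, compression='gzip')
ds_fast = myfile.create_dataset("ds_fast", (NROWS,), mydtype, compression=1)
ds_none = myfile.create_dataset("ds_none", (NROWS,), mydtype)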
Third, as Peter Alexander mentioned, it's almost certainly
more efficient to read in your CSV in chunks. I'm not personally
familiar with the NumPy methods for reading in CSV data, but
pseudocode for this would be:
ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression=1)
offset = 0
for each group of 100 lines in the file:
    arr = (use NumPy to load a (100,) chunk of data from the file)
    if offset + 100 > ds.shape[0]:
        ds.resize(offset + 100, axis=0)
    ds[offset:offset+100] = arr
    offset += 100
ds.resize(<final row count>, axis=0)
myfile.close()
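A concrete version of that pseudocode, as a sketch only, might use numpy.genfromtxt to parse the CSV in fixed-size chunks; the chunk size, the initial NROWS guess, and the file names here are assumptions:

import itertools
import gzip
import numpy as np
import h5py

CHUNK = 100
NROWS = 10000          # initial guess; the dataset is grown and trimmed below

mydtype = np.dtype({'names':   ('color0', 'color1', 'num1', 'num2', 'num3'),
                    'formats': ('S8', 'S8', 'i4', 'i4', 'f4')})

with h5py.File("out.h5", "w") as myfile:
    # maxshape=(None,) makes the dataset resizable along axis 0
    ds = myfile.create_dataset("ds", (NROWS,), dtype=mydtype,
                               maxshape=(None,), compression=1)
    offset = 0
    with gzip.open("myfile.csv.gz", "rt") as csvfile:
        while True:
            # Read up to CHUNK lines and parse them with one genfromtxt call.
            lines = list(itertools.islice(csvfile, CHUNK))
            if not lines:
                break
            arr = np.atleast_1d(np.genfromtxt(lines, delimiter=",", dtype=mydtype))
            if offset + len(arr) > ds.shape[0]:
                ds.resize(offset + len(arr), axis=0)
            ds[offset:offset + len(arr)] = arr
            offset += len(arr)
    ds.resize(offset, axis=0)   # trim to the actual row count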
Last (although it won't affect performance), you can replace the pattern:
try:
    group = f.create_group(grp)
except ValueError:
    print "Day group already exists"
with this:
group = f.require_group(grp)
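As a tiny illustration (the group name is made up), require_group opens the group if it already exists and creates it otherwise, so it is safe to call repeatedly:

grp = "2009-06-24"              # hypothetical day-group name
group = f.require_group(grp)    # creates the group on the first call
group = f.require_group(grp)    # later calls just return the existing group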
Since you have your input data in a gzipped csv file, it would also be
instructive simply to run:
$ time gunzip -c myfile.csv.gz > /dev/null
and see how much of your time is spent simply unzipping the input file. 
Andrew
----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.
----------------------------------------------------------------------