Hi,
There are several ways you can speed this up. First, resizing a
dataset in HDF5 can be expensive, especially when you do it once for
every line you read. You will have more success if you create a
dataset with enough "rows" to begin with (and a maxshape that lets it
grow), then adjust the size only occasionally:
ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression='gzip')
Second, you can try a lower compression level, like "compression=1",
and see if that helps. You may even be able to avoid using
compression altogether, since the binary HDF5 format is already more
compact than CSV text.
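For reference, here is a sketch of both variants (in h5py an integer
compression value is shorthand for gzip at that level, and
compression_opts spells the level out explicitly; NROWS and mydtype are
placeholders for whatever your script uses):

# gzip at a low level (1) instead of the default
ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,),
                           compression='gzip', compression_opts=1)
# or skip compression entirely
ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,))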
Third (as Peter Alexander mentioned), it's almost certainly more
efficient to read your CSV in chunks. I'm not personally familiar
with the NumPy methods for reading in CSV data, but pseudocode for
this would be:
ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression=1)
offset = 0
for each group of 100 lines in the file:
    arr = (use NumPy to load a (100,) chunk of data from the file)
    if offset + 100 > ds.shape[0]:
        ds.resize(offset + 100, axis=0)
    ds[offset:offset + 100] = arr
    offset += 100
ds.resize(offset, axis=0)    # offset now holds the final row count
myfile.close()
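If it helps, here is a minimal runnable sketch of that loop (assuming
Python 3; the file names, record dtype, chunk size, and initial NROWS
are placeholders for whatever your script really uses). It reads the
gzipped CSV 100 lines at a time with numpy.genfromtxt:

import gzip
import itertools
import numpy as np
import h5py

CHUNK = 100
NROWS = 1000000                                  # initial guess at the row count
mydtype = np.dtype([('a', 'f8'), ('b', 'f8')])   # placeholder record layout

myfile = h5py.File('out.h5', 'w')
ds = myfile.create_dataset("ds", (NROWS,), mydtype,
                           maxshape=(None,), compression=1)

offset = 0
with gzip.open('myfile.csv.gz', 'rt') as src:
    while True:
        # grab the next CHUNK lines (fewer at the end of the file)
        lines = list(itertools.islice(src, CHUNK))
        if not lines:
            break
        arr = np.atleast_1d(np.genfromtxt(lines, delimiter=',', dtype=mydtype))
        if offset + len(arr) > ds.shape[0]:
            ds.resize(offset + len(arr), axis=0)
        ds[offset:offset + len(arr)] = arr
        offset += len(arr)

ds.resize(offset, axis=0)                        # trim to the actual row count
myfile.close()

Writing 100 (or 1000) rows at a time amortizes both the genfromtxt
parsing overhead and the per-write HDF5 overhead, instead of paying
them once per line.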
Last (although it won't affect performance), you can replace the pattern:
try:
    group = f.create_group(grp)
except ValueError:
    print "Day group already exists"
with this:
group = f.require_group(grp)
Since you have your input data in a gzipped csv file, it would also be
instructive simply to run:
$ time gunzip -c myfile.csv.gz > /dev/null
and see how much of your time is spent simply unzipping the input file. 
Andrew