h5py -- most efficient way to load a CSV into HDF5

Hello All,

I have a very large CSV file (14 GB) and I am planning to move all of my
data to HDF5. I am using h5py to load the data. The biggest problem I
am having is that I am putting the entire file into memory and then
creating a dataset from it. This is very inefficient, and it takes over
4 hours to create the HDF5 file.

The CSV file's columns have the following types:
int4, int4, str, str, str, str, str

I was wondering if anyone knows of any techniques to load this file faster?

TIA

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi,

Since you're using Python you should investigate the functions
numpy.fromfile and numpy.loadtxt. The biggest thing you should worry
about is finding a way to iterate over rows in the input file. You
can create an HDF5 dataset with the proper size and dtype, and then
fill it in row by row as you read records in from the csv file. That
way you avoid having to load the entire file into memory.

As far as the datatypes, if all the rows of your CSV have the same
fields, the dtype for the HDF5 file should be something like:

vl_str = h5py.new_vlen(str)
mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'),
                       ('Field3', vl_str), ('Field4', vl_str), ... <the rest> ...])

This strategy will create an HDF5 dataset whose elements are a
compound type with two integers and five variable-length strings.
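
A minimal sketch of the row-by-row idea with that kind of dtype (the file
names, the NROWS estimate, and the exact field list are placeholders, not
anything specific to your data):

import gzip
import numpy
import h5py

vl_str = h5py.new_vlen(str)
mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'),
                       ('Field3', vl_str), ('Field4', vl_str),
                       ('Field5', vl_str), ('Field6', vl_str),
                       ('Field7', vl_str)])

NROWS = 1000000                    # placeholder: known (or estimated) row count
f = h5py.File('out.hdf5', 'w')     # placeholder output path
ds = f.create_dataset('data', (NROWS,), dtype=mydtype)

for i, line in enumerate(gzip.open('input.csv.gz')):   # placeholder input path
    p = line.rstrip('\n').split(',')
    # have NumPy turn the parsed fields into one record of the compound type
    row = numpy.array((int(p[0]), int(p[1]), p[2], p[3], p[4], p[5], p[6]),
                      dtype=mydtype)
    ds[i] = row

f.close()

Writing one row at a time keeps memory use low, although writing batches of
rows (see the chunked sketch further down the thread) is considerably faster.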

HTH,
Andrew


Thanks for the response

On Tue, Jun 23, 2009 at 2:46 AM, Andrew Collette <andrew.collette@gmail.com> wrote:

> Since you're using Python you should investigate the functions
> numpy.fromfile and numpy.loadtxt. The biggest thing you should worry
> about is finding a way to iterate over rows in the input file. You
> can create an HDF5 dataset with the proper size and dtype, and then
> fill it in row by row as you read records in from the csv file. That
> way you avoid having to load the entire file into memory.

Correct, this is the way I am trying to do it, but do I have to worry
about resizing? Each file has a different number of rows.

numpy.fromfile and numpy.loadtxt actually load everything into memory.
That's the way I am doing it now, and it's very inefficient.

Do you have any sample code that reads line by line and then pushes the
data into an HDF5 file?
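
(A note on the resizing worry: if the dataset is created with an unlimited
maxshape, it can be grown to whatever row count each file turns out to have.
A minimal sketch, with 'data', mydtype, and final_row_count as placeholder
names:)

ds = f.create_dataset('data', (0,), maxshape=(None,), dtype=mydtype)
# ... read the file, calling ds.resize() as the data grows ...
ds.resize(final_row_count, axis=0)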


Here is some code I have,

import numpy as np
from numpy import *

import gzip
import h5py
import re
import sys, string, time, getopt
import os

src = sys.argv[1]
fs = gzip.open(src)
x = src.split("/")
filename = x[len(x) - 1]

# Get YYYY/MM/DD format
YYYY = (filename.rsplit(".", 2)[0])[0:4]
MM = (filename.rsplit(".", 2)[0])[4:6]
DD = (filename.rsplit(".", 2)[0])[6:8]

f = h5py.File('/tmp/test_foo/FE.hdf5', 'w')

grp = "/" + YYYY
try:
  f.create_group(grp)
except ValueError:
  print "Year group already exists"

grp = grp + "/" + MM
try:
  f.create_group(grp)
except ValueError:
  print "Month group already exists"

grp = grp + "/" + DD
try:
  group = f.create_group(grp)
except ValueError:
  print "Day group already exists"

str_type = h5py.new_vlen(str)
mydescriptor = {'names': ('gender', 'age', 'weight'),
                'formats': ('S1', 'f4', 'f4')}
print "Filename is: ", src
fs = gzip.open(src)

# The dataset has to be created resizable (maxshape) for the resize() calls below
dset = f.create_dataset('Foo', (0,), maxshape=(None,),
                        dtype=mydescriptor, compression='gzip')

s = 0
for y in fs:
  a = y.split(',')
  s = s + 1
  dset.resize(s, axis=0)
  # dset[s-1] = ...   (row assignment omitted here)
fs.close()

f.close()

This works but just takes a VERY long time.


Your code can certainly be made more efficient. Python is
notoriously slow with explicit for loops. NumPy has routines to load
ASCII/CSV data into a NumPy array, which you can then write to HDF5.
Consider using chunks rather than the whole file in a single shot; a
sketch of that approach is below.

You might also be able to use 'h5import', which is one of the
command-line utilities available from the HDF web site.
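
A rough sketch of that chunked NumPy route: numpy.loadtxt accepts a list of
lines, so it can be handed a fixed-size batch at a time. The dtype, batch
size, and file names below are placeholders, and the string columns are
fixed-width because loadtxt needs a concrete string size:

import gzip
import itertools
import numpy as np
import h5py

rowtype = np.dtype([('a', 'i4'), ('b', 'i4'),
                    ('c', 'S16'), ('d', 'S16'), ('e', 'S16'),
                    ('f', 'S16'), ('g', 'S16')])
BATCH = 10000

h5 = h5py.File('out.hdf5', 'w')
ds = h5.create_dataset('data', (0,), maxshape=(None,), dtype=rowtype)

fs = gzip.open('input.csv.gz')
total = 0
while True:
    lines = list(itertools.islice(fs, BATCH))    # next batch of raw lines
    if not lines:
        break
    arr = np.atleast_1d(np.loadtxt(lines, dtype=rowtype, delimiter=','))
    ds.resize(total + len(arr), axis=0)          # grow the dataset by one batch
    ds[total:total + len(arr)] = arr             # write the parsed batch
    total += len(arr)

fs.close()
h5.close()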


Hi,

> Here is some code I have,

There are a couple of ways you can speed this up. First, resizing a
dataset in HDF5 can be expensive, especially since you're doing it for
each line you read. You will have more success if you create a
dataset with enough "rows" to begin with and then adjust the size as
necessary (the dataset needs maxshape set for later resizing to work):

ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression='gzip')

Second, you can try using a lower compression level, like
"compression=1", and see if that helps. You may even be able to avoid
using compression altogether, since the HDF5 format is already more
efficient than CSV.

Third (as Peter Alexander mentioned), it's almost certainly more
efficient to read in your CSV in chunks. I'm not personally familiar
with the NumPy methods for reading in CSV data, but pseudocode for
this would be:

ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression=1)

offset = 0
for each group of 100 lines in the file:
    arr = (use NumPy to load a (100,) chunk of data from the file)
    if offset + 100 > NROWS:
        ds.resize(offset + 100, axis=0)
    ds[offset:offset+100] = arr
    offset += 100

ds.resize(<final row count>, axis=0)
myfile.close()
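
A concrete version of that pseudocode, parsing each line by hand so that the
variable-length string fields work too (the field layout, chunk size, and
file names are placeholders):

import gzip
import itertools
import numpy
import h5py

vl_str = h5py.new_vlen(str)
mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'),
                       ('Field3', vl_str), ('Field4', vl_str),
                       ('Field5', vl_str), ('Field6', vl_str),
                       ('Field7', vl_str)])

NROWS = 1000000      # placeholder guess at the total row count
CHUNK = 100

def convert(line):
    # split one CSV line and coerce each field to its declared type
    p = line.rstrip('\n').split(',')
    return (int(p[0]), int(p[1]), p[2], p[3], p[4], p[5], p[6])

myfile = h5py.File('out.hdf5', 'w')
ds = myfile.create_dataset('ds', (NROWS,), mydtype,
                           maxshape=(None,), compression=1)

fs = gzip.open('input.csv.gz')
offset = 0
while True:
    rows = [convert(line) for line in itertools.islice(fs, CHUNK)]
    if not rows:
        break
    arr = numpy.array(rows, dtype=mydtype)       # a (<=CHUNK,) block of records
    if offset + len(arr) > ds.shape[0]:
        ds.resize(offset + len(arr), axis=0)     # grow only if the guess was too small
    ds[offset:offset + len(arr)] = arr
    offset += len(arr)

ds.resize(offset, axis=0)    # trim back to the real number of rows
fs.close()
myfile.close()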

Last (although it won't affect performance), you can replace the pattern:

try:
    group = f.create_group(grp)
except ValueError:
    print "Day group already exists"

with this:

group = f.require_group(grp)

Since you have your input data in a gzipped csv file, it would also be
instructive simply to run:

$ time gunzip -c myfile.csv.gz > /dev/null

and see how much of your time is spent simply unzipping the input file. :)

Andrew


Great. Thank you all for the advice.

I will try this out.


Andrew:

I am having a terribly hard time loading a CSV file into an array in general.

For example:
zcat file.csv.gz
red,blue,1,2,3
green,orange,3,2,1
blue,black,2,1,3

I would like to put this into a 3x5 matrix with these column types:
string, string, int4, int4, float4.

Any ideas? I looked through the examples, but there aren't any that
cover mixed column types or reading line by line.

Here is what I have so far:

mydtype = {'names': ('color0', 'color1', 'num1', 'num2', 'num3'),
           'formats': ('S8', 'S8', 'i4', 'i4', 'f4')}
ds = f.create_dataset("Foo", (5,), compression=1, dtype=mydtype)

# 'reader' here is a csv.reader (or similar) over the unzipped lines
for s, row in enumerate(reader):
    ds[s] = row  # Does not work
f.close()

Any ideas on how I can place this file into ds? In addition, I would
like to use your optimization.


Also:

> ds[offset:offset+100] = arr

That will load the entire set into memory, which could be costly...


Hi,

> ds[offset:offset+100] = arr
>
> That will load the entire set into memory, which could be costly...

No, it will assign the contents of "arr" to a 100-element slice of the
dataset. This is one of the features of HDF5; you don't have to load
the whole thing into memory to modify it. As far as turning a line of
csv into a dataset, you need to have NumPy turn each line of text into
an array element, so it can be stored in the dataset. It won't happen
automatically. One function I bumped into that does that is this one:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromregex.html

Otherwise, you can use line.split(",") and manually convert each one,
as in your previous example. It will be slow, but it will work.
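
For the three-line example you posted (two strings, two integers, and a float
per line), a rough sketch of the fromregex route could look like this; the
regular expression and the 8-character string size are just guesses for
illustration:

import numpy as np
import h5py

# one capture group per column of the sample data
pattern = r'([^,\n]+),([^,\n]+),(\d+),(\d+),([^,\n]+)'
mydtype = np.dtype([('color0', 'S8'), ('color1', 'S8'),
                    ('num1', 'i4'), ('num2', 'i4'), ('num3', 'f4')])

# fromregex reads the file and returns one structured record per match
arr = np.fromregex('file.csv', pattern, mydtype)   # after gunzipping the sample

f = h5py.File('colors.hdf5', 'w')
ds = f.create_dataset('Foo', data=arr, compression=1)
f.close()

This reads the whole file at once, which is fine for a small sample; for the
full 14 GB file you would combine the same per-line conversion with the
chunked slice writes sketched earlier in the thread.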

Andrew

PS: We can continue this discussion in private email if you want. I'm
not sure that Python-side CSV translation is a burning priority for
the HDF folks. :)
