combining several datasets to make one large dataset

Hi,

I have a 4 GB binary dump of data that I'd like to store as an HDF5 dataset
(using command-line tools if possible). The dataset will have the
dimensions 31486448 x 128. I believe this is too big to import as a dataset
in one go.

Running h5import gives the following error:
Unable to allocate dynamic memory.
Error in allocating unsigned integer data storage.
Error in reading the input file: my_data
Program aborted.

So I split the binary dump into four files, each small enough to import. I'd
still like to end up with one 31486448 x 128 dataset, but I'm not sure that's
possible.

Any idea how I could combine these four binary dumps into one dataset?
Maybe create a single dataset and append each small one...?
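
In case it helps to see what I mean, here's a rough sketch of that idea using
h5py (h5py rather than the command-line tools, and the part file names, the
unsigned 8-bit type and the even split along the first axis are all guesses
on my part):

import numpy
import h5py

rows, cols = 31486448, 128
parts = ['my_data_part1', 'my_data_part2', 'my_data_part3', 'my_data_part4']  # made-up names
rows_per_part = rows // len(parts)  # assumes the dump was split evenly along the first axis

with h5py.File('combined.h5', 'w') as fid:
    dset = fid.create_dataset('my_data', shape=(rows, cols), dtype='uint8')
    for n, fname in enumerate(parts):
        # read one quarter of the dump (row-major layout assumed), about 1 GB at a time
        block = numpy.fromfile(fname, dtype='uint8').reshape(rows_per_part, cols)
        # write it into its block of rows in the full-size dataset
        dset[n * rows_per_part:(n + 1) * rows_per_part, :] = block
        print "Wrote", fname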

Thanks,

Ryan

Ryan,

Ryan Price wrote:

I have a 4 GB binary dump of data that I'd like to store as
an HDF5 dataset (using command-line tools if possible). The dataset
will have the dimensions 31486448 x 128. I believe this is too big to
import as a dataset in one go.

4 GB is a large array. You may wish to give some thought to how the
data will be used after you have created the file. Will the end user
really process all 4 GB at once? HDF5 provides chunking and compression
functionality which (transparently to the data reader, and almost
transparently to the writer) will store the data in "chunks" and
compress them as well, if you'd like. If you can make the chunk size
close to the amount of data the end user will want to access, it can be
very convenient.
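
Incidentally, once a file exists you can check what chunking and compression a
dataset actually got, either with h5dump -p from the command line or from h5py
itself. A small check along these lines, using the BigArray.h5 file that the
script below creates, would print them:

import h5py

with h5py.File('BigArray.h5', 'r') as fid:
    dset = fid['BigArray']
    print "chunks:     ", dset.chunks       # (31486448, 1) -- one full column per chunk, ~30 MB of int8 before compression
    print "compression:", dset.compression  # 'gzip'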

Here is a piece of Python I wrote to demonstrate creating a file with
chunking and compression. I was able to open the file and view the
dataset properties in HDFView, but not the dataset itself because the
array is so large. You can use this code if the major axis of your
binary dump is in the 128 direction. If it is in the other direction
you’ll probably want to choose different chunking parameters, and read
the binary data off disk appropriately. (By the way, I get a 3.8MB
file since the array contains a single value and I’ve turned on
compression.)

import numpy; import h5py

with h5py.File('BigArray.h5','w') as fid:
    dset = fid.create_dataset('BigArray', shape=(31486448, 128), dtype='int8',
                              chunks=(31486448, 1), compression='gzip')
    slicearray = numpy.ones([31486448], dtype='int8')  # one-D array, of length 31486448
    for i in range(128):
        # replace this comment with read from binary file into slicearray
        print "Writing", i
        dset[:, i] = slicearray  # populate slice i of the HDF5 dataset.
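
For the read itself, one possibility (just a sketch; it assumes the dump is the
single file named my_data from your h5import run, and that each contiguous run
of 31486448 bytes in it is one column, i.e. the 128-major layout described
above) is to pull successive columns off disk with numpy.fromfile inside the
same loop:

import numpy; import h5py

with h5py.File('BigArray.h5', 'w') as fid:
    dset = fid.create_dataset('BigArray', shape=(31486448, 128), dtype='int8',
                              chunks=(31486448, 1), compression='gzip')
    with open('my_data', 'rb') as raw:           # the original 4 GB dump
        for i in range(128):
            # each call reads the next 31486448 bytes off disk, i.e. one column (~30 MB)
            slicearray = numpy.fromfile(raw, dtype='int8', count=31486448)
            print "Writing", i
            dset[:, i] = slicearray              # populate slice i of the HDF5 dataset

Read that way, the 4 GB dump is streamed in roughly 30 MB pieces, so nothing
ever needs the whole array in memory, which is what h5import was choking on.
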
Cheers,

--dan


-- Daniel Kahn
Science Systems and Applications Inc.
301-867-2162

Thanks Dan. This was very helpful.

Ryan
