very large file size

Hello Everybody,

I am working on a large Fortran code and am trying to change the output format
from raw binary files to HDF5 files. However, I have found that the resulting
file size is extremely large compared to the binary files.

The binary output file is about 10 times smaller than the HDF5 file.
I am not sure whether this affects the I/O speed. Can anybody give me any
information on this topic?

These are the questions I have:

1. Will the file size (10 times larger than binary) affect the I/O speed?
2. Can I reduce the file size substantially? I tried using the set_deflate
option during dataset creation, but since the dataset is fully occupied by a
very large vector, the compression does not seem to help.

It is extremely important to reduce the file size, because for larger runs
the binary output file is several gigabytes in size, and I can't afford a 10
times increase in size with HDF5.

3. Does HDF5 store the data in a form similar to ASCII? Even ASCII
files seem to be around 10 times larger than binary files.

Kindly suggest something.

Thanks.

Regards,
Nikhil



How big is your data array? What storage layout are you using: chunked or
contiguous?

The overhead is most likely caused either by the size of your array or by the
storage layout you used.

If your array is only a few KB, or you are using chunked storage with a chunk
size that is very small relative to your data, the file-size overhead can be
very large.

Kent
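
A minimal sketch of that effect, using the HDF5 C++ API (the file names,
element counts, and chunk sizes below are illustrative, not from this thread):
the same 2 MB vector is written once with 128-byte chunks and once with
512 KB chunks, and the per-chunk index metadata makes the first file
noticeably larger.

  #include <vector>
  #include "H5Cpp.h"
  using namespace H5;

  // Write n doubles as a 1-D chunked dataset with the given chunk size.
  static void writeChunked( const char *fileName, hsize_t chunkSize ) {
      const hsize_t n = 262144;                 // 262144 doubles = 2 MB of raw data
      std::vector<double> buf( n, 1.0 );

      H5File file( fileName, H5F_ACC_TRUNC );
      DataSpace space( 1, &n );

      DSetCreatPropList plist;
      plist.setChunk( 1, &chunkSize );          // one chunk-index entry per chunk

      DataSet ds = file.createDataSet( "vec", PredType::NATIVE_DOUBLE, space, plist );
      ds.write( &buf[0], PredType::NATIVE_DOUBLE );
  }

  int main() {
      writeChunked( "tiny_chunks.h5", 16 );     // 16384 chunks of 128 bytes each
      writeChunked( "big_chunks.h5", 65536 );   // 4 chunks of 512 KB each
      return 0;
  }

Comparing the two files (ls -l, or h5stat if available) shows the metadata
cost of many small chunks.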

Depending upon how you define your data structures, that may have a bearing
on the problem. In C++, 8-bit or 16-bit values may be 32-bit aligned; if
you don't pack the structure, the 'dead space' gets stored as well. (That is,
if you are using structures; it has been many, many years since I've done
Fortran.)

For creating a dataset, here are the steps I've taken to get things
compressed. It may not be optimal, but it is better than what I first
started with.

  if ( bNeedToCreateDataSet ) {

    CompType *pdt = DD::DefineDataType(); // my class defines the structure
    pdt->pack(); // get rid of the alignment dross mentioned above

    DataSpace *pds = new H5::DataSpace( H5S_SIMPLE );
    hsize_t curSize = 0;
    hsize_t maxSize = H5S_UNLIMITED; // provide unlimited growth, 1-dim array
    pds->setExtentSimple( 1, &curSize, &maxSize );

    DSetCreatPropList pl;
    hsize_t sizeChunk = CHDF5DataManager::H5ChunkSize(); // constant is defined elsewhere
    pl.setChunk( 1, &sizeChunk ); // chunking allows growth and compression
    pl.setShuffle(); // reorders bytes so runs of leading/trailing zeros compress better
    pl.setDeflate(5); // gzip compression; I have no idea what the optimal level is

    DataSet *dataset
      = new DataSet( dm.GetH5File()->createDataSet( sPathName, *pdt, *pds, pl ) );
    dataset->close();
    pds->close();
    pdt->close();
    delete pds;
    delete pdt;
    delete dataset;
  }
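
As a follow-up, a hedged sketch of how one might append a record to the
extendible dataset created above. AppendRecord is a hypothetical helper,
not part of the poster's code; pdt and sPathName are the names from the
snippet above.

  #include <string>
  #include "H5Cpp.h"
  using namespace H5;

  // Hypothetical helper: append one packed record to the 1-D extendible
  // dataset created above. pRecord points at a single packed record.
  void AppendRecord( H5File *pFile, CompType *pdt,
                     const std::string &sPathName, const void *pRecord ) {
      DataSet ds = pFile->openDataSet( sPathName );

      DataSpace fspace = ds.getSpace();
      hsize_t curSize;
      fspace.getSimpleExtentDims( &curSize );   // current record count

      hsize_t newSize = curSize + 1;
      ds.extend( &newSize );                    // grow the unlimited dimension

      fspace = ds.getSpace();                   // re-fetch the extent after extend
      hsize_t count = 1;
      fspace.selectHyperslab( H5S_SELECT_SET, &count, &curSize );

      DataSpace mspace( 1, &count );
      ds.write( pRecord, *pdt, mspace, fspace ); // write only the new record
  }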



> How big is your data array? What storage layout are you using: chunked or
> contiguous?

My array size is quite large. For larger runs of the code
it may go up to 2 GB.
For smaller runs it is only a few bytes to a few kilobytes.

Since the size of the array changes depending on the input file,
I needed an extendible dataset. So chunking had to be enabled, since
it is mandatory for extendible datasets.

> The overhead is most likely caused either by the size of your array or by
> the storage layout you used.
>
> If your array is only a few KB, or you are using chunked storage with a
> chunk size that is very small relative to your data, the file-size
> overhead can be very large.
>
> Kent

I think you are right. The chunk size was quite small previously.
I increased it to a value that depends on the input file, and the
size of the output file has gone down dramatically.

Thanks for pointing this out. I wasn't aware that chunking could cause
so much overhead.

Thanks and Regards,
Nikhil
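
For reference, a minimal sketch of the kind of sizing heuristic Nikhil
describes. The helper name and the 64K-element cap are assumptions for
illustration, not his actual code.

  #include <algorithm>
  #include "H5Cpp.h"

  // Hypothetical helper: derive a 1-D chunk size from the expected array
  // length, capped so individual chunks stay a manageable size.
  hsize_t ChooseChunkSize( hsize_t expectedLen ) {
      const hsize_t maxChunk = 65536;           // assumed cap, not from the thread
      return std::max<hsize_t>( 1, std::min( expectedLen, maxChunk ) );
  }

Chunks comparable in size to the data, rather than a few elements each, keep
the number of chunk-index entries, and hence the metadata overhead, small.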


Hi Nikhil,

Just to cool you down a little bit: in my case the overhead for writing HDF5
files is only a few kilobytes, and I have written files up to 120 GB. I assume
you are missing something.

HTH

-- dimitris



----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.