Why changing the format had an adverse effect

Hi everyone,

I store intraday market data for financial instruments in HDF5 files.
Recently I decided to change the format, thinking that the previous
format wasted space. The new format should take less space, but the
resulting file is twice as big. The layouts are as follows.

Old Format
---------------

SECURITY1
    - QUOTES
    - TRADES

In the old format, each group is named after the security. Each group
consists of two datasets, QUOTES and TRADES. QUOTES is a 2D array where
the number of columns is 4 * (depth of order book) + 1 for the timestamp.
So for Australia that is 100 columns, since we have 20 levels of market data.

The dataset looks as follows:

Timestamp Bid0 Ask0 Bidsize0 Asksize0 Bid1 Ask1 Bidsize1 Asksize1 Bid2
Ask2 Bidsize2 Asksize2 Bid3 Ask3 Bidsize3 Asksize3 .... etc
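
Roughly, the write path looks like this (a minimal sketch with h5py, not
my actual code; the file name, security name and values are made up):

    import h5py
    import numpy as np

    DEPTH = 20                # levels of market data
    NCOLS = 4 * DEPTH + 1     # bid/ask/bidsize/asksize per level, plus timestamp

    with h5py.File("market.h5", "a") as f:
        if "SECURITY1/QUOTES" in f:
            quotes = f["SECURITY1/QUOTES"]
        else:
            quotes = f.create_dataset("SECURITY1/QUOTES",
                                      shape=(0, NCOLS),
                                      maxshape=(None, NCOLS),
                                      dtype="f8")
        # every update appends a full snapshot row,
        # even if only a single field changed
        row = np.zeros(NCOLS)
        quotes.resize(quotes.shape[0] + 1, axis=0)
        quotes[-1, :] = row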

With this layout, even if any single value changed, I wrote one whole
new row. I thought this was a very inefficient way to store updates, so
I changed the format as follows:

New Format
------------------

SECURITY1
    - LEVEL1
    - LEVEL2
    - LEVEL3
    - LEVEL4
    - LEVEL5
    .
    .
    .
    - LEVEL20
    - TRADES

Now, depending on which level is updated, I append a row to the
corresponding dataset only. This means I write much less data than with
the previous format.
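
The new write path is roughly this (again a hypothetical h5py sketch;
the compound record layout here is only an illustration, the real
records have 7 fields per level):

    import h5py
    import numpy as np

    # one record per book-level update (illustrative field layout)
    level_dtype = np.dtype([("timestamp", "<i8"),
                            ("bid", "<f8"), ("ask", "<f8"),
                            ("bidsize", "<i8"), ("asksize", "<i8")])

    def append_update(f, security, level, record):
        name = "%s/LEVEL%d" % (security, level)
        if name in f:
            ds = f[name]
        else:
            ds = f.create_dataset(name, shape=(0,), maxshape=(None,),
                                  dtype=level_dtype)
        ds.resize(ds.shape[0] + 1, axis=0)
        # only the touched level gets a new, much narrower row
        ds[-1:] = np.array([record], dtype=level_dtype)

    with h5py.File("market.h5", "a") as f:
        append_update(f, "SECURITY1", 3,
                      (1346025600000, 101.5, 101.6, 2000, 1500))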

But the output HDF5 file is twice the size with the new format.

Both files use the same compression settings, applied with h5repack:

    h5repack.exe -f GZIP=5 -l CHUNK=2056x1 sourcefile targetfile

I am not sure why the file would be larger when I write less data now.
Is it because I have a large number of datasets? What am I missing here?

Any suggestions?

Regards,

Alok Jadhav
CREDIT SUISSE AG
GAT IT Hong Kong, KVAG 67
International Commerce Centre | Hong Kong | Hong Kong
Phone +852 2101 6274 | Mobile +852 9169 7172
alok.jadhav@credit-suisse.com | www.credit-suisse.com


Simplifying the question further:

Format 1 has fewer rows but more columns,

e.g. 8,308 rows * 122 columns = 1,013,576 elements.

Format 2 has more rows, spread across more datasets with fewer columns,

e.g. 17,448 rows (across all levels) * 7 columns (per level) = 122,136 elements.

1,013,576 > 122,136.
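
In raw bytes, assuming 8 bytes per element for simplicity (the format 2
records actually mix doubles and ints):

    format 1: 1,013,576 elements * 8 B ~ 8.1 MB of raw data
    format 2:   122,136 elements * 8 B ~ 1.0 MB of raw data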

Format 1 has more elements, but is it possible that format 1 still takes
less space than format 2 because it has fewer datasets?

How does this work?

Regards,


Hi,

Could someone comment on this? I am still not sure why the new format,
with fewer elements, takes so much more storage space. One more
observation:

Format 1 has around 300 groups, each with 2 datasets -> 600 datasets in total.
Format 2 has around 200 groups, each with 11 datasets -> 2,200 datasets in total.

In format 1, each dataset is a plain double array, whereas in format 2
each dataset is a compound type (doubles and ints mixed).

What is the overhead of a compound dtype versus a double array? Can
having 2,200 datasets instead of 600 double the size of the HDF5 file? I
am basically converting horizontal data into vertical data with more
datasets.
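
If it helps, this is how I could measure the split myself (a rough h5py
sketch; Dataset.id.get_storage_size() returns the bytes allocated for a
dataset's raw data, so the rest of the file is metadata and unused space):

    import os
    import h5py

    def storage_report(path):
        raw = 0
        ndatasets = 0
        def visit(name, obj):
            nonlocal raw, ndatasets
            if isinstance(obj, h5py.Dataset):
                ndatasets += 1
                raw += obj.id.get_storage_size()  # allocated raw-data bytes
        with h5py.File(path, "r") as f:
            f.visititems(visit)
        total = os.path.getsize(path)
        print("%s: %d datasets, raw %d B, file %d B, overhead %d B"
              % (path, ndatasets, raw, total, total - raw))

    storage_report("format1.h5")
    storage_report("format2.h5")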

Regards,
Alok


Hi Alok,

Please try to run the h5stat tool (http://www.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Stat) to see how space is allocated in the file for raw data and HDF5 metadata.
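
For example:

    h5stat yourfile.h5

The file space summary at the end of the output shows the "File
metadata" and "Raw data" totals separately, which should tell you where
the extra space is going.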

Elena


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal The HDF Group http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Thanks Elena.

I still have some more questions. I am trying to optimize my datasets
for faster access. My observation is that the time to read all the data
increases roughly linearly with the number of datasets. Say one file has
only 1 dataset in each group, and another has 10 datasets in each group.
I wrote a program that reads just 1 dataset from each group in both
files. The file with only 1 dataset per group takes far less time; the
file with 10 datasets per group is 2-3 times slower (3 seconds vs 9
seconds), even though both runs read the same amount of data (the
dataset I read from both files contains the same data).
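
The benchmark is essentially this (an h5py sketch of it; the group and
dataset names follow the layout described below):

    import time
    import h5py

    def time_reads(path, dataset_name):
        t0 = time.time()
        with h5py.File(path, "r") as f:
            for group in f:                          # one group per security
                data = f[group][dataset_name][...]   # read the whole dataset
        return time.time() - t0

    print(time_reads("one_dataset_per_group.h5", "LEVEL1"))
    print(time_reads("ten_datasets_per_group.h5", "LEVEL1"))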

I have repacked the files so that the chunk size equals the dimensions
of each dataset.

This is not what I expected. Since the structure of an HDF5 file is
similar to a Unix file system, the number of datasets should not affect
access time as long as you read the same amount of data; datasets are
accessed through pointers. What am I doing wrong? How can I maximize my
read performance?

My data looks as follows:

Each group has 5 levels of market data.

Different formats I have tried:

File 1: the file has only level 1 in each group. (Fastest, but only
provides level 1.)

File 2: the file has 5 datasets within each group, one per level. (This
is the one that scales linearly: it is slower than file 1, and even if I
read only level 1 it is still very slow compared to file 1.)

File 3: the file has all 5 levels in 1 dataset (spread horizontally).

My reading access pattern: I have to read either only level1 or all 5
levels together.

I am thinking of using 2 different files, one with just the level-1
datasets and another with all the datasets. This feels quite
inefficient; I would like to keep all the data in a single file. Do you
have any suggestions?

Alok Jadhav

GAT IT

