Hi,
We wrote an application to store tick data in HDF5 files. We have 8 dumpcap
(Wireshark) instances capturing multicast tick data, each rolling over to a
new pcap file every 5 minutes. We wrote a C program which reads a pcap file
and stores the data in an HDF5 file; we create one HDF5 file per day. For
each instrument we create three datasets using the packet table API, with
the following compound datatypes (a sketch of how the committed tick type
is set up follows the struct definitions):
typedef struct tick_t {
    int64_t  bid_trade_value;
    int64_t  ask_value;
    int32_t  cap_sec;
    int32_t  cap_usec;
    int32_t  exg_sec;
    uint32_t exg_seq_id;
    uint32_t tfp_seq_id;
    uint32_t lrt_id;
    uint32_t bid_trade_size;
    uint32_t ask_size;
    uint32_t flags2;
    uint16_t flags;
    uint8_t  bid_trade_base;
    uint8_t  ask_base;
    uint8_t  bid_trade_exg;
    uint8_t  ask_exg;
} Tick;

typedef struct lrt_t {
    hvl_t lrt;
} Lrt;

typedef struct index_t {
    int64_t  high_value;
    int64_t  low_value;
    int64_t  open_value;
    int64_t  close_value;
    int32_t  minute;
    uint32_t tick_start;
    uint32_t tick_end;
    uint8_t  high_base;
    uint8_t  low_base;
    uint8_t  open_base;
    uint8_t  close_base;
} Index;
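For completeness, the committed "tick_type" that the write code opens below
is created once per file, roughly like this (a sketch; only a few of the
H5Tinsert calls are shown, the remaining fields follow the same pattern):

/* Build and commit the compound type matching Tick. */
hid_t tick_type = H5Tcreate(H5T_COMPOUND, sizeof(Tick));
H5Tinsert(tick_type, "bid_trade_value",
          HOFFSET(Tick, bid_trade_value), H5T_NATIVE_INT64);
H5Tinsert(tick_type, "ask_value",
          HOFFSET(Tick, ask_value), H5T_NATIVE_INT64);
H5Tinsert(tick_type, "cap_sec",
          HOFFSET(Tick, cap_sec), H5T_NATIVE_INT32);
/* ... remaining fields ... */
H5Tcommit(file_id, "tick_type", tick_type,
          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Tclose(tick_type);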
The resulting HDF5 file stats are as follows:
fut_mc1_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/fut_mc1_20121011.txt>
fut_mc2_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/fut_mc2_20121011.txt>
fut_mc3_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/fut_mc3_20121011.txt>
onl_mc2_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/onl_mc2_20121011.txt>
onl_mc3_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/onl_mc3_20121011.txt>
onl_mc4_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/onl_mc4_20121011.txt>
onl_mc5_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/onl_mc5_20121011.txt>
onl_mc6_20121011.txt
<http://hdf-forum.184993.n3.nabble.com/file/n4025530/onl_mc6_20121011.txt>
We run 8 instances of this C program, each of which opens an HDF5 file and
writes the data. The maximum size of a pcap file is 300 MB.

The problem is that processing a 5-minute pcap file and storing the data in
HDF5 takes more than 5 minutes (sometimes 30 minutes). From timing the
functions, I can see that the bottleneck is the HDF5 file writing. For each
instrument I create a group and three datasets in it. My file-writing code
is as follows:
hid_t group_id;

/* Open the per-instrument group, creating it the first time the
   symbol is seen. */
if (!H5Lexists(file_id, symbol, H5P_DEFAULT))
{
    group_id = H5Gcreate(file_id, symbol, H5P_DEFAULT, H5P_DEFAULT,
                         H5P_DEFAULT);
}
else
{
    group_id = H5Gopen(file_id, symbol, H5P_DEFAULT);
}

if (!H5Lexists(group_id, "ticks", H5P_DEFAULT))
{
    /* First batch for this symbol: create the packet table from the
       committed compound type, then append. */
    hid_t tick_type = H5Topen(file_id, "tick_type", H5P_DEFAULT);
    hid_t ptable = H5PTcreate_fl(group_id, "ticks", tick_type, 128, -1);
    herr_t err = H5PTappend(ptable, tick_len, tick_buf);
    err = H5PTclose(ptable);
    H5Tclose(tick_type);  /* close the type handle so it does not leak */
}
else
{
    hid_t ptable = H5PTopen(group_id, "ticks");
    herr_t err = H5PTappend(ptable, tick_len, tick_buf);
    err = H5PTclose(ptable);
}
H5Gclose(group_id);
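
Note that this whole sequence runs for every batch of appends, so each batch
also pays for H5Lexists, H5Gopen/H5Gcreate, H5PTopen/H5PTcreate_fl, and
H5PTclose. One restructuring we could try (a sketch only; cache_lookup and
cache_insert are hypothetical helpers over something like a hash table, not
part of our current code) is to keep each packet table open for the life of
the file:

hid_t ptable = cache_lookup(symbol);   /* hypothetical per-symbol cache */
if (ptable < 0)
{
    hid_t group_id;
    if (!H5Lexists(file_id, symbol, H5P_DEFAULT))
        group_id = H5Gcreate(file_id, symbol, H5P_DEFAULT, H5P_DEFAULT,
                             H5P_DEFAULT);
    else
        group_id = H5Gopen(file_id, symbol, H5P_DEFAULT);

    if (!H5Lexists(group_id, "ticks", H5P_DEFAULT))
    {
        hid_t tick_type = H5Topen(file_id, "tick_type", H5P_DEFAULT);
        ptable = H5PTcreate_fl(group_id, "ticks", tick_type, 128, -1);
        H5Tclose(tick_type);
    }
    else
        ptable = H5PTopen(group_id, "ticks");

    H5Gclose(group_id);                /* the packet table stays usable */
    cache_insert(symbol, ptable);      /* hypothetical */
}
H5PTappend(ptable, tick_len, tick_buf);
/* all cached tables are H5PTclose()d once, at end of day */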
I need to achieve this within 5 minutes.

1. How can I make HDF5 writing faster?
2. Is the large number of datasets a problem?
3. I am using a chunk size of 100. Can anybody suggest a more appropriate
size given the stats above?
4. If I use the low-level dataset API instead of the packet table API, will
I get a write performance improvement? (A sketch of the path I mean is
below.)
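
For reference, the low-level path I have in mind for question 4 is roughly
the following (a sketch only; dset_id is an extendible chunked dataset and
nrecords its current length, and as far as I understand this is essentially
what H5PTappend does internally):

/* Append tick_len records to a 1-D extendible dataset. */
hsize_t start   = nrecords;            /* current length of the dataset */
hsize_t count   = tick_len;
hsize_t newsize = start + count;

H5Dset_extent(dset_id, &newsize);      /* grow the dataset */

hid_t filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);

hid_t memspace = H5Screate_simple(1, &count, NULL);
H5Dwrite(dset_id, tick_type, memspace, filespace, H5P_DEFAULT, tick_buf);

H5Sclose(memspace);
H5Sclose(filespace);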