Writing a large number of datasets


We wrote an application to store tick data in HDF5 files.

We have 8 dumpcap (Wireshark) instances capturing multicast tick data, each
storing one pcap file per 5-minute interval.

We wrote a C program that reads a pcap file and stores the data in an HDF5
file. We create one HDF5 file per day.
We create three datasets per instrument using the packet table API.

We are using the following compound datatypes for these tables.

typedef struct tick_t {
  int64_t bid_trade_value;
  int64_t ask_value;
  int32_t cap_sec;
  int32_t cap_usec;
  int32_t exg_sec;
  uint32_t exg_seq_id;
  uint32_t tfp_seq_id;
  uint32_t lrt_id;
  uint32_t bid_trade_size;
  uint32_t ask_size;
  uint32_t flags2;
  uint16_t flags;
  uint8_t bid_trade_base;
  uint8_t ask_base;
  uint8_t bid_trade_exg;
  uint8_t ask_exg;
} Tick;

typedef struct lrt_t {
  hvl_t lrt;
} Lrt;

typedef struct index_t {
  int64_t high_value;
  int64_t low_value;
  int64_t open_value;
  int64_t close_value;
  int32_t minute;
  uint32_t tick_start;
  uint32_t tick_end;
  uint8_t high_base;
  uint8_t low_base;
  uint8_t open_base;
  uint8_t close_base;
} Index;
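
For reference when thinking about chunk sizes, here is a small sketch that prints the in-memory size of the two fixed-length records and the number of bytes one chunk would hold for 100 and 128 records (the two chunk sizes that appear in this post). The field sizes add up to 58 bytes for Tick and 48 bytes for Index; the compiler may pad Tick beyond that, so the program reports whatever sizeof gives on the build machine. The helper names chunk_bytes and print_record_sizes are made up for illustration, and Lrt is left out because hvl_t needs the HDF5 headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct tick_t {
  int64_t bid_trade_value;
  int64_t ask_value;
  int32_t cap_sec;
  int32_t cap_usec;
  int32_t exg_sec;
  uint32_t exg_seq_id;
  uint32_t tfp_seq_id;
  uint32_t lrt_id;
  uint32_t bid_trade_size;
  uint32_t ask_size;
  uint32_t flags2;
  uint16_t flags;
  uint8_t bid_trade_base;
  uint8_t ask_base;
  uint8_t bid_trade_exg;
  uint8_t ask_exg;
} Tick;

typedef struct index_t {
  int64_t high_value;
  int64_t low_value;
  int64_t open_value;
  int64_t close_value;
  int32_t minute;
  uint32_t tick_start;
  uint32_t tick_end;
  uint8_t high_base;
  uint8_t low_base;
  uint8_t open_base;
  uint8_t close_base;
} Index;

/* Bytes held by one chunk of `records` records of `record_size` bytes each. */
static size_t chunk_bytes(size_t record_size, size_t records)
{
  assert(record_size > 0 && records > 0);
  return record_size * records;
}

/* Print record sizes and chunk footprints for chunk sizes 100 and 128. */
static void print_record_sizes(void)
{
  const size_t chunk_sizes[] = { 100, 128 };

  printf("sizeof(Tick)  = %zu bytes\n", sizeof(Tick));
  printf("sizeof(Index) = %zu bytes\n", sizeof(Index));
  for (size_t i = 0; i < sizeof chunk_sizes / sizeof chunk_sizes[0]; i++) {
    printf("chunk of %zu records: Tick %zu bytes, Index %zu bytes\n",
           chunk_sizes[i],
           chunk_bytes(sizeof(Tick), chunk_sizes[i]),
           chunk_bytes(sizeof(Index), chunk_sizes[i]));
  }
}
```

Calling print_record_sizes() from main is enough to see the numbers. These are in-memory sizes only; the size of a record in the file depends on how the committed "tick_type" datatype was defined, which is not shown here.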

The resulting HDF5 file stats are as follows.


We are running 8 instances of this C program, each of which opens the HDF5
file and writes the data.
The maximum size of a pcap file is 300 MB.

The problem is that processing a 5-minute pcap file and storing the data in
HDF5 takes more than 5 minutes (sometimes 30 minutes).

From timing the functions, I see that the bottleneck is the HDF5 file
writing. For each instrument I am creating a group and three datasets in
it. My file-writing code is as follows.

  hid_t group_id;

  /* Create the per-instrument group on first use, otherwise open it. */
  if (!H5Lexists(file_id, symbol, H5P_DEFAULT))
    group_id = H5Gcreate(file_id, symbol, H5P_DEFAULT, H5P_DEFAULT,
                         H5P_DEFAULT);
  else
    group_id = H5Gopen(file_id, symbol, H5P_DEFAULT);

  /* Create the "ticks" packet table on first use, otherwise open it,
     then append the buffered ticks and close the table again. */
  if (!H5Lexists(group_id, "ticks", H5P_DEFAULT)) {
    hid_t tick_type = H5Topen(file_id, "tick_type", H5P_DEFAULT);
    hid_t ptable = H5PTcreate_fl(group_id, "ticks", tick_type, 128, -1);
    herr_t err = H5PTappend(ptable, tick_len, tick_buf);
    err = H5PTclose(ptable);
  } else {
    hid_t ptable = H5PTopen(group_id, "ticks");
    herr_t err = H5PTappend(ptable, tick_len, tick_buf);
    err = H5PTclose(ptable);
  }
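
One variation I am considering is to keep each packet table open for the whole run instead of opening and closing it on every append. Below is a minimal sketch of such a handle cache in plain C, with no HDF5 calls in it. Everything named here (TableCache, cache_get, cache_close_all, the open/close hooks, MAX_TABLES, MAX_SYMBOL) is hypothetical: the open hook stands in for the H5Lexists / H5PTcreate_fl / H5PTopen logic above, the close hook for H5PTclose, and `long` stands in for hid_t. I have not measured whether this actually helps.

```c
#include <assert.h>
#include <string.h>

#define MAX_TABLES 1024  /* assumed upper bound on instruments, illustration only */
#define MAX_SYMBOL 32    /* assumed maximum symbol length, including terminator */

/* Hypothetical hooks: in the real program these would wrap the packet table
   open/create logic and H5PTclose respectively. */
typedef long (*open_table_fn)(const char *symbol);
typedef void (*close_table_fn)(long table);

typedef struct {
  char symbol[MAX_SYMBOL];
  long table;            /* stand-in for the hid_t of an open packet table */
} TableEntry;

typedef struct {
  TableEntry    entries[MAX_TABLES];
  size_t        count;
  open_table_fn open_table;
} TableCache;

static void cache_init(TableCache *cache, open_table_fn open_table)
{
  assert(cache != NULL && open_table != NULL);
  cache->count = 0;
  cache->open_table = open_table;
}

/* Return the handle for `symbol`, calling the open hook only the first time
   the symbol is seen. Returns -1 if the cache is full or the symbol is too
   long. Lookup is a linear scan; a hash table would suit many symbols better. */
static long cache_get(TableCache *cache, const char *symbol)
{
  for (size_t i = 0; i < cache->count; i++)
    if (strcmp(cache->entries[i].symbol, symbol) == 0)
      return cache->entries[i].table;

  if (cache->count == MAX_TABLES || strlen(symbol) >= MAX_SYMBOL)
    return -1;

  TableEntry *entry = &cache->entries[cache->count++];
  strcpy(entry->symbol, symbol);
  entry->table = cache->open_table(symbol);
  return entry->table;
}

/* Close every cached handle once, at the end of the run. */
static void cache_close_all(TableCache *cache, close_table_fn close_table)
{
  for (size_t i = 0; i < cache->count; i++)
    close_table(cache->entries[i].table);
  cache->count = 0;
}

/* Demo stand-ins so the sketch can be exercised without HDF5. */
static int demo_open_calls = 0;
static int demo_close_calls = 0;
static long demo_open_table(const char *symbol) { (void)symbol; return demo_open_calls++; }
static void demo_close_table(long table) { (void)table; demo_close_calls++; }
```

With this in place, the write path would call cache_get(&cache, symbol) and append to the returned handle, and cache_close_all would run once after the pcap file has been processed.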


I need to achieve this within 5 minutes.

1. How can I make HDF5 writing faster?

2. Is a large number of datasets a problem?

3. I am using a chunk size of 100. Can anybody suggest a more appropriate
size given the stats above?

4. If I use the low-level dataset API instead of the packet table API, will
I get better write performance?


View this message in context: http://hdf-forum.184993.n3.nabble.com/writing-large-number-of-data-sets-tp4025530.html
Sent from the hdf-forum mailing list archive at Nabble.com.

I am also seeing my processes in the D state (uninterruptible sleep) for long
periods in the top utility. I am using SLES 11 SP1.

