Compression of variable-length datatypes?

Hi,

I am working with data packets of varying length, ranging from 1 up to
several thousand elements, and I have thousands of them.
I am looking for a way to store them in HDF5 with compression.

I know that chunked datasets can be compressed. However, I could not get
compression working for chunked datasets that store variable-length
(vlen) datatypes, at least not with szip. I have not tried zlib so far
because I built HDF5 without zlib support.
Would it work with zlib? Should I try it?
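
To make it concrete, this is roughly the kind of code I have in mind (a
minimal sketch assuming a build with zlib/deflate support; the file name,
dataset name, and sizes are just placeholders):

#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("packets.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* variable-length type whose base type is native float */
    hid_t vltype = H5Tvlen_create(H5T_NATIVE_FLOAT);

    hsize_t dims[1]  = {1000};   /* one element per packet */
    hsize_t chunk[1] = {100};
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* chunked layout plus the deflate (zlib) filter */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);     /* zlib compression level 6 */

    hid_t dset = H5Dcreate2(file, "packets", vltype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(vltype);
    H5Fclose(file);
    return 0;
}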

I also looked at the HDF5 Packet Table API. The documentation for
hid_t H5PTcreate_fl
states:
"The datatype, dtype_id, may specify any datatype, including
variable-length data."

However, it is not clear to me from the documentation which type of
compression is applied. The description of the compression parameter is

Compression level, a value of 0 through 9

which I interpret to mean that zlib compression is required. Is that correct?
Will compression on a packet table work with variable-length datatypes?
Do I need to build HDF5 with zlib support?
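
For the packet table route, this is the kind of call I have in mind (again
only a sketch I have not verified; the names and chunk size are placeholders):

#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hid_t file = H5Fcreate("packets_pt.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t vltype = H5Tvlen_create(H5T_NATIVE_UCHAR);

    /* chunk of 512 records, compression level 6 (0-9, -1 for none) */
    hid_t table = H5PTcreate_fl(file, "packets", vltype, (hsize_t)512, 6);

    /* append one variable-length record */
    unsigned char payload[3] = {1, 2, 3};
    hvl_t rec;
    rec.len = 3;
    rec.p   = payload;
    H5PTappend(table, 1, &rec);

    H5PTclose(table);
    H5Tclose(vltype);
    H5Fclose(file);
    return 0;
}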

Is there a different way to model the data? One option I considered is
creating a separate dataset for every packet. However, I imagine that
having several thousand datasets in a group might degrade access
performance. Is that something to worry about?

Thanks a lot in advance for some hints.

Eryk

--
Witold Eryk Wolski

Heidmark str 5
D-28329 Bremen
tel.: 04215261837

The way HDF5 stores variable-length data, the dataset itself contains an
array of small structs that "point to" the actual data, which is stored
elsewhere. When you specify compression for a VL dataset, the actual data
are stored as-is; only the "pointer" structs are compressed. This would
probably not lead to significant space savings in the file.
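
For what it's worth, the in-memory form of those structs is the hvl_t type
from H5Tpublic.h (a length plus a pointer); on disk the dataset elements are
just small references to where the actual data lives, and only those
references go through the filter pipeline:

typedef struct {
    size_t len;  /* number of base-type elements in the sequence */
    void  *p;    /* pointer to the sequence data */
} hvl_t;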

When I learned this, I gave up on compressing VL data. I would expect it to
follow the same technique as for non-VL data (specify a chunk size and apply
the filter), but I have not tried it myself.
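
If you do try it, one quick way to see whether the filter actually bought you
anything is to compare the allocated storage with the size of the element
records (a rough sketch; the file and dataset names are made up):

#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    hid_t file  = H5Fopen("packets.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset  = H5Dopen2(file, "packets", H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);
    hid_t dtype = H5Dget_type(dset);

    hsize_t npoints = (hsize_t)H5Sget_simple_extent_npoints(space);
    hsize_t logical = npoints * H5Tget_size(dtype); /* element records only, not the heap data */
    hsize_t stored  = H5Dget_storage_size(dset);    /* bytes allocated for the (filtered) chunks */

    printf("elements: %llu, element records: %llu bytes, stored: %llu bytes\n",
           (unsigned long long)npoints,
           (unsigned long long)logical,
           (unsigned long long)stored);

    H5Tclose(dtype);
    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}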

George Lewandowski
(314)777-7890
Mail Code S270-2204
Building 270-E Level 2E Room 20E
P-8A
