Hi,
A PyTables user is reporting that compression does not have an appreciable
effect on datasets with variable-length datatypes. I can see that the filters
in dataset 'see' are being correctly applied in the HDF5 file:
Opened "/tmp/vldata.h5" with sec2 driver.
see Dataset {50/Inf}
Attribute: CLASS scalar
Type: 8-byte null-terminated ASCII string
Data: "VLARRAY"
Attribute: VERSION scalar
Type: 4-byte null-terminated ASCII string
Attribute: TITLE scalar
Type: 1-byte null-terminated ASCII string
Data: ""
Location: 0:1:0:976
Links: 1
Modified: 2009-05-15 17:45:39 CEST
Chunks: {2048} 32768 bytes
Storage: 800 logical bytes, 220 allocated bytes, 363.64% utilization
Filter-0: shuffle-2 OPT {16}
Filter-1: deflate-1 OPT {1}
Type: variable length of
native int
and h5ls is reporting a 3.6x compression ratio, but the size of the file on
disk is exactly the same as if no compression filter is used. I'm a bit lost
here :-/ I'm attaching a sample file with the above 'see' dataset for your
inspection. I'd be grateful if anybody can tell me what's happening.
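For reference, a minimal script along the following lines should produce a
file like the attached one (a sketch only; the original script isn't shown in
this thread, and the calls below assume the modern PyTables API):

    import numpy as np
    import tables

    # Shuffle + zlib level 1, matching the Filter-0/Filter-1 lines above.
    filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)

    with tables.open_file('/tmp/vldata.h5', mode='w') as h5f:
        vla = h5f.create_vlarray(h5f.root, 'see',
                                 atom=tables.Int32Atom(), filters=filters)
        for _ in range(50):
            # Each row is the 0..999 sequence, so it should compress very well.
            vla.append(np.arange(1000, dtype=np.int32))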
Thanks,
vldata.h5 (206 KB)
···
Data: "1.3"
--
Francesc Alted
"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."
-- Edsger W. Dijkstra
On Friday 15 May 2009 17:54:43, Francesc Alted wrote:
[...]
Oops, I forgot to add that the dataset should be *highly* compressible (each
entry is formed by the 0, 1, 2, ..., 999 sequence).
···
--
Francesc Alted
"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."
-- Edsger W. Dijkstra
For VL data, the dataset itself contains a struct which points to the actual data, which is stored elsewhere. When you apply compression, the "pointer" structs are compressed, but the data itself is not affected.
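A back-of-the-envelope check against the h5ls numbers above makes this
concrete (the 16-byte per-row record size is an assumption, inferred from
800 logical bytes / 50 rows):

    # All numbers except the 16-byte record size come from the h5ls dump.
    rows = 50                        # Dataset {50/Inf}
    record_size = 16                 # assumed: 800 logical bytes / 50 rows
    ints_per_row = 1000              # each row is the 0..999 sequence
    int_size = 4                     # native int

    print(rows * record_size)              # 800    -> the "800 logical bytes"
    print(rows * ints_per_row * int_size)  # 200000 -> raw ints in the heap

    # Only those 800 bytes of records pass through shuffle/deflate (down to
    # 220 allocated bytes); the ~200 KB of actual ints stay uncompressed,
    # which is why the attached file is still ~206 KB.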
George Lewandowski
(314)777-7890
Mail Code S270-2204
Building 270-E Level 2E Room 20E
P-8A
···
-----Original Message-----
From: Francesc Alted [mailto:faltet@pytables.org]
Sent: Friday, May 15, 2009 11:21 AM
To: hdf-forum@hdfgroup.org
Subject: Re: [hdf-forum] Compression in variable length datasets not working
[...]
On Friday 15 May 2009 19:17:50, Lewandowski, George wrote:
For VL data, the dataset itself contains a struct which points to the
actual data, which is stored elsewhere. When you apply compression, the
"pointer" structs are compressed, but the data itself is not affected.
Oh, I see. So I should expect compression gains only for ragged arrays whose
rows have few elements. Nice to know. Thanks!
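A possible workaround for payload-heavy rows (not discussed further in this
thread; the file and node names below are illustrative) would be to store the
ragged rows as one flat, fixed-type array plus a row-offset index, so that
the payload itself passes through the filters. A sketch with PyTables:

    import numpy as np
    import tables

    filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
    rows = [np.arange(1000, dtype=np.int32) for _ in range(50)]

    with tables.open_file('/tmp/flatdata.h5', mode='w') as h5f:
        # The flat payload is a chunked, filtered EArray, so the ints
        # themselves get shuffled and deflated.
        data = h5f.create_earray(h5f.root, 'data',
                                 atom=tables.Int32Atom(), shape=(0,),
                                 filters=filters)
        # offsets[i]:offsets[i+1] delimits row i inside 'data'.
        offsets = h5f.create_earray(h5f.root, 'offsets',
                                    atom=tables.Int64Atom(), shape=(0,),
                                    filters=filters)
        offsets.append([0])
        total = 0
        for row in rows:
            data.append(row)
            total += len(row)
            offsets.append([total])

Reading row i back is then data[offsets[i]:offsets[i+1]].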
···
--
Francesc Alted
"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."
-- Edsger W. Dijkstra
Yes, George is correct. The VL data is stored in a "global heap" in the file, which is not compressed. Someday, we would like to switch the storage for VL data in datasets (and probably attributes) to use the new "fractal heap" code. We'll also probably use one fractal heap per dataset, instead of sharing the VL data for all datasets in one centralized location.
Quincey
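For what it's worth, the split is easy to observe from Python as well (h5py
is assumed here; it is not mentioned in the thread):

    import os
    import h5py

    with h5py.File('/tmp/vldata.h5', 'r') as f:
        # Storage used by the chunked dataset itself: just the compressed
        # (length, pointer) records, ~220 bytes as h5ls reports.
        print(f['see'].id.get_storage_size())

    # The file as a whole stays ~206 KB, because the VL ints live in the
    # global heap, which bypasses the filter pipeline entirely.
    print(os.path.getsize('/tmp/vldata.h5'))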
···
On May 15, 2009, at 12:33 PM, Francesc Alted wrote:
[...]
----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.