Compression in variable length datasets not working

Hi,

A PyTables user is reporting that compression does not have an appreciable
effect on datasets with variable-length data types. I can see that the filters
on the 'see' dataset are being correctly applied in the HDF5 file:

Opened "/tmp/vldata.h5" with sec2 driver.
see Dataset {50/Inf}
    Attribute: CLASS scalar
        Type: 8-byte null-terminated ASCII string
        Data: "VLARRAY"
    Attribute: VERSION scalar
        Type: 4-byte null-terminated ASCII string
        Data: "1.3"
    Attribute: TITLE scalar
        Type: 1-byte null-terminated ASCII string
        Data: ""
    Location: 0:1:0:976
    Links: 1
    Modified: 2009-05-15 17:45:39 CEST
    Chunks: {2048} 32768 bytes
    Storage: 800 logical bytes, 220 allocated bytes, 363.64% utilization
    Filter-0: shuffle-2 OPT {16}
    Filter-1: deflate-1 OPT {1}
    Type: variable length of
                   native int

and h5ls is reporting a 3.6x compression ratio, but the size of the file on
disk is exactly the same as if no compression filter were used. I'm a bit lost
here :-/ I'm attaching a sample file with the above 'see' dataset for your
inspection. I'd be grateful if anybody could tell me what's happening.

Thanks,

vldata.h5 (206 KB)

···

--
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dijkstra

On Friday 15 May 2009 17:54:43, Francesc Alted wrote:

Oops, I forgot to add that the dataset should be *highly* compressible (each
entry is formed by the 0, 1, 2, ..., 999 sequence).
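
For reference, here is a minimal sketch of how such a file could be produced
(my reconstruction, not the user's actual script; the names follow the
current PyTables API):

import tables as tb

# Reconstruction sketch: a VLArray of ints with shuffle + zlib filters,
# mirroring the 'see' dataset in the h5ls dump above.
filters = tb.Filters(complevel=1, complib='zlib', shuffle=True)
with tb.open_file('/tmp/vldata.h5', mode='w') as f:
    vl = f.create_vlarray(f.root, 'see', atom=tb.Int32Atom(),
                          title='', filters=filters)
    for _ in range(50):               # 50 rows, as h5ls reports
        vl.append(list(range(1000)))  # each row is 0, 1, ..., 999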

···

--
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dykstra

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

For VL data, the dataset itself contains a struct which points to the actual data, which is stored elsewhere. When you apply compression, the "pointer" structs are compressed, but the data itself is not affected.
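
To put rough numbers on that using the h5ls dump above (a sketch; the
16-byte reference size is inferred from the shuffle filter's element size,
"shuffle-2 OPT {16}"):

n_rows = 50
ref_size = 16                # bytes per VL "pointer" struct in the chunks
logical = n_rows * ref_size  # 800 bytes -- all the filters ever see
allocated = 220              # compressed size reported by h5ls
heap = n_rows * 1000 * 4     # 200,000 bytes of native ints in the global
                             # heap, written uncompressed
print(logical, allocated, heap)  # 800 220 200000 -> the ~206 KB file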

George Lewandowski
(314)777-7890
Mail Code S270-2204
Building 270-E Level 2E Room 20E
P-8A

···


On Friday 15 May 2009 19:17:50, Lewandowski, George wrote:

For VL data, the dataset itself contains a struct which points to the
actual data, which is stored elsewhere. When you apply compression, the
"pointer" structs are compressed, but the data itself is not affected.

Oh, I see. So I should expect compression gains only for ragged arrays whose
rows have few elements, where the VL references dominate the storage. Nice to
know. Thanks!
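
To put rough numbers on when the gains matter (a sketch, assuming the
16-byte reference size seen above and 4-byte native ints):

ref_size, itemsize = 16, 4
for n_items in (1, 10, 1000):
    row_bytes = ref_size + n_items * itemsize  # per-row cost on disk
    fraction = ref_size / row_bytes            # share the filters can touch
    print(n_items, format(fraction, '.0%'))    # 1 -> 80%, 10 -> 29%, 1000 -> 0%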

···

--
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'. In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dykstra

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Yes, George is correct. The VL data is stored in a "global heap" in the file, which is not compressed. Someday, we would like to switch the storage for VL data in datasets (and probably attributes) to use the new "fractal heap" code. We'll also probably use one fractal heap per dataset, instead of sharing the VL data for all datasets in one centralized location.
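
In the meantime, a common workaround (a sketch, not something the library
does for you) is to store the ragged data as one flat, chunked array plus a
row-offsets index, so the values themselves pass through the filters:

import tables as tb

# Workaround sketch: flatten the ragged rows into a compressible EArray
# and keep a separate offsets index to recover the row boundaries.
filters = tb.Filters(complevel=1, complib='zlib', shuffle=True)
rows = [list(range(1000)) for _ in range(50)]
with tb.open_file('/tmp/vldata_flat.h5', mode='w') as f:
    data = f.create_earray(f.root, 'data', atom=tb.Int32Atom(),
                           shape=(0,), filters=filters)
    offsets = [0]
    for row in rows:
        data.append(row)
        offsets.append(offsets[-1] + len(row))
    f.create_array(f.root, 'offsets', offsets)

Row i is then data[offsets[i]:offsets[i+1]], and with this layout the payload
compresses as expected.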

  Quincey

···


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.