Variable-Length Strings Are Not Compressible, but Something Is Compressed in the File


#1

It is well documented that variable-length string datasets cannot be compressed. That is clear and understandable.

But if one chunks and compresses such a dataset (especially a large one), something is evidently getting compressed: the resulting file is considerably smaller than one in which the dataset is neither chunked nor compressed.

So, what is actually getting compressed?


#2

As far as I know, variable-length strings are stored on the global heap, which is different from how chunks are stored.
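
If I read the HDF5 file format specification correctly, what sits in a dataset element on disk is not the string bytes themselves but a small fixed-size reference into the global heap. Something roughly like this (my own illustrative sketch, not a normative layout; field widths assume 8-byte file offsets):

#include <stdint.h>

/* Illustrative only: a variable-length element inside a chunk on disk
 * is a length plus a reference into the global heap, not the string
 * data itself. */
typedef struct {
    uint32_t length;          /* number of base-type elements */
    uint64_t heap_collection; /* file address of the global heap collection */
    uint32_t heap_index;      /* object index within that collection */
} vl_disk_ref_t;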


#3

Hi, I should be more specific. I am referring to file size on disk.

A real-world example I just produced:
I have a dataset of 1,000,000 variable-length strings. I write it all at once without specifying chunking or compression and immediately close the file. The file size on disk is 42,680,320 bytes.

If I apply chunking and compression to it while writing the dataset, the file size on disk is 26,750,976 bytes.

I do know that the strings themselves are not compressed. So what is being compressed? What could be another explanation of the reduced file size?
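
For reference, here is a minimal sketch of how I set up the experiment with the HDF5 C API (reconstructed for this post; the chunk size of 10,000 and gzip level 6 are arbitrary choices for illustration):

#include <hdf5.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static void write_file(const char *name, int compress)
{
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Variable-length string datatype */
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, H5T_VARIABLE);

    hsize_t dims[1] = { N };
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    if (compress) {
        hsize_t chunk[1] = { 10000 };  /* arbitrary chunk size */
        H5Pset_chunk(dcpl, 1, chunk);
        H5Pset_deflate(dcpl, 6);       /* gzip level 6 */
    }

    hid_t dset = H5Dcreate2(file, "strings", strtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Build N heap-allocated strings of varying length. */
    char **buf = malloc(N * sizeof(char *));
    for (hsize_t i = 0; i < N; i++) {
        buf[i] = malloc(32);
        snprintf(buf[i], 32, "string number %llu", (unsigned long long)i);
    }

    H5Dwrite(dset, strtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    for (hsize_t i = 0; i < N; i++)
        free(buf[i]);
    free(buf);
    H5Pclose(dcpl);
    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(strtype);
    H5Fclose(file);
}

int main(void)
{
    write_file("plain.h5", 0);
    write_file("compressed.h5", 1);
    return 0;  /* compare the two files with ls -l or h5stat */
}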


#4

The dataset chunks consist of elements that, in memory, look like this:

typedef struct {
    size_t len; /* Length of VL data (in base type units) */
    void *p;    /* Pointer to VL data */
} hvl_t;

Those elements are what gets compressed. The length/address pairs may show sufficient regularity for the filter to achieve a certain degree of compression.
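
To make the regularity argument concrete, here is a self-contained demonstration I put together (not the HDF5 code path itself, just an analogy): deflate a synthetic array of length/reference records of the kind a chunk of variable-length elements would contain, using zlib, which is also what HDF5's gzip filter is built on. Consecutive records differ in only a few bytes, so they compress well.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

typedef struct {
    unsigned int len;       /* element length, like hvl_t.len */
    unsigned long long ref; /* stand-in for a heap address/index pair */
} vl_ref_t;

int main(void)
{
    enum { N = 100000 };
    vl_ref_t *recs = malloc(N * sizeof *recs);

    /* Consecutive elements: similar lengths, monotonically increasing
     * references -- highly regular byte patterns. */
    for (unsigned int i = 0; i < N; i++) {
        recs[i].len = 15 + (i % 7);
        recs[i].ref = 0x10000ULL + i;
    }

    uLong src_len = N * sizeof *recs;
    uLong dst_len = compressBound(src_len);
    Bytef *dst = malloc(dst_len);

    if (compress(dst, &dst_len, (const Bytef *)recs, src_len) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("raw: %lu bytes, deflated: %lu bytes (%.1f%%)\n",
           src_len, dst_len, 100.0 * dst_len / src_len);

    free(dst);
    free(recs);
    return 0;
}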


#5

Thanks Gerd,

I find that to be a useful bit of insight, and the code snippet makes it even clearer.