Variable Length Strings Are Not Compressible but Something Is Compressed in File


It is well documented that variable length string datasets cannot be compressed. That is clear and understandable.

But if one chunks and compresses such a dataset (especially a large one), something is getting compressed. The resultant file size is considerably smaller than a file where the dataset is not chunked and compressed.

So, what is actually getting compressed?


As far as I know, variable length strings are stored in the global heap, which is different from how chunks are stored.


Hi, I should be more specific. I am referring to file size on disk.

Real world example I just produced:
I have a dataset of 1,000,000 variable length strings. I write it all at one time without specifying chunking or compression and immediately close the file. The file size on disk is 42,680,320 bytes.

If I apply chunking and compression to it while writing the dataset, the file size on disk is 26,750,976 bytes.

I do know that the strings themselves are not compressed. So what is being compressed? What else could explain the reduced file size?


The dataset chunks do not hold the string bytes themselves. Each element in a chunk is a small length/pointer record that looks in memory like this:

typedef struct {
    size_t len; /* Length of VL data (in base type units) */
    void *p;    /* Pointer to VL data */
} hvl_t;

These records are what gets compressed. The length/address pairs might show sufficient regularity to compress to a noticeable degree.
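
As a rough check on this explanation, here is a minimal sketch that encodes one such record the way it is assumed to be laid out inside a chunk on disk and counts how many of its bytes are zero. The layout used (4-byte length, 8-byte global heap address, 4-byte heap object index, so 16 bytes per string with 8-byte file offsets) is an assumption here, not something stated in this thread. The helper names encode_vl_ref, count_zero_bytes and bytes_saved_per_element are made up for illustration and are not part of the HDF5 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed on-disk size of one variable-length element inside a chunk:
 * 4-byte length + 8-byte heap address + 4-byte heap object index. */
enum { VL_REF_SIZE = 16 };

/* Store the low nbytes of value at out[0..nbytes-1], little-endian. */
void put_le(unsigned char *out, uint64_t value, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        out[i] = (unsigned char)(value >> (8 * i));
}

/* Encode one hypothetical length/address/index record into out. */
void encode_vl_ref(unsigned char out[VL_REF_SIZE], uint32_t len,
                   uint64_t heap_addr, uint32_t heap_index)
{
    assert(out != NULL);
    put_le(out, len, 4);
    put_le(out + 4, heap_addr, 8);
    put_le(out + 12, heap_index, 4);
}

/* Count zero bytes, a crude proxy for how redundant the buffer is. */
size_t count_zero_bytes(const unsigned char *buf, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] == 0)
            zeros++;
    return zeros;
}

/* Average number of bytes saved per dataset element. */
double bytes_saved_per_element(uint64_t size_plain, uint64_t size_compressed,
                               uint64_t n_elements)
{
    return (double)(size_plain - size_compressed) / (double)n_elements;
}
```

For a 20-byte string at heap address 0x10000 with heap index 7, encode_vl_ref produces 13 zero bytes out of 16, and neighbouring records differ in only a few bytes. That is the kind of redundancy a general-purpose compression filter handles well. With the file sizes quoted above, bytes_saved_per_element(42680320, 26750976, 1000000) comes to about 15.93 bytes per string, which would be consistent with roughly 16 bytes of record per element shrinking to almost nothing while the string data in the heap stays uncompressed.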


Thanks Gerd,

I find that to be a useful bit of insight and the code snippet helps make it even clearer.