Variable-Length Strings Are Not Compressible, but Something Is Compressed in the File


#1

It is well documented that variable-length string datasets cannot be compressed. That is clear and understandable.

But if one chunks and compresses such a dataset (especially a large one), something is evidently getting compressed: the resulting file is considerably smaller than one in which the dataset is neither chunked nor compressed.

So, what is actually getting compressed?


#2

As far as I know, variable-length strings are stored on the global heap, which is different from how chunks are stored.
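
If I read the HDF5 file format specification correctly, what sits in a dataset element on disk is not the string bytes themselves but a small fixed-size reference into the global heap. Something roughly like this (my own illustrative sketch, not a normative layout; field widths assume 8-byte file offsets):

#include <stdint.h>

/* Illustrative only: a variable-length element inside a chunk on disk
 * is a length plus a reference into the global heap, not the string
 * data itself. */
typedef struct {
    uint32_t length;          /* number of base-type elements */
    uint64_t heap_collection; /* file address of the global heap collection */
    uint32_t heap_index;      /* object index within that collection */
} vl_disk_ref_t;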


#3

Hi, I should be more specific. I am referring to file size on disk.

A real-world example I just produced:
I have a dataset of 1,000,000 variable-length strings. I write it all at once without specifying chunking or compression and immediately close the file. The file size on disk is 42,680,320 bytes.

If I apply chunking and compression to it while writing the dataset, the file size on disk is 26,750,976 bytes.

I do know that the strings themselves are not compressed. So what is being compressed? What could be another explanation of the reduced file size?
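
For reference, here is a minimal sketch of how I set up the experiment with the HDF5 C API (reconstructed for this post; the chunk size of 10,000 and gzip level 6 are arbitrary choices for illustration):

#include <hdf5.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static void write_file(const char *name, int compress)
{
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Variable-length string datatype */
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, H5T_VARIABLE);

    hsize_t dims[1] = { N };
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    if (compress) {
        hsize_t chunk[1] = { 10000 };  /* arbitrary chunk size */
        H5Pset_chunk(dcpl, 1, chunk);
        H5Pset_deflate(dcpl, 6);       /* gzip level 6 */
    }

    hid_t dset = H5Dcreate2(file, "strings", strtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Build N heap-allocated strings of varying length. */
    char **buf = malloc(N * sizeof(char *));
    for (hsize_t i = 0; i < N; i++) {
        buf[i] = malloc(32);
        snprintf(buf[i], 32, "string number %llu", (unsigned long long)i);
    }

    H5Dwrite(dset, strtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    for (hsize_t i = 0; i < N; i++)
        free(buf[i]);
    free(buf);
    H5Pclose(dcpl);
    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(strtype);
    H5Fclose(file);
}

int main(void)
{
    write_file("plain.h5", 0);
    write_file("compressed.h5", 1);
    return 0;  /* compare the two files with ls -l or h5stat */
}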


#4

The dataset chunks consist of elements that, in memory, look like this:

typedef struct {
    size_t len; /* Length of VL data (in base type units) */
    void *p;    /* Pointer to VL data */
} hvl_t;

Those elements are what gets compressed. The length/address pairs may show sufficient regularity for the filter to achieve a certain degree of compression.
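
To make the regularity argument concrete, here is a self-contained demonstration I put together (not the HDF5 code path itself, just an analogy): deflate a synthetic array of length/reference records of the kind a chunk of variable-length elements would contain, using zlib, which is also what HDF5's gzip filter is built on. Consecutive records differ in only a few bytes, so they compress well.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

typedef struct {
    unsigned int len;       /* element length, like hvl_t.len */
    unsigned long long ref; /* stand-in for a heap address/index pair */
} vl_ref_t;

int main(void)
{
    enum { N = 100000 };
    vl_ref_t *recs = malloc(N * sizeof *recs);

    /* Consecutive elements: similar lengths, monotonically increasing
     * references -- highly regular byte patterns. */
    for (unsigned int i = 0; i < N; i++) {
        recs[i].len = 15 + (i % 7);
        recs[i].ref = 0x10000ULL + i;
    }

    uLong src_len = N * sizeof *recs;
    uLong dst_len = compressBound(src_len);
    Bytef *dst = malloc(dst_len);

    if (compress(dst, &dst_len, (const Bytef *)recs, src_len) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("raw: %lu bytes, deflated: %lu bytes (%.1f%%)\n",
           src_len, dst_len, 100.0 * dst_len / src_len);

    free(dst);
    free(recs);
    return 0;
}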


#5

Thanks Gerd,

I find that to be a useful bit of insight, and the code snippet makes it even clearer.