Fletcher32 filter on variable length string datasets (not suitable for filters)

paul_mueller · October 11, 2021, 2:05pm

Hello,

this is the upstream issue in h5py: https://github.com/h5py/h5py/issues/1948

I am getting this “not suitable for filters” error when working with variable length string datasets since the h5py 3.4.0 release. The current working theory is that this was never actually supported by HDF5.

Could someone with more in-depth knowledge please comment?

Thank you!

Paul

gheber · October 11, 2021, 7:39pm

Paul, compression is currently not supported for datatypes with non-fixed-size elements. In the case of variable-length sequences, such as strings, what’s stored in the chunk are (pointer, count) pairs, the in-file counterpart of hvl_t structures. The pointer part refers to the sequence “payload,” the actual sequence elements (i.e., characters for strings), which are stored on a heap in the file. That payload is currently not compressed or filtered, i.e., the Fletcher32 checksum would be of the chunks populated by (pointer, count) pairs and NOT the actual sequence elements. OK?
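To make that last point concrete, here is a minimal sketch in plain C (no HDF5 calls). It uses a textbook Fletcher-32 over 16-bit words, which is not the routine HDF5 uses internally, and the "chunk" and "heap" arrays are hypothetical stand-ins for the (pointer, count) pairs and the payload. It only illustrates that a checksum over the pairs cannot notice a change in the payload they refer to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Textbook Fletcher-32 over 16-bit words. Illustration only; this is
 * NOT HDF5's internal implementation. */
static uint32_t fletcher32_words(const uint16_t *words, size_t n)
{
    uint32_t sum1 = 0, sum2 = 0;

    assert(words != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        sum1 = (sum1 + words[i]) % 65535u;
        sum2 = (sum2 + sum1) % 65535u;
    }
    return (sum2 << 16) | sum1;
}

/* Checksums taken before and after one payload word is corrupted. */
typedef struct {
    uint32_t chunk_before, chunk_after; /* over the (offset, count) pairs */
    uint32_t heap_before, heap_after;   /* over the payload itself */
} demo_sums;

static demo_sums corrupt_one_payload_word(void)
{
    /* Hypothetical "heap": the payload of two strings, "hello" and "foo". */
    uint16_t heap[] = { 'h', 'e', 'l', 'l', 'o', 'f', 'o', 'o' };
    /* Hypothetical "chunk": one (heap offset, count) pair per string. */
    const uint16_t chunk[] = { 0, 5, 5, 3 };
    demo_sums s;

    s.chunk_before = fletcher32_words(chunk, 4);
    s.heap_before  = fletcher32_words(heap, 8);

    heap[1] = 'x'; /* simulate corruption of the payload */

    s.chunk_after = fletcher32_words(chunk, 4);
    s.heap_after  = fletcher32_words(heap, 8);
    return s;
}
```

Calling corrupt_one_payload_word() gives identical chunk checksums before and after, while the heap checksums differ. Only a checksum over the payload would catch the corruption.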

Best, G.

thomas1 · October 11, 2021, 8:11pm

Just to be clear, am I right in understanding that earlier versions of HDF5 would let you configure the fletcher32 filter for datasets of vlen data, but then checksum the (pointer, count) data rather than the vlen data itself? And then in 1.12.1, a check was added which gives an error if you try to create such a dataset?

gheber · October 11, 2021, 8:58pm

I don’t remember. I’ll do an experiment and report back.

I think we are seeing a “side effect” of this release note entry.

- Creation of dataset with optional filter

      When the combination of type, space, etc doesn't work for filter
      and the filter is optional, it was supposed to be skipped but it was
      not skipped and the creation failed.

      A fix is applied to allow the creation of a dataset in such
      situation, as specified in the user documentation.

      (BMR - 2020/8/13, HDFFV-10933)

G.

paul_mueller · October 11, 2021, 9:15pm

Thanks a lot for elaborating.

I use variable length string datasets to store text data (e.g. logs or notes) in HDF5 files. Is there any way (excluding binary blobs) that would allow me to store text data with compression and fletcher32 filters in HDF5 files? Ideally, HDFView should also display the text properly.

gheber · October 11, 2021, 9:20pm

If you had a maximal “line length” or “message size”, you could use fixed-length strings, and HDFView should behave. Otherwise, you could split the text into two arrays, a character array and an offset array (delineating strings/notes), but then HDFView would show you just two arrays.
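A rough sketch of the second option, in plain C with no HDF5 calls: the helper below (pack_strings is a made-up name, not an HDF5 function) concatenates the strings into one character array and records the start of each string in an offset array. Each array could then go into its own fixed-size-element dataset, where filters apply to the real data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Packs `count` NUL-terminated strings back to back into `chars`
 * (terminators are dropped) and records each start position in `offsets`.
 * `offsets` needs room for count + 1 entries; the last entry is the total
 * length, so string i occupies chars[offsets[i]] up to, but not
 * including, chars[offsets[i + 1]].
 * Returns the number of bytes written, or (size_t)-1 if `chars` is too
 * small. */
static size_t pack_strings(const char *const *strings, size_t count,
                           char *chars, size_t chars_cap, size_t *offsets)
{
    size_t used = 0;

    assert(strings != NULL && chars != NULL && offsets != NULL);
    for (size_t i = 0; i < count; ++i) {
        size_t len = strlen(strings[i]);
        if (len > chars_cap - used)
            return (size_t)-1; /* not enough room left */
        offsets[i] = used;
        memcpy(chars + used, strings[i], len);
        used += len;
    }
    offsets[count] = used;
    return used;
}
```

Storing count + 1 offsets means the length of string i is simply offsets[i + 1] - offsets[i], and empty strings need no special handling.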

G.

paul_mueller · October 11, 2021, 9:31pm

OK, thank you. I will migrate to fixed-length strings. The text data are not that big, so determining the maximum length should be computationally cheap.
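For reference, a minimal sketch of that preparation step in plain C. The helper names are made up for illustration and nothing here calls HDF5. It finds the maximum length, then copies every string into a NUL-padded slot of that width. A buffer laid out this way should match a fixed-length string type made with H5Tcopy(H5T_C_S1) and H5Tset_size(), with H5T_STR_NULLPAD padding, but treat that pairing as an assumption to check against your own setup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest string, i.e. the fixed element size to use. */
static size_t max_string_length(const char *const *strings, size_t count)
{
    size_t max_len = 0;

    for (size_t i = 0; i < count; ++i) {
        size_t len = strlen(strings[i]);
        if (len > max_len)
            max_len = len;
    }
    return max_len;
}

/* Copies each string into its own `width`-byte slot of `out` and pads the
 * remainder of the slot with NUL bytes. `out` must hold count * width
 * bytes, and `width` must be at least the longest string length. A string
 * of exactly `width` characters fills its slot with no terminator. */
static void fill_fixed_width(const char *const *strings, size_t count,
                             size_t width, char *out)
{
    memset(out, 0, count * width);
    for (size_t i = 0; i < count; ++i) {
        size_t len = strlen(strings[i]);
        assert(len <= width);
        memcpy(out + i * width, strings[i], len);
    }
}
```

Both passes are linear in the total text size, so for small logs or notes the cost is negligible.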

gheber · October 19, 2021, 3:54pm

Just to add a data point to the story. It appears that the Fletcher32 checksum is “the odd one out.” Here’s a sample program that will show that compression takes place for variable-length integer sequences.

#include "hdf5.h"

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int retval = EXIT_SUCCESS;
  hid_t file, dspace, dtype, dcpl, dset;


  if ((file = H5Fcreate("vlen.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)) ==
      H5I_INVALID_HID) {
    retval = EXIT_FAILURE;
    goto fail_file;
  }

  if ((dtype = H5Tvlen_create(H5T_STD_I32LE)) == H5I_INVALID_HID) {
    retval = EXIT_FAILURE;
    goto fail_dtype;
  }

  if ((dspace = H5Screate_simple(1, (hsize_t[]){2048},
                                 (hsize_t[]){H5S_UNLIMITED})) ==
      H5I_INVALID_HID) {
    retval = EXIT_FAILURE;
    goto fail_dspace;
  }

  if ((dcpl = H5Pcreate(H5P_DATASET_CREATE)) == H5I_INVALID_HID) {
    retval = EXIT_FAILURE;
    goto fail_dcpl;
  }

  if (H5Pset_chunk(dcpl, 1, (hsize_t[]) {1024}) < 0 ||
      H5Pset_deflate(dcpl, 1) < 0
      //H5Pset_fletcher32(dcpl) < 0
      ) {
    retval = EXIT_FAILURE;
    goto fail_dset;
  }

  if ((dset = H5Dcreate(file, "dset", dtype, dspace, H5P_DEFAULT, dcpl,
                        H5P_DEFAULT)) == H5I_INVALID_HID) {
    retval = EXIT_FAILURE;
    goto fail_dset;
  }

  {
    int data[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    size_t offset[] = {0, 1, 3, 6};
    hvl_t buf[2048];
    size_t i;

    // create an array that looks like this:
    // { {0}, {1,2}, {3,4,5}, {6,7,8,9}, ...}
    for (i = 0; i < 2048; ++i)
      {
        size_t rem = i%4;
        buf[i].len = 1 + rem;
        buf[i].p = data + offset[rem];
      }

    if (H5Dwrite(dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf) < 0)
      {
        retval = EXIT_FAILURE;
        goto fail_write;
      }
  }

fail_write:
  H5Dclose(dset);

fail_dset:
  H5Pclose(dcpl);

fail_dcpl:
  H5Sclose(dspace);

fail_dspace:
  H5Tclose(dtype);

fail_dtype:
  H5Fclose(file);

fail_file:
  return retval;
}

The output of h5dump -pBH vlen.h5 looks like this:

HDF5 "vlen.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_VLEN { H5T_STD_I32LE}
      DATASPACE  SIMPLE { ( 2048 ) / ( H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 1024 )
         SIZE 5772 (5.677:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 1 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_ALLOC
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
   }
}
}

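We are just compressing the hvl_t elements, which are pretty regular in this case. The reported ratio is consistent with that: 2048 elements × 16 bytes = 32768 bytes of (pointer, count) data, stored in 5772 bytes.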

For Fletcher32, the code fails with the error stack reported by h5py.

(This is with develop, which stands at HDF5 1.13.0-7.)

G.
