Fletcher32 filter on variable length string datasets (not suitable for filters)


#1

Hello,

this is the upstream issue in h5py: https://github.com/h5py/h5py/issues/1948

I am getting this “not suitable for filters” error when working with variable length string datasets since the h5py 3.4.0 release. The current working theory is that this was never actually supported by HDF5.

Could someone with more in-depth knowledge please comment?

Thank you!

Paul


#2

Paul, compression is currently not supported for datatypes with non-fixed-size elements. In the case of variable-length sequences, such as strings, what’s stored in the chunk are (pointer, count) pairs, the in-file counterpart of hvl_t structures. The pointer part refers to the sequence “payload,” the actual sequence elements (i.e., characters for strings), which are stored on a heap in the file. That payload is currently not compressed or filtered, i.e., the Fletcher32 checksum would be computed over the chunks populated by (pointer, count) pairs and NOT over the actual sequence elements. OK?
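To make the consequence concrete, here is a toy model in plain Python. It is only a sketch and not the HDF5 on-disk format: the `heap` and `descriptors` names, the 8-byte field widths, and the use of `zlib.crc32` as a stand-in for Fletcher32 are all assumptions made for illustration. It shows that a checksum over the (pointer, count) pairs does not notice a change to the payload.

```python
import struct
import zlib

# Toy model only: this is NOT how HDF5 lays out bytes on disk.
# Payload bytes live on a separate "heap"; the chunk holds (pointer, count) pairs.
strings = [b"alpha", b"beta", b"gamma"]

heap = bytearray()
descriptors = []
for s in strings:
    descriptors.append((len(heap), len(s)))  # (pointer into heap, element count)
    heap += s

# The "chunk" is just the packed descriptors, two little-endian uint64 per string.
chunk = b"".join(struct.pack("<QQ", off, n) for off, n in descriptors)

# zlib.crc32 stands in for Fletcher32; any checksum over the chunk behaves the same.
checksum_before = zlib.crc32(chunk)

# Corrupt one payload byte on the heap. The chunk itself is untouched.
heap[0] ^= 0xFF

checksum_after = zlib.crc32(chunk)
payload_changed = bytes(heap[:5]) != b"alpha"
```

After running this, `checksum_before` equals `checksum_after` even though `payload_changed` is true, which is why a checksum on the chunk gives no protection for the string contents.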

Best, G.


#3

Just to be clear, am I right in understanding that earlier versions of HDF5 would let you configure the fletcher32 filter for datasets of vlen data, but then checksum the (pointer, count) data rather than the vlen data itself? And then in 1.12.1, a check was added which gives an error if you try to create such a dataset?


#4

I don’t remember. I’ll do an experiment and report back.

I think we are seeing a “side effect” of this release note entry.

- Creation of dataset with optional filter

      When the combination of type, space, etc doesn't work for filter
      and the filter is optional, it was supposed to be skipped but it was
      not skipped and the creation failed.

      A fix is applied to allow the creation of a dataset in such
      situation, as specified in the user documentation.

      (BMR - 2020/8/13, HDFFV-10933)

G.


#5

Thanks a lot for elaborating.

I use variable length string datasets to store text data (e.g. logs or notes) in HDF5 files. Is there any way (excluding binary blobs) that would allow me to store text data with compression and fletcher32 filters in HDF5 files? Ideally, HDFView should also display the text properly.


#6

If you had a maximal “line length” or “message size”, you could use fixed-length strings, and HDFView should behave. Otherwise, you could split it into two arrays, a character array and an offset array (delineating strings/notes), but then HDFView would show you just two arrays.
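A minimal sketch of the two-array layout, in plain Python so it runs without any HDF5 library. The `pack` and `unpack` helpers are hypothetical names, and the choice of UTF-8 plus one extra trailing offset is an assumption; the idea is only that both arrays have fixed-size elements, so they could be stored as ordinary datasets.

```python
def pack(notes):
    """Concatenate UTF-8 encoded notes; data[offsets[i]:offsets[i + 1]] is note i."""
    data = bytearray()
    offsets = [0]
    for note in notes:
        data += note.encode("utf-8")
        offsets.append(len(data))  # end of this note, start of the next
    return bytes(data), offsets


def unpack(data, offsets):
    """Rebuild the list of notes from the byte buffer and its offsets."""
    return [data[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]


notes = ["first log line", "", "naïve café"]
data, offsets = pack(notes)
restored = unpack(data, offsets)
```

Offsets are in bytes, not characters, so non-ASCII text and empty notes round-trip without special handling.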

G.


#7

OK, thank you. I will migrate to fixed-length strings. The text data are not that big, so determining the maximum length should be computationally cheap.
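For anyone doing the same migration, a rough sketch of determining the width follows. The helper names are hypothetical, and NUL padding and UTF-8 are assumptions. The one point worth noting is that a fixed-length string size is counted in bytes, so the maximum has to be taken over the encoded text rather than over the character count.

```python
def to_fixed_width(notes, encoding="utf-8"):
    """Encode notes and pad each with NUL bytes to the longest encoded length."""
    encoded = [note.encode(encoding) for note in notes]
    # Assumption: use a width of at least 1 so an all-empty list still gets a usable size.
    width = max([len(e) for e in encoded] + [1])
    padded = [e.ljust(width, b"\x00") for e in encoded]
    return width, padded


def from_fixed_width(padded, encoding="utf-8"):
    """Strip the NUL padding and decode back to str."""
    return [p.rstrip(b"\x00").decode(encoding) for p in padded]


notes = ["ok", "héllo", ""]
width, padded = to_fixed_width(notes)
restored = from_fixed_width(padded)
```

Here `width` is 6 rather than 5 because "é" takes two bytes in UTF-8. Notes that themselves end in NUL bytes would not survive the round trip. With h5py, the width could presumably be passed to `h5py.string_dtype(encoding='utf-8', length=width)` when creating the dataset.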