Proposal: shape-aware filter plugins for HDF5 (cross-codec preprocessor with measured wins)

## Question: would STRATA-style shape-aware filters be useful as HDF5 plugins?

Hi all,

I have built a small MIT-licensed lossless preprocessing library called **STRATA**:

https://github.com/rjamesy/strata

STRATA exposes a bank of reversible structural transforms:

- 2D predictors
- 3D axis deltas
- full octahedral cube rotation for 3D volumes, using all 48 orientations
- concentric-shell radial reordering
- YCoCg-R lossless colour decorrelation
- BWT
- an auto-selector that always includes `raw` as a candidate, so the floor is “no worse than the underlying codec alone”
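To make one entry in that list concrete, here is a minimal sketch of YCoCg-R in its usual integer lifting form. It is illustrative Python rather than STRATA's actual code, and the function names `rgb_to_ycocg_r` and `ycocg_r_to_rgb` are made up for this example.

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward YCoCg-R lifting steps on integer samples."""
    co = r - b
    t = b + (co >> 1)   # >> is a floor shift, also for negative values
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg


def ycocg_r_to_rgb(y, co, cg):
    """Undo the lifting steps in reverse order; exact for all integers."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b


if __name__ == "__main__":
    pixel = (255, 0, 0)
    # The round trip must reproduce the input exactly.
    assert ycocg_r_to_rgb(*rgb_to_ycocg_r(*pixel)) == pixel
```

For 8-bit input, Co and Cg range over -255..255, so they need one extra bit each; a filter would have to account for that in its output layout.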

The measurements below are reproducible via:

```bash
python3 bench/preprocess_demo.py
```

That script writes a CSV covering the tested codec × dataset combinations.

Some current measurements:

| Dataset | Pipeline | Result |
|---|---|---|
| Smooth 256×256 heightmap | `predict_2d_gradient` + deflate-9 | 57.1% smaller than raw deflate-9 |
| 64³ CT-style volume | cube rotation + radial + zstd-22 | 13.3–14.4% smaller than raw zstd-22 |
| 2252×4000 RGB photo | YCoCg-R + zstd-22 | 35% smaller than raw zstd-22 |

The 3D volume case is the one I think most overlaps with HDF5 use cases.

The idea is simple: cube rotation can align the smoothest direction of a non-axis-aligned volume with the storage axis before delta-style encoding. For the `volume_64.raw` CT-style test, this turns a “roughly tied with raw bytes” case into a 13.3–14.4% size reduction relative to raw zstd-22.
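A minimal sketch of that idea, under stated assumptions: orientations are enumerated as 6 axis permutations × 8 flip combinations (48 in total), the delta is a byte-wise delta over the flattened storage order, and `zlib` stands in for zstd. All function names and the synthetic volume are hypothetical and not taken from STRATA.

```python
import itertools
import zlib


def orientations():
    """Yield all 48 (axis permutation, per-axis flip) pairs."""
    for perm in itertools.permutations(range(3)):
        for flips in itertools.product((False, True), repeat=3):
            yield perm, flips


def source_offsets(shape, perm, flips):
    """List input C-order offsets in the storage order of the reoriented volume."""
    out_shape = tuple(shape[p] for p in perm)
    strides = (shape[1] * shape[2], shape[2], 1)
    offsets = []
    for idx in itertools.product(*(range(n) for n in out_shape)):
        off = 0
        for axis, k in enumerate(idx):
            if flips[axis]:
                k = out_shape[axis] - 1 - k
            off += k * strides[perm[axis]]
        offsets.append(off)
    return offsets


def reorient(vol, shape, perm, flips):
    """Gather voxels so that output axis i is input axis perm[i]."""
    return bytes(vol[o] for o in source_offsets(shape, perm, flips))


def restore(data, shape, perm, flips):
    """Invert reorient by scattering each voxel back to its original offset."""
    out = bytearray(len(data))
    for pos, off in enumerate(source_offsets(shape, perm, flips)):
        out[off] = data[pos]
    return bytes(out)


def delta_encode(data):
    """Byte-wise difference to the previous byte, modulo 256."""
    out, prev = bytearray(), 0
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)


def delta_decode(data):
    """Running sum modulo 256; inverse of delta_encode."""
    out, prev = bytearray(), 0
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


def best_orientation(vol, shape):
    """Return (compressed size, perm, flips) of the smallest candidate."""
    best = None
    for perm, flips in orientations():
        packed = zlib.compress(delta_encode(reorient(vol, shape, perm, flips)), 9)
        if best is None or len(packed) < best[0]:
            best = (len(packed), perm, flips)
    return best


def demo_volume(n=16):
    """Synthetic volume: smooth along axis 0, pseudo-random along axes 1 and 2."""
    state, vol = 12345, bytearray(n * n * n)
    for j in range(n):
        for k in range(n):
            state = (state * 1103515245 + 12345) % 2**31
            base, slope = (state >> 16) & 0xFF, (state >> 24) & 3
            for i in range(n):
                vol[(i * n + j) * n + k] = (base + i * slope) & 0xFF
    return bytes(vol), (n, n, n)


if __name__ == "__main__":
    vol, shape = demo_volume()
    size, perm, flips = best_orientation(vol, shape)
    print("identity:", len(zlib.compress(delta_encode(vol), 9)), "bytes")
    print("best:", size, "bytes with perm", perm, "flips", flips)
```

Because the identity orientation is one of the 48 candidates, the selected orientation can never be larger than the unrotated delta path. A decoder only needs the chosen `perm` and `flips` to call `restore`.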

This seems structurally aligned with HDF5’s chunk-filter model, so I wanted to ask before doing any plugin work.

## Questions

  1. Would these make sense as registered HDF5 filters?

    The cleanest packaging seems to be standalone H5Z plugins such as:

    - predict_2d
    - YCoCg-R
    - radial
    - cube_rotation

    The transforms are small, roughly 30–300 LOC each, and MIT-licensed.

  2. Is an auto-select-with-floor meta-filter appropriate for HDF5?

    HDF5 users can already construct filter chains manually, for example shuffle -> delta -> deflate.

    STRATA’s selector instead tries several reversible candidates per chunk, emits the smallest result, and stores a small mode tag. Since `raw` is always a candidate, the selector should not lose to the underlying codec-only path by more than the size of the mode tag.

    Has this pattern been considered before for HDF5 filters? It seems like a natural fit for chunked storage, but I do not want to assume it matches the project’s design philosophy.

  3. What reference datasets should I benchmark?

    My current corpus is small:

    - terrain-like heightmaps
    - synthetic CT-style volumes
    - RGB photos
    - sensor CSV

    I would rather benchmark against data the HDF5 community actually cares about. Candidate areas might be climate-model output, structural simulation, 3D imaging, microscopy stacks, or other chunked scientific arrays.
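To make the selector in question 2 concrete, here is a rough sketch of per-chunk selection with a `raw` floor. The one-byte tag layout, the `CANDIDATES` table and the function names are assumptions for illustration, not STRATA's on-disk format or any HDF5 API, and `zlib` stands in for whichever codec follows the transform.

```python
import zlib


def delta_encode(data):
    """Byte-wise difference to the previous byte, modulo 256."""
    out, prev = bytearray(), 0
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)


def delta_decode(data):
    """Running sum modulo 256; inverse of delta_encode."""
    out, prev = bytearray(), 0
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


def identity(data):
    return bytes(data)


# Hypothetical mode table: tag -> (forward transform, inverse transform).
# Tag 0 is raw and is always present, which is what provides the floor.
CANDIDATES = {
    0: (identity, identity),
    1: (delta_encode, delta_decode),
}


def encode_chunk(chunk):
    """Try every candidate, keep the smallest payload, prepend its mode tag."""
    best_tag, best_payload = None, None
    for tag, (forward, _) in CANDIDATES.items():
        payload = zlib.compress(forward(chunk), 9)
        if best_payload is None or len(payload) < len(best_payload):
            best_tag, best_payload = tag, payload
    return bytes([best_tag]) + best_payload


def decode_chunk(blob):
    """Read the mode tag, decompress, then apply the matching inverse."""
    tag, payload = blob[0], blob[1:]
    _, inverse = CANDIDATES[tag]
    return inverse(zlib.decompress(payload))


if __name__ == "__main__":
    ramp = bytes(range(256)) * 16
    blob = encode_chunk(ramp)
    assert decode_chunk(blob) == ramp
    print("mode tag:", blob[0], "size:", len(blob),
          "codec only:", len(zlib.compress(ramp, 9)))
```

With this layout the output is at most one byte larger than the codec-only result, and the decoder needs no information beyond the chunk itself.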


No expectation that this belongs inside HDF5 itself. The cleaner home may simply be the filter registry. I mainly wanted to ask whether the approach is useful to the HDF5 ecosystem before packaging the transforms as H5Z plugins.

Best,
Richard James


  1. If each filter performs independent operations, then it makes sense to package them separately.
  2. The filters have enough flexibility in their operation that this should be fine: the mode tag can be included in the filtered output, and at decompression time the mode tag can be checked to determine the decompression mode.
  3. If you just want a quick test, the chunk benchmark executable (under `build/bin/chunk` if you build HDF5 from source) creates a small (~4MB) chunked dataset you can try your filters on. Most public scientific datasets are already chunked and compressed, so you’d need to use `h5repack` to decompress them before testing your own filters.