## Question: would STRATA-style shape-aware filters be useful as HDF5 plugins?
Hi all,
I have built a small MIT-licensed lossless preprocessing library called **STRATA**:
https://github.com/rjamesy/strata
STRATA exposes a bank of reversible structural transforms:
- 2D predictors
- 3D axis deltas
- full octahedral cube rotation for 3D volumes, using all 48 orientations
- concentric-shell radial reordering
- YCoCg-R lossless colour decorrelation
- BWT
- an auto-selector that always includes `raw` as a candidate, so the floor is “no worse than the underlying codec alone”
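
To illustrate what "reversible" means for one of these transforms, here is a minimal sketch of the standard YCoCg-R lifting steps on integer samples. It is an illustration only, not STRATA's actual code, and the function names `ycocg_r_forward` and `ycocg_r_inverse` are hypothetical.

```python
def ycocg_r_forward(r, g, b):
    """Forward YCoCg-R lifting steps on integer samples."""
    co = r - b
    t = b + (co >> 1)   # >> floors, also for negative values
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg


def ycocg_r_inverse(y, co, cg):
    """Undo the lifting steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b


# Round-trip check on a few 8-bit pixels.
pixels = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (12, 200, 77), (1, 2, 3)]
for p in pixels:
    assert ycocg_r_inverse(*ycocg_r_forward(*p)) == p
```

Because each lifting step is undone exactly by its counterpart, the round trip is bit-exact. The chroma channels `co` and `cg` need one extra bit of range compared with the input samples.
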
The results are reproducible via:
```bash
python3 bench/preprocess_demo.py
```

That script writes a CSV covering the tested codec × dataset combinations.
Some current measurements:
| Dataset | Pipeline | Result |
|---|---|---|
| Smooth 256×256 heightmap | predict_2d_gradient + deflate-9 | 57.1% smaller than raw deflate-9 |
| 64³ CT-style volume | cube rotation + radial + zstd-22 | 13.3–14.4% smaller than raw zstd-22 |
| 2252×4000 RGB photo | YCoCg-R + zstd-22 | 35% smaller than raw zstd-22 |
The 3D volume case is the one I think most overlaps with HDF5 use cases.
The idea is simple: cube rotation can align the smoothest direction of a non-axis-aligned volume with the storage axis before delta-style encoding. For the `volume_64.raw` CT-style test, this turns a “roughly tied with raw bytes” case into a roughly 14% improvement on top of zstd-22.
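
A toy version of that effect, as a sketch under stated assumptions: a synthetic 16³ volume that is smooth only along its slowest storage axis, a byte-wise delta along each of the three axes (a simplification of the full 48-orientation search), and `zlib` standing in for zstd. The helper names are hypothetical and this is not STRATA's code.

```python
import random
import zlib

N = 16  # toy volume edge length


def make_volume(seed=0):
    """Volume indexed [z][y][x]: noisy in y and x, a smooth ramp along z."""
    rng = random.Random(seed)
    base = [[rng.randrange(256) for _ in range(N)] for _ in range(N)]
    return [[[(base[y][x] + 3 * z) % 256 for x in range(N)]
             for y in range(N)] for z in range(N)]


def delta_along(vol, axis):
    """Byte-wise delta (mod 256) along axis 0=z, 1=y, 2=x, emitted in storage order."""
    out = bytearray()
    for z in range(N):
        for y in range(N):
            for x in range(N):
                idx = [z, y, x]
                cur = vol[z][y][x]
                if idx[axis] == 0:
                    out.append(cur)  # first sample on each line is kept as-is
                else:
                    idx[axis] -= 1
                    prev = vol[idx[0]][idx[1]][idx[2]]
                    out.append((cur - prev) % 256)
    return bytes(out)


def undelta_along(data, axis):
    """Inverse of delta_along; neighbours are always decoded before they are needed."""
    vol = [[[0] * N for _ in range(N)] for _ in range(N)]
    i = 0
    for z in range(N):
        for y in range(N):
            for x in range(N):
                idx = [z, y, x]
                if idx[axis] == 0:
                    vol[z][y][x] = data[i]
                else:
                    idx[axis] -= 1
                    prev = vol[idx[0]][idx[1]][idx[2]]
                    vol[z][y][x] = (data[i] + prev) % 256
                i += 1
    return vol


vol = make_volume()
raw = bytes(v for plane in vol for row in plane for v in row)
raw_size = len(zlib.compress(raw, 9))
sizes = {axis: len(zlib.compress(delta_along(vol, axis), 9)) for axis in range(3)}
best_axis = min(sizes, key=sizes.get)
print("raw:", raw_size, "delta by axis:", sizes, "best axis:", best_axis)
```

In this toy case the delta along `z` wins because that is where the data is smooth, even though `z` is the slowest-varying storage axis. Rotating the volume so that direction lands on the axis the predictor works along is the same idea in a different form.
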
This seems structurally aligned with HDF5’s chunk-filter model, so I wanted to ask before doing any plugin work.
### Questions

- Would these make sense as registered HDF5 filters?

  The cleanest packaging seems to be standalone H5Z plugins such as:

  - `predict_2d`
  - `YCoCg-R`
  - `radial`
  - `cube_rotation`

  The transforms are small, roughly 30–300 LOC each, and MIT-licensed.

- Is an auto-select-with-floor meta-filter appropriate for HDF5?

  HDF5 users can already construct filter chains manually, for example `shuffle -> delta -> deflate`. STRATA’s selector instead tries several reversible candidates per chunk, emits the smallest result, and stores a small mode tag. Since `raw` is always a candidate, the selector should not lose to the underlying codec-only path by more than the size of that tag.

  Has this pattern been considered before for HDF5 filters? It seems like a natural fit for chunked storage, but I do not want to assume it matches the project’s design philosophy.

- What reference datasets should I benchmark?

  My current corpus is small:

  - terrain-like heightmaps
  - synthetic CT-style volumes
  - RGB photos
  - sensor CSV

  I would rather benchmark against data the HDF5 community actually cares about. Candidate areas might be climate-model output, structural simulation, 3D imaging, microscopy stacks, or other chunked scientific arrays.
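
To make the auto-select-with-floor question concrete, here is a minimal per-chunk sketch. It assumes a one-byte mode tag, only two candidates (`raw` and a byte-wise delta), and `zlib` as the underlying codec. The names `encode_chunk`, `decode_chunk`, and `CANDIDATES` are hypothetical, and this is neither STRATA's selector nor the HDF5 filter API.

```python
import random
import zlib


def delta_encode(data: bytes) -> bytes:
    """Byte-wise delta (mod 256) against the previous byte."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) % 256)
        prev = b
    return bytes(out)


def delta_decode(data: bytes) -> bytes:
    """Inverse of delta_encode: running sum mod 256."""
    prev = 0
    out = bytearray()
    for d in data:
        prev = (prev + d) % 256
        out.append(prev)
    return bytes(out)


# mode tag -> (forward transform, inverse transform); tag 0 is the raw floor
CANDIDATES = {
    0: (lambda d: d, lambda d: d),
    1: (delta_encode, delta_decode),
}


def encode_chunk(chunk: bytes) -> bytes:
    """Try every candidate, keep the smallest payload, prefix it with its tag."""
    best_tag, best_payload = None, None
    for tag, (forward, _) in CANDIDATES.items():
        payload = zlib.compress(forward(chunk), 9)
        if best_payload is None or len(payload) < len(best_payload):
            best_tag, best_payload = tag, payload
    return bytes([best_tag]) + best_payload


def decode_chunk(blob: bytes) -> bytes:
    """Read the tag, decompress, then apply the matching inverse transform."""
    _, inverse = CANDIDATES[blob[0]]
    return inverse(zlib.decompress(blob[1:]))


ramp = bytes(i % 256 for i in range(4096))        # smooth data, delta helps
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(4096))  # nothing helps

for chunk in (ramp, noise):
    blob = encode_chunk(chunk)
    assert decode_chunk(blob) == chunk
    # Floor: never more than the one-byte tag worse than the codec alone.
    assert len(blob) <= len(zlib.compress(chunk, 9)) + 1
```

The cost of the floor guarantee is the tag byte per chunk plus the encode-time work of compressing every candidate. Decoding only ever runs the one transform named by the tag.
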

Relevant links:

- Cross-codec results: https://github.com/rjamesy/strata/blob/main/bench/results/F10_PREPROCESS_RESULTS.md
- v0.4.0 release: https://github.com/rjamesy/strata/releases/tag/v0.4.0

No expectation that this belongs inside HDF5 itself. The cleaner home may simply be the filter registry. I mainly wanted to ask whether the approach is useful to the HDF5 ecosystem before packaging the transforms as H5Z plugins.
Best,
Richard James
