Chunk (and compress) large string

m.diehl · December 16, 2023, 11:27am

Dear all,

I’m writing relatively large text files to HDF5 in Fortran and would like to chunk and compress them. The way I currently do it requires creating a datatype whose size equals the string length. Unfortunately, that means that chunking is not really possible: the only chunk has the size of the whole string.

My use case does not involve reading parts of the string. Still, I was wondering whether chunking would be advantageous, for example for the compression filters. I would appreciate suggestions for implementations.

EDIT: A serious limitation of having a single chunk seems to be the 4 GiB limit on chunk size. Once the string length exceeds this limit, writing the chunked dataset fails.

An MWE illustrating the current situation is attached.
test.f90 (2.6 KB)

dave.allured · December 16, 2023, 3:00pm

@m.diehl, your MWE works fine for me. It successfully writes a compressed string in a single chunk, and shows file size reduction. My only change was to reduce the demo string length from 2^26 to 2^20, because the larger size crashed my program on my older Mac, for some irrelevant reason.

So I do not understand your question. What problem are you trying to solve by adding chunking? I do not believe that chunking your single long strings would improve compression in the slightest degree. But it would add complexity.
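To make the compression point concrete, here is a minimal sketch that compresses the same bytes once as a whole and once in independent pieces. It uses plain `zlib` from the Python standard library as a stand-in for HDF5's deflate filter, which is applied to each chunk separately. The sample text, the 64 KiB piece size, and the helper names are made up for illustration; real data will give different numbers.

```python
import zlib


def compress_whole(data, level=6):
    """Compress the entire byte string in one call (one chunk)."""
    return zlib.compress(data, level)


def compress_chunked(data, chunk_bytes, level=6):
    """Compress each slice independently, like one filter call per chunk."""
    return [
        zlib.compress(data[start:start + chunk_bytes], level)
        for start in range(0, len(data), chunk_bytes)
    ]


def decompress_chunked(chunks):
    """Reassemble the original bytes from independently compressed slices."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)


# Hypothetical sample: repetitive log-like text, purely for demonstration.
sample = "".join(
    f"step {i}: residual = {i * i % 97}\n" for i in range(20000)
).encode("ascii")

whole = compress_whole(sample)
chunks = compress_chunked(sample, 64 * 1024)

print("raw bytes:            ", len(sample))
print("compressed, one chunk:", len(whole))
print("compressed, chunked:  ", sum(len(c) for c in chunks))
```

Each piece is compressed without knowledge of the others and carries its own small header, so splitting the data is not expected to make the result smaller.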

m.diehl · December 17, 2023, 7:27am

@dave.allured thanks for the clarification, good to know that compression is independent of the chunk size.

Actually, my question originated from an issue I encountered a while ago; all I remembered was that it was related to chunking.
After careful checking, I realized that the actual problem is the chunk size: it must not exceed 4 GiB. So for very large strings stored as a single chunk, chunking (and therefore checksums and compression) is not possible.
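A rough sketch of the arithmetic behind this limit, assuming the text were instead stored as a one-dimensional dataset of single bytes, so that the chunk size could be chosen independently of the string length. That layout is an assumption for illustration and not what the reproducer does; `plan_chunks` and `MAX_CHUNK_BYTES` are hypothetical names, and the bound of 2^32 - 1 bytes reflects the "under 4 GiB" chunk limit discussed here.

```python
# Assumption: a single HDF5 chunk must stay below 4 GiB.
MAX_CHUNK_BYTES = 2**32 - 1


def plan_chunks(n_bytes, chunk_bytes=1024 * 1024):
    """Return (number of chunks, bytes of data in the last chunk)
    for n_bytes stored as a 1-D array of single bytes."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    if not 0 < chunk_bytes <= MAX_CHUNK_BYTES:
        raise ValueError("chunk size must be positive and below 4 GiB")
    n_chunks = (n_bytes + chunk_bytes - 1) // chunk_bytes  # ceiling division
    used_in_last = n_bytes - (n_chunks - 1) * chunk_bytes
    return n_chunks, used_in_last


# A 5 GiB string split into 1 MiB chunks:
print(plan_chunks(5 * 2**30, 2**20))  # (5120, 1048576)

# The same string as one single chunk exceeds the limit:
try:
    plan_chunks(5 * 2**30, 5 * 2**30)
except ValueError as err:
    print("rejected:", err)
```

With the string length decoupled from the chunk size, the total size is no longer bounded by the per-chunk limit; whether this layout is acceptable depends on how the data is read back.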

I’ve updated the title and edited the original question. A small reproducer is given here:
test.f90 (2.7 KB)

m.diehl · December 17, 2023, 10:46am

Note that chunking strings in Python/h5py works, see attached code:
str.py (271 Bytes)

m.diehl · January 14, 2024, 8:02am

@epourmal1 Are there still plans to support compression of non-chunked datasets? That would be a good solution.
