4-bit / 2-bit datatype support

Synchrotron X-ray detectors are getting faster (and consequently detect fewer photons) and stream images in 4-bit or even 2-bit (packed) format. Is there a way to store the data efficiently and portably in HDF5 without expanding the data to 1 byte per element?

Using H5T_OPAQUE kind of defeats the self-documenting nature of HDF5.

I am not too familiar with bitfields, but it looks like they offer packed storage of bits. Could that be a solution?

For the record, a similar question was asked 4 years ago.

Sam, how are you? Combining a bitfield datatype with the built-in N-bit filter will strip the padding bits and store only the significant bits. See Using the N-bit Filter. OK?
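
Roughly like this (file name, dataset name, and sizes are placeholders):

#include "hdf5.h"

int main(void)
{
    /* 4 significant bits inside an 8-bit little-endian bitfield */
    hid_t ftype = H5Tcopy(H5T_STD_B8LE);
    H5Tset_precision(ftype, 4);
    H5Tset_offset(ftype, 0);

    hsize_t dims[1]  = {1048576};
    hsize_t chunk[1] = {65536};
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* The N-bit filter works only on chunked datasets */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_nbit(dcpl);

    hid_t file = H5Fcreate("frames.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "data", ftype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* In memory each element still occupies a full byte (H5T_NATIVE_B8);
       the filter strips the 4 padding bits per element on the way to disk. */

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(ftype);
    H5Fclose(file);
    return 0;
}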

Best, G.

Hi Gerd, I am good, thanks for asking! I hope everything is awesome on your side too.

Reading the N-bit Filter doc:

Unless it is packed, an n-bit datatype is presented as an n-bit bitfield within a larger-sized value. For example, a 12-bit datatype might be presented as a 12-bit field in a 16-bit, or 2-byte, value.

But what if the data are packed in memory (our use case) and we want to keep them that way? Then I guess the n-bit filter would not be necessary? But how would we define the memory datatype?

hid_t nbit_datatype = H5Tcopy(H5T_STD_I8LE);  /* 8-bit container */
H5Tset_precision(nbit_datatype, 4);           /* keep 4 significant bits */
H5Tset_offset(nbit_datatype, 0);              /* starting at bit 0 */

AFAIU, that would define a 4-bit type that uses the first 4 bits of a byte, but what about the other half of the byte?

Correct, the packing applies only to the file, not to memory. I think you have three options:

  1. You live with the bloat in memory but control memory pressure by processing your data in batches/chunks.
  2. You could use your packed in-memory representation but use an opaque type as the memory type. You would then need to write (and register) a soft datatype conversion that temporarily “inflates” your packed in-memory representation into a bitfield.
  3. You write your pre-packed chunks directly into the file (see the sketch after this list), into a dataset whose metadata registers the n-bit filter. Of course, if you read from this dataset, you’d face the same problem in reverse. You can either accept the bloat, have a soft datatype conversion that handles the bitfield → opaque direction, or read the chunk directly into memory and deal with the raw bytes yourself.
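
For option 3, a rough sketch (the helper name is mine, and it assumes your detector’s packed layout matches the byte/bit order the N-bit filter writes to disk, which you’d want to verify first):

#include "hdf5.h"

/* Write one pre-packed chunk (4 bits per element) straight into the
 * file, bypassing the filter pipeline. dset must be a chunked dataset
 * created with H5Pset_nbit(). */
herr_t write_packed_chunk(hid_t dset, const hsize_t *offset,
                          const void *packed, size_t packed_nbytes)
{
    /* Filter mask 0 claims "all pipeline filters applied", so a plain
     * H5Dread() will run the N-bit decoder and hand readers one byte
     * per element. */
    return H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset,
                          packed_nbytes, packed);
}

Note that H5Dwrite_chunk() needs HDF5 1.10.2 or newer.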

Does that make sense?

G.

Absolutely, thank you!

I think we should go with option 3 since we have made the effort to write our (basic) processing with support for n-bit packed in-memory datasets.

Reading the datasets should be easy for users: there is no real-time constraint like the one we have on the data-acquisition side. The expansion to 8 bits would be acceptable and user-friendly. And since this option also lets users who want the 4-bit packed in-memory datasets read the raw chunks (rough sketch below), it sounds like a good compromise.
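
For those users, the raw-chunk read could look something like this (rough sketch, helper name made up):

#include "hdf5.h"
#include <stdint.h>
#include <stdlib.h>

/* Read one chunk back raw, i.e. still 4-bit packed, skipping the N-bit
 * decode. The caller frees the returned buffer. */
uint8_t *read_raw_chunk(hid_t dset, const hsize_t *offset)
{
    hsize_t nbytes = 0;  /* on-disk size of this particular chunk */
    if (H5Dget_chunk_storage_size(dset, offset, &nbytes) < 0)
        return NULL;

    uint8_t *packed = malloc((size_t)nbytes);
    if (packed == NULL)
        return NULL;

    uint32_t filter_mask = 0;  /* reports which filters were skipped */
    if (H5Dread_chunk(dset, H5P_DEFAULT, offset, &filter_mask, packed) < 0) {
        free(packed);
        return NULL;
    }
    return packed;  /* bytes are the packed 4-bit data as stored */
}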

I’ll give it a try!