4-bit / 2-bit datatype support

Synchrotron X-ray detectors are getting faster (and consequently detect fewer photons per frame) and stream images in 4-bit or even 2-bit (packed) format. Is there a way to store the data efficiently and portably in HDF5 without expanding it to 1 byte per value?

Using H5T_OPAQUE kind of defeats the self-documenting nature of HDF5.

I am not too familiar with the bitfield datatype, but it looks like it offers packed storage of bits. Could that be a solution?

For the record, a similar question was asked 4 years ago.

Sam, how are you? Combining a bitfield datatype and a suitable n-bit filter will strip the padding bits and store only the significant bits. See Using the N-bit Filter. OK?
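
For what it's worth, a minimal sketch of that combination (file name, dataset name, and dimensions are made up, and error checking is omitted). It is shown with an unsigned integer type of 4-bit precision, which is the case the N-bit documentation illustrates; a bitfield type would be set up the same way via H5Tset_precision/H5Tset_offset:

#include "hdf5.h"

/* Sketch: a file datatype with 4 significant bits inside an 8-bit
   container, on a chunked dataset with the N-bit filter enabled. */
hid_t file = H5Fcreate("frames.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

hid_t ftype = H5Tcopy(H5T_STD_U8LE);  /* 8-bit container in the file... */
H5Tset_precision(ftype, 4);           /* ...with 4 significant bits */
H5Tset_offset(ftype, 0);              /* starting at bit 0 */

hsize_t dims[2]  = {1024, 1024};      /* example frame size */
hsize_t chunk[2] = {64, 1024};        /* example chunk size */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 2, chunk);         /* the N-bit filter requires chunking */
H5Pset_nbit(dcpl);                    /* strip the padding bits in the file */

hid_t space = H5Screate_simple(2, dims, NULL);
hid_t dset  = H5Dcreate2(file, "frames", ftype, space,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);

/* In memory, each value still occupies a full byte; the filter packs
   the 4 significant bits of each element on the way to disk. */
static unsigned char frame[1024 * 1024];
H5Dwrite(dset, H5T_NATIVE_UCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, frame);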

Best, G.

Hi Gerd, I am good, thanks for asking! I hope everything is awesome on your side too.

Reading the N-bit Filter doc:

Unless it is packed, an n-bit datatype is presented as an n-bit bitfield within a larger-sized value. For example, a 12-bit datatype might be presented as a 12-bit field in a 16-bit, or 2-byte, value.

But what if the data are packed in memory (our use case) and we want to keep it that way? Then I guess the n-bit filter would not be necessary? But how would we define the memory datatype?

hid_t nbit_datatype = H5Tcopy(H5T_STD_I8LE);
H5Tset_precision(nbit_datatype, 4);
H5Tset_offset(nbit_datatype, 0);

AFAIU, that would define a 4-bit type that uses the first 4 bits of a byte, but what about the other half of the byte?

Correct, the packing applies only to the file, not to memory. I think you have three options:

  1. You live with the bloat in memory but control memory pressure by processing your data in batches/chunks.
  2. You could use your packed in-memory representation but use an opaque type as the memory type. You would then need to write (and register) a soft datatype conversion that temporarily “inflates” your packed in-memory representation into a bitfield.
  3. You write your pre-packed chunks directly into the file, which, as part of its metadata, would register the n-bit filter (see the write-path sketch after this list). Of course, if you read from this dataset, you’d face the same problem in reverse. You can either accept the bloat, have a soft datatype conversion that can handle the bitfield → opaque conversion, or read the chunk directly into memory and deal with the raw chunk.
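
If it helps, here is a sketch of option 3's write side, assuming HDF5 1.10.3 or later for H5Dwrite_chunk and assuming your packed layout is bit-identical to what the n-bit filter would produce. A filter mask of 0 tells the library that every filter in the pipeline (here, n-bit) has already been applied to the buffer; dataset handle and chunk geometry are the placeholders from the earlier sketch:

/* Option 3 write path (sketch): `dset` was created with H5Pset_nbit
   in its pipeline; chunk geometry is the same 64 x 1024 placeholder. */
hsize_t offset[2] = {0, 0};                  /* chunk origin in the dataset */
static unsigned char packed[64 * 1024 / 2];  /* two 4-bit values per byte */

/* Filter mask 0 = "all pipeline filters already applied", so the
   pre-packed buffer goes to disk untouched. */
H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset, sizeof(packed), packed);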

Does that make sense?

G.

Absolutely, thank you!

I think we should go with option 3, since we have already put in the effort to write our (basic) processing with support for n-bit packed in-memory datasets.

Reading the datasets should be easy for users; there is no real-time constraint like the one we have on the data acquisition side, so the expansion to 8 bits would be acceptable and user friendly. And since this option also lets users who want the 4-bit packed in-memory datasets read the raw chunks, it sounds like a good compromise.
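
For those users, a read-side sketch (again HDF5 1.10.3 or later, same placeholder names as above): H5Dread_chunk hands back the chunk bytes exactly as stored, skipping the filter pipeline, and reports the per-chunk filter mask:

/* Raw read path (sketch): fetch the still-packed chunk bytes without
   running them through the n-bit filter. */
hsize_t offset[2] = {0, 0};
uint32_t filter_mask;                        /* set by the library */
static unsigned char packed[64 * 1024 / 2];  /* two 4-bit values per byte */
H5Dread_chunk(dset, H5P_DEFAULT, offset, &filter_mask, packed);
/* `packed` now holds the chunk as stored; unpacking the 4-bit values
   is up to the application. A plain H5Dread would instead inflate the
   data to one value per byte. */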

I’ll give it a try!