Synchrotron X-ray detectors are getting faster (and consequently detect fewer photons per frame) and stream images in 4-bit or even 2-bit (packed) format. Is there a way to store the data efficiently and portably in HDF5 without expanding each value to 1 byte?
Using H5T_OPAQUE kind of defeats the self-documenting nature of HDF5.
I am not too familiar with bitfields, but it looks like they offer packed storage of bits. Could that be a solution?
Sam, how are you? Combining a bitfield datatype with a suitable n-bit filter will strip the padding bits and store only the significant bits. See Using the N-bit Filter. OK?
Hi Gerd, I am good, thanks for asking! I hope everything is awesome on your side too.
Reading the N-bit Filter doc:
Unless it is packed, an n-bit datatype is presented as an n-bit bitfield within a larger-sized value. For example, a 12-bit datatype might be presented as a 12-bit field in a 16-bit, or 2-byte, value.
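To make the quoted example concrete, here is a minimal pure-Python sketch of what "stripping the padding bits" means for 12-bit values held in 16-bit containers. The helper names `pack_bits` and `unpack_bits` are hypothetical, and the big-endian bit order is an assumption chosen for readability; this is not the bit layout the HDF5 n-bit filter actually writes, only an illustration of the size saving.

```python
def pack_bits(values, nbits):
    """Concatenate the low `nbits` of each value into one byte string.

    Assumption (illustrative only): values are laid out most significant
    bit first, and the final byte is zero-padded on the right.
    """
    acc = 0
    for v in values:
        if not 0 <= v < (1 << nbits):
            raise ValueError(f"{v} does not fit in {nbits} bits")
        acc = (acc << nbits) | v
    total_bits = nbits * len(values)
    pad = (-total_bits) % 8          # bits needed to reach a byte boundary
    acc <<= pad
    return acc.to_bytes((total_bits + pad) // 8, "big")


def unpack_bits(data, nbits, count):
    """Inverse of pack_bits: recover `count` values of `nbits` bits each."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << nbits) - 1
    return [(acc >> (total_bits - nbits * (i + 1))) & mask
            for i in range(count)]


# Four 12-bit samples: 8 bytes as 16-bit values, 6 bytes once packed.
samples = [0xABC, 0x123, 0xFFF, 0x000]
packed = pack_bits(samples, 12)
padded_size = 2 * len(samples)   # one 2-byte container per sample
packed_size = len(packed)        # 6
```

The same arithmetic applies to 4-bit pixels: two pixels fit in each byte instead of one.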
But what if the data are already packed in memory (our use case) and we want to keep them that way? Then I guess the n-bit filter would not be necessary? But how would we define the memory datatype?
Correct, the packing applies only to the file, not to memory. I think you have three options:
1. You live with the bloat in memory but control memory pressure by processing your data in batches/chunks.
2. You could keep your packed in-memory representation but use an opaque type as the memory type. You would then need to write (and register) a soft datatype conversion that temporarily "inflates" your packed in-memory representation into a bitfield.
3. You write your pre-packed chunks directly into the file; the dataset would register the n-bit filter as part of its metadata. Of course, if you read from this dataset, you'd face the same problem in reverse. You can either accept the bloat, have a soft datatype conversion that can handle the bitfield-to-opaque conversion, or read the chunk directly into memory and deal with the raw chunk.
I think we should go with option 3 since we have made the effort to write our (basic) processing with support for n-bit packed in-memory datasets.
Reading the datasets should be easy for users; there is no real-time constraint like the one we have on the data acquisition side. I mean, the expansion to 8 bits would be acceptable and user friendly. And since this option also lets users who want to work with the 4-bit packed in-memory datasets read the raw chunks, it sounds like a good compromise.
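For users who do read the raw packed chunks, the expansion to 8 bits is cheap to do by hand. Below is a rough pure-Python sketch of it, done batch by batch so the expanded copy never has to exist all at once. The names `expand_4bit` and `iter_expanded` are made up for illustration, and the nibble order (first pixel in the high nibble) is an assumption that would have to be checked against the detector format.

```python
# Lookup table: each packed byte maps to its two 4-bit pixels as two bytes.
# Assumption: the first pixel of a pair sits in the high nibble. Swap the two
# entries if the detector packs pixels in the opposite order.
_EXPAND = [bytes((b >> 4, b & 0x0F)) for b in range(256)]


def expand_4bit(packed):
    """Expand packed 4-bit pixels to one byte per pixel."""
    return b"".join(_EXPAND[b] for b in packed)


def iter_expanded(packed, batch_bytes=4096):
    """Yield expanded pixels batch by batch to bound peak memory use."""
    view = memoryview(packed)  # slicing a memoryview avoids copying the input
    for start in range(0, len(view), batch_bytes):
        yield expand_4bit(view[start:start + batch_bytes])


# Example: three packed bytes hold six pixels.
raw_chunk = bytes([0x12, 0xF0, 0x3A])
pixels = b"".join(iter_expanded(raw_chunk, batch_bytes=2))
# list(pixels) -> [1, 2, 15, 0, 3, 10]
```

In practice one would process each batch as it is yielded rather than joining them; the join here only shows the result.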