I have a device that provides packed 12-bit unsigned integers in little-endian order. That is, the byte sequence [0x1a, 0x2b, 0x3c] represents two 12-bit integers, [0xb1a, 0x3c2]. I would like to write this data to an HDF5 dataset with as little processing as possible, while allowing the user to retrieve the 12-bit integers easily.
The best solution I have come up with so far is to create a dataset of little-endian 24-bit unsigned integers. With the bytes above, the 24-bit integer would be 0x3c2b1a.
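As a sanity check on this layout, here is a minimal sketch in plain Python (standard library only, not HDF5 code). The helper name `unpack_u12_pairs` is made up for illustration. It reads each 3-byte group as a little-endian 24-bit integer and splits it into its lower and upper 12 bits.

```python
def unpack_u12_pairs(raw: bytes) -> list[int]:
    """Unpack little-endian packed 12-bit values, three bytes per pair.

    Hypothetical helper for illustration; not part of any HDF5 API.
    """
    if len(raw) % 3 != 0:
        raise ValueError("packed length must be a multiple of 3 bytes")
    values = []
    for i in range(0, len(raw), 3):
        # Interpret three bytes as one little-endian 24-bit integer.
        word = int.from_bytes(raw[i:i + 3], "little")
        values.append(word & 0xFFF)  # lower 12 bits: first sample
        values.append(word >> 12)    # upper 12 bits: second sample
    return values


packed = bytes([0x1A, 0x2B, 0x3C])
word = int.from_bytes(packed, "little")
samples = unpack_u12_pairs(packed)
print(hex(word), [hex(v) for v in samples])
```

With the bytes from the question, `word` is 0x3c2b1a and `samples` is [0xb1a, 0x3c2].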
Another solution I have considered is writing the raw bytes directly in chunks and setting the NBit filter. However, the NBit-compressed data seems to be big-endian: when I inspect a raw chunk, the NBit-packed byte sequence appears to be [0xb1, 0xa3, 0xc2].
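The difference between the two layouts can be reproduced with a short sketch in plain Python. It only mirrors the byte sequences quoted in this thread and makes no claim about how the filter is implemented internally. Both function names are hypothetical.

```python
def pack_u12_msb_first(values):
    """Concatenate pairs of 12-bit values most-significant-bit first."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 12-bit values")
    out = bytearray()
    for first, second in zip(values[0::2], values[1::2]):
        word = (first << 12) | second   # first value in the high bits
        out += word.to_bytes(3, "big")
    return bytes(out)


def pack_u12_little_endian(values):
    """Pack pairs of 12-bit values the way the device in the question does."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 12-bit values")
    out = bytearray()
    for first, second in zip(values[0::2], values[1::2]):
        word = first | (second << 12)   # first value in the low bits
        out += word.to_bytes(3, "little")
    return bytes(out)


msb_packed = pack_u12_msb_first([0xB1A, 0x3C2])
device_packed = pack_u12_little_endian([0xB1A, 0x3C2])
print(msb_packed.hex(), device_packed.hex())
```

The same two values give b1 a3 c2 in the most-significant-bit-first layout and 1a 2b 3c in the device layout, so the raw device bytes cannot simply be dropped into a chunk that expects the former.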
Is there a way to write [0x1a, 0x2b, 0x3c] to an HDF5 file and have it be read as [0xb1a, 0x3c2]?
Most common CPUs do not handle 12-bit integers natively, at least not that I’m aware of. End users will therefore always convert whatever packed format you choose into something like 32-bit integers (perhaps some CPUs can handle 16-bit integers efficiently) to do anything non-trivial.
Why not save the end-users the trouble of doing conversions and just store the integers as 32-bit ints in the HDF5 file? This wouldn’t require additional programming work beyond what you already propose.
Eventually the data would be unpacked into 16-bit integers for the end user. The problem is that I might not have time to unpack the data during acquisition. The goal is to avoid data duplication.
Compilers seem happy enough dealing with 3-byte, 24-bit integers. HDF5 lets me create a 3-byte unsigned integer datatype with 12-bit precision. That can be offset by 12 bits, so I can access either the even or the odd integers.
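One way to picture what precision and offset select is a shift followed by a mask. The sketch below is plain Python, not HDF5's conversion code, and assumes that the offset counts bits up from the least significant end of the 24-bit word. `extract_field` is a hypothetical name.

```python
def extract_field(word: int, precision: int, offset: int) -> int:
    """Return `precision` bits of `word`, starting `offset` bits above bit 0.

    Illustrative model only; assumes offset is measured from the
    least significant bit of the stored integer.
    """
    mask = (1 << precision) - 1
    return (word >> offset) & mask


word = 0x3C2B1A                       # 24-bit value from the question
lower = extract_field(word, 12, 0)    # precision 12, offset 0
upper = extract_field(word, 12, 12)   # precision 12, offset 12
print(hex(lower), hex(upper))
```

Under that assumption, offset 0 yields 0xb1a (the even-indexed sample) and offset 12 yields 0x3c2 (the odd-indexed sample).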
Is there a way to create a virtual dataset with a different type than the original dataset? In this case I would have two virtual datasets with 12-bit precision pointing at a 24-bit dataset, each selecting either the upper or the lower 12 bits.
Why not just convert the 12-bit unsigned integers individually into 16-bit unsigned integers, then write the latter directly into HDF5? This is one of the HDF5 pre-defined data types. This satisfies both of your goals “as little processing as possible” and easy retrieval as 16-bit integers for the end user. Endianness will be handled automatically for you.
You can optionally enable the N-bit filter on write, to tell HDF5 to internally pack to 12 bits and save space. Read back will automatically unpack, with no special action needed by the end user.
Yes, this makes sense when I can perform compression after acquisition. I have developed SIMD-based routines to unpack the 12-bit integers to 16-bit integers, but this leaves little time for compression while acquiring the data.
Without compression, the data grows by a factor of 1.33 (16 bits stored per 12-bit sample), which can be problematic at terabyte to petabyte scales.
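The arithmetic behind that factor, as a small sketch in plain Python. The sample count of 10^12 is a made-up figure chosen only to show the scale, not a number from the acquisition system.

```python
from fractions import Fraction

PACKED_BITS = 12     # bits per sample as delivered by the device
UNPACKED_BITS = 16   # bits per sample after widening to 16-bit integers


def storage_bytes(n_samples: int, bits_per_sample: int) -> int:
    """Bytes needed for n_samples, rounded up to a whole byte."""
    return (n_samples * bits_per_sample + 7) // 8


growth = Fraction(UNPACKED_BITS, PACKED_BITS)   # exact ratio 4/3

n_samples = 10**12   # hypothetical sample count, for illustration only
packed = storage_bytes(n_samples, PACKED_BITS)
unpacked = storage_bytes(n_samples, UNPACKED_BITS)
extra = unpacked - packed
print(growth, packed, unpacked, extra)
```

For that hypothetical count, the packed data takes 1.5 TB and the 16-bit form takes 2 TB, so widening costs an extra 0.5 TB.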
To narrow the question, is there a better way than describing this packed 12-bit format as 24-bit unsigned integers?
If H5CPP is an option, you basically have to give me the specification for your custom datatype, preferably by modifying one of the provided examples, and I will show you how to do it. Please do provide the Julia code as well: our workflow may be similar, namely re-exporting to “C”, compiling the shared object, then calling it from Julia. For good performance, pack them as a vector. I am sure you know this if you can handle SIMD.
I meant to post this much earlier, but got distracted.
The essence is the Julia code below.
using HDF5: API # Assumes the HDF5.jl package, whose low-level bindings live in HDF5.API
dt = API.h5t_copy(API.H5T_STD_U32LE)
API.h5t_set_size(dt, 3) # This is basically the U24LE datatype at this point
u24le = API.h5t_copy(dt)
API.h5t_set_precision(dt, 12) # This lets me grab the lower 12 bits as a (16-bit) integer
lower_u12le = API.h5t_copy(dt)
API.h5t_set_offset(dt, 12) # This lets me grab the upper 12 bits as a (16-bit) integer
upper_u12le = API.h5t_copy(dt)
If I wanted both numbers, then I would use the u24le type and select the values using either dset .& 0xfff or dset .>>> 12. The other strategy I have is to keep the data as an external file, and then create two datasets, one with a lower_u12le datatype and the other with an upper_u12le datatype. With either approach, I would have to interleave the data manually.
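The manual interleaving step could look roughly like the following sketch in plain Python, using lists in place of dataset reads. The second word, 0x654321, is invented test data, and the function names are hypothetical.

```python
def split_words(words):
    """Split 24-bit words into lower and upper 12-bit samples."""
    lower = [w & 0xFFF for w in words]   # same idea as dset .& 0xfff
    upper = [w >> 12 for w in words]     # same idea as dset .>>> 12
    return lower, upper


def interleave(lower, upper):
    """Restore acquisition order: lower[0], upper[0], lower[1], upper[1], ..."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    out = [0] * (2 * len(lower))
    out[0::2] = lower   # even positions hold the lower 12-bit samples
    out[1::2] = upper   # odd positions hold the upper 12-bit samples
    return out


words = [0x3C2B1A, 0x654321]   # second word is made-up example data
lower, upper = split_words(words)
samples = interleave(lower, upper)
print([hex(v) for v in samples])
```

The same `interleave` step applies whether the two halves come from masking a u24le read or from two separate reads using the lower and upper 12-bit datatypes.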