2-bit integers supported?

nickp · February 27, 2020, 5:41am

I want to store a large rectangular matrix of 2-bit unsigned integers.
Efficient storage is important – so 4 entries/byte.

Is this possible? I’m an HDF5 newbie so apologies if this is well covered.

gheber · February 27, 2020, 3:06pm

HDF5 user-defined atomic types won’t do the packing. An opaque type might be your best bet. You’d have to keep some metadata to document your layout. G.
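To make the "do the packing yourself" part concrete, here is a minimal C++ sketch of a matrix that stores four 2-bit values per byte. The class name `PackedMatrix2` and the bit layout (row-major, element 0 in the lowest two bits of each byte) are assumptions for illustration only; they are not part of HDF5 or H5CPP, and whichever layout you pick is exactly what the metadata should document.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of HDF5 or H5CPP: a row-major matrix of
// 2-bit unsigned values, packed four per byte.
// Assumed layout: element i lives in byte i/4, at bit offset 2*(i%4).
class PackedMatrix2 {
public:
    PackedMatrix2(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), bytes_((rows * cols + 3) / 4, 0) {}

    // Read element (r, c); the result is in the range 0..3.
    std::uint8_t get(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        const std::size_t i = r * cols_ + c;
        return static_cast<std::uint8_t>((bytes_[i / 4] >> shift(i)) & 0x3u);
    }

    // Write element (r, c); v must fit in two bits.
    void set(std::size_t r, std::size_t c, std::uint8_t v) {
        assert(r < rows_ && c < cols_ && v < 4);
        const std::size_t i = r * cols_ + c;
        std::uint8_t& b = bytes_[i / 4];
        const unsigned cleared = b & ~(0x3u << shift(i));  // zero the old two bits
        b = static_cast<std::uint8_t>(cleared | (static_cast<unsigned>(v) << shift(i)));
    }

    // Raw packed buffer: this is what would be written as the opaque dataset.
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    static unsigned shift(std::size_t i) { return 2u * static_cast<unsigned>(i % 4); }

    std::size_t rows_;
    std::size_t cols_;
    std::vector<std::uint8_t> bytes_;
};
```

With this layout a 3 x 5 matrix needs (15 + 3) / 4 = 4 bytes, and the buffer returned by `bytes()` can be handed to HDF5 as single-byte opaque elements.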

steven · February 27, 2020, 5:12pm

Yes, here is an example in C++ with H5CPP. Notice that the H5T_OPAQUE type length is set to 1, i.e. a single byte.

h5dump -pH example.h5

HDF5 "example.h5" {
GROUP "/" {
   DATASET "data" {
      DATATYPE  H5T_OPAQUE {
         OPAQUE_TAG "bitstring::two_bit";
      }
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 5
         OFFSET 2048
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }
}
}

And here is the abridged version of how to make it happen:

[...]
#include <h5cpp/core> // include this before custom type definition
/* you would place these in separate header file */
namespace bitstring {
	struct two_bit { // wrapper to aid C++ template mechanism, zero runtime cost
		[...]
		unsigned char value;
	};
}

// BEGIN H5CPP SPECIFIC CUSTOM TYPE DEFINITION
namespace h5::impl::detail {
	template <> struct hid_t<bitstring::two_bit, H5Tclose,true,true, hdf5::type> : public dt_p<bitstring::two_bit> {
		using parent = dt_p<bitstring::two_bit>;  // h5cpp needs the following typedefs
		using parent::hid_t;
		using hidtype = bitstring::two_bit;

		// an opaque type has no byte order; with a single byte it would
		// not be relevant anyway
		hid_t() : parent( H5Tcreate( H5T_OPAQUE, 1) ) { // 1 == single byte; I would pack it into 64 bits though
			H5Tset_tag(handle, "bitstring::two_bit");
			hid_t id = static_cast<hid_t>( *this );
		}
	};
}
namespace h5 {
	template <> struct name<bitstring::two_bit> {
		static constexpr char const * value = "bitstring::two_bit";
	};
}
// END H5CPP SPECIFIC TYPE DEFINITION
#include <h5cpp/io> // IO operators become aware of your custom type

int main(){
	namespace nm = bitstring;

	h5::fd_t fd = h5::create("example.h5",H5F_ACC_TRUNC);
	// prints out type info, eases on debugging
	std::cout << h5::dt_t<nm::two_bit>() << std::endl;

	std::vector<nm::two_bit> vec = {0xff,0x0f,0xf0,0x00,0b0001'1011};

	/* H5CPP operators are aware of your datatype and will do the right thing
	 */
	h5::write(fd,"data", vec); // single shot write
	auto data = h5::read<std::vector<nm::two_bit>>(fd, "data");

	for( int i=0; i<vec.size(); i++ )
		std::cout << "[" << i << ": " << vec[i] << " "  <<"]";
	std::cout << "\n\ncomputing difference ||saved - read|| expecting norm to be zero:\n";
	for( int i=0; i<vec.size(); i++ )
		std::cout << abs(vec[i].value - data[i].value) <<" ";
}

H5CPP docs are here
best:steve

dave.allured · February 27, 2020, 7:26pm

Also consider the N-bit filter with the bit size (i.e. precision) set to 2 bits. See User's Guide section 5.6.1. Once the empty data set is created, you would read and write values as ordinary integer data types such as 8, 16, or 32 bit integers.

The advantage is that byte packing, unpacking, and indexing become invisible to the user program. The filter takes care of all that. The matrix is accessed with the user’s native matrix indices 0…N-1 with no translation needed for reading or writing.

This depends on how the matrix is stored in user program memory. If the matrix is already dense packed 4 elements per byte, then you would need to unpack before using this HDF5 filter method. This method would work best if storage in the user program is simply native 8, 16, or 32 bit integers.
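If the in-memory matrix is already dense packed, the unpack step mentioned above could look roughly like the following sketch. The function name `unpack2` and the bit layout (element 0 in the lowest two bits of each byte) are assumptions for illustration; adjust the shift if your packing order differs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of HDF5: expand `count` 2-bit values, packed
// four per byte, into one value per byte (each in the range 0..3).
// Assumed layout: element i lives in byte i/4, at bit offset 2*(i%4).
std::vector<std::uint8_t> unpack2(const std::vector<std::uint8_t>& packed,
                                  std::size_t count) {
    assert(packed.size() * 4 >= count);  // buffer must hold at least `count` values
    std::vector<std::uint8_t> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned shift = 2u * static_cast<unsigned>(i % 4);
        out[i] = static_cast<std::uint8_t>((packed[i / 4] >> shift) & 0x3u);
    }
    return out;
}
```

The resulting buffer of plain 8-bit integers is the kind of memory the N-bit filter approach expects, at the cost of a temporary that is four times larger than the packed data.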

nickp · February 27, 2020, 8:49pm

Thank you (and Dave Allured) for these super helpful comments.

Nick

steven · February 28, 2020, 2:02am

Hey Nick, here is the H5CPP + nbit version of @dave.allured's idea. I am not aware of the performance difference between these methods; if you want to be thorough you may want to examine compression filters as well.

In any event this is your custom type descriptor:

namespace h5::impl::detail {
	template <> struct hid_t<bitstring::n_bit, H5Tclose,true,true, hdf5::type> : public dt_p<bitstring::n_bit> {
		using parent = dt_p<bitstring::n_bit>;  // h5cpp needs the following typedefs
		using parent::hid_t;
		using hidtype = bitstring::n_bit;

		// single-byte integer with precision reduced to 2 bits;
		// byte order is not relevant for a single byte
		hid_t() : parent( H5Tcopy( H5T_NATIVE_UCHAR) ) {
			H5Tset_precision(handle, 2);
			hid_t id = static_cast<hid_t>( *this );
		}
	};
}

Then you have the choice of whether to use STL, Armadillo, Eigen, or something else. In any event you have to grab the pointer to the memory location, or in some cases it works directly, as this Eigen3 example shows:

namespace ei {
	template <class T>
	using Matrix   = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
}
namespace bs = bitstring;

[...]
ei::Matrix<bs::n_bit> M(12,8);
[...]
// create dataset, and hold on to `ds` handle
h5::ds_t ds = h5::create<bs::n_bit>(fd, "eigen", // chunk must be used with nbit
   h5::current_dims{12,8}, h5::max_dims{12,H5S_UNLIMITED}, h5::chunk{3,4} | h5::nbit);
// you can do efficient partial IO or just write the data in single shot:
h5::write(ds, M);
// read back data
ei::Matrix<bs::n_bit> data(12,8);
h5::read(fd, "eigen", data, h5::offset{0,0}); 
// control position, size is known/computed from datastructure

happy computing:
steve
