std::complex support for H5CPP C++


#1

Complex numbers are prevalent in signal processing and, more generally, in scientific computing. BLAS/LAPACK support complex numbers, so it is reasonable to be able to store the contents of objects with complex datatypes in a straightforward way. While the HDF5 C API doesn’t provide predefined complex datatypes (i.e. there is no H5T_IEEE_C32LE), the library allows creating a custom representation of such types.
Since the in-memory layout of std::complex is guaranteed to be contiguous by the C++11 standard (section §26.4, see these Stack Overflow notes), the implementation is indeed straightforward and will follow the pattern of the newly added half-float datatypes.

This thread is open for comments/observations on the upcoming H5CPP C++ std::complex support.

best wishes
steven


Half-precision floating-point format H5CPP C++
#2

Hi Steven,
You can find our implementation here:

We essentially extended HDF5’s C++ API and added support for complex float16 as well.
The definitions we used are consistent with h5py’s definition: compound data types. These datatypes are also recognized correctly as complex types for geospatial imagery by GDAL.

Piyush


#3

Thanks a lot! I just noticed you already provided the pointer to what I needed. Will take a closer look and provide a compatible implementation.
best:
steve


#4

@werner I moved the comment on complex number here, to keep it in one place.

You are right: complex numbers, dual numbers, quaternions, … oh my! Storing and retrieving them in a unified way is hard to do. Instead, I take a set of libraries and furnish them with a unified, efficient IO layer, adding to their work what I have in surplus: expertly tuned, HPC-quality IO routines for modern C++. Similarly to the already existing linear algebra layer, if you save an object from one library, you should be able to retrieve it from a different one. What makes it more complex is that I also work with Julia, and sometimes R and Python, and I try to keep the doors open for a unified object model across platforms.

Let’s look at only complex numbers for now: {real, imaginary}. Possible implementations:

  1. a record type in contiguous layout, with real and imaginary stored consecutively and the fields given names; great for an event-based model, but 50% of throughput is lost when accessing only the imaginary part.
  2. a block of real numbers and a block of imaginary numbers: this requires a filtering technique and provides a good balance between subsetting of fields and keeping 100% throughput in both cases. Harder to get right… (I have a non-public multithreaded implementation for this.)
  3. two datasets within a directory: crude but simple and less opaque than the previous option; requires interleaving for event-based access, resulting in increased complexity on the consumer side.

The upcoming release will treat complex numbers in a flexible way, providing options 1 and 3:

using complex = std::complex<long double>;
std::array<complex,N> data; 
auto fd = h5::open(...);
h5::write(fd, "my_data", data, h5::name{"real","imaginary"},
   h5::single | h5::multi, [...]);

Here single | multi refers to whether we want a single dataset of an HDF5 compound type, or multiple datasets within a group/directory. Since you are moving along a tradeoff curve of throughput vs. flexibility, I provide a choice. While one could argue that this is overkill in the case of complex numbers, keep in mind that other objects follow the same pattern: certain sparse matrix layouts, hash tables, … .

To take this further, @miller86 Mark Miller pointed out the advantage of saving property lists within the container. Currently H5CPP relies on attributes and custom dapl-s, but to take it to the next level, Mark’s ideas are worth exploring.

hope it wasn’t too long,
steve


#5

No, it’s not too long, and it’s shorter than my comments to come.

Firstly, that single vs. multi option is not overkill but actually quite essential to support different memory layouts of the same semantics. I have the same capability in my F5 library, where it’s called a “separated compound” versus a “contiguous” data field: https://www.fiberbundle.net/doc/group__F5F.html#gaecac106db677394fe48c187d6007fa9a

One of the use cases for such a separated compound / h5::multi layout is also partial data, as when only the real or only the imaginary part is stored in a file while the other part is constant for that particular data set. Doing that with a contiguous / h5::single layout is not so easy, and it becomes more essential for higher-dimensional datatypes, of course.

Btw., talking about higher-dimensional data types: “real” and “imaginary” are a bad choice of component names for complex numbers, because that naming notation does not scale to higher dimensions. In 3D there are three imaginary numbers, so an “imaginary” component wouldn’t be unique any more. Within the context of quaternion algebra, those three imaginary numbers are called i, j, k. But that’s also not a scalable notation: in 4D there are six imaginary numbers. Those imaginary numbers correspond to the number of possible planes that can be constructed in an n-dimensional space: in 2D there is only one possible plane, the XY plane. In 3D there are three possibilities, the XY, YZ and ZX planes. In 4D, using T as a 4th coordinate, there are XY, YZ, ZX, XT, YT and ZT, i.e. six planes, which have the algebraic properties of the imaginary unit within the framework of Geometric Algebra. The 4D case corresponds to bi-quaternions, or complex quaternions. The scheme easily extends to arbitrary higher dimensions.

So if you name the components of a complex number not {“real”,“imaginary”} but {“x”,“y”} instead, or {“x1”,“x2”}, or {“e1”,“e2”}, or some similar scheme, then they are automatically part of a bigger scheme: any quaternion or bi-quaternion (or GA) library could just read complex numbers as a quaternion (or higher-dimensional spinor), with the higher-dimensional components being zero for this dataset, yet interpretable without any special consideration, just part of the scheme. The other way round, a quaternion algorithm can write its data, and a library that only knows complex numbers in 2D can read the 2D subset right away.

A unified object model is certainly a very desirable goal to achieve, so unifying data even on a mathematical level first makes total sense here. That is just the objective of Geometric Algebra, which I like to advocate at this opportunity. For instance, see http://geocalc.clas.asu.edu/html/Overview.html ; especially Hestenes’ Oersted Medal lecture, as referenced from that page, is a good, inspiring first read. Utilizing a scalable notation that places mathematical objects into a broader context doesn’t have any performance impact (HDF5 doesn’t care how objects are named), but a scalable scheme opens the door wider to more applications on the semantic level.

However, there is an aspect that impacts performance considerations: reading data with mixed precisions. If you just use complex<double>, then everything is easy, as it’s just one data type. But the complexity comes in if some data set is stored as complex<float>, which after all saves a factor of 2 in storage space. Now if you want an application that is able to read both kinds of complex datasets, then every binary operation, e.g. just “+”, needs to support complex<float> + complex<float>, complex<float> + complex<double>, complex<double> + complex<float> and complex<double> + complex<double>, leading to quite the type explosion, with lengthy compile times and larger binaries if all binary operators are to support all possible combinations at runtime, e.g. implemented by iterating over type lists. Even more fun when also supporting long double and all the diverse integer numerical types here… see also section 3.2.3 in https://www.researchgate.net/publication/315881612_Massive_Geometric_Algebra_Visions_for_C_Implementations_of_Geometric_Algebra_to_Scale_into_the_Big_Data_Era on this matter. Even more, HDF5 could store the real part of a complex number in double precision and the imaginary part in single precision. That makes total sense to save storage space where possible, even though it’s not supported by C++ complex<> numbers. So in the end, writing data is rather easy, but reading data is the hard part.

So to make life simpler, it’s just easier to read all floating-point data into double precision, i.e. read complex<float> as complex<double>, so the application only needs to deal with one single type instead of suffering the type explosion needed to handle all possible combinations. HDF5 can already do that.

But actually I ran into trouble here: reading a struct { float x, y; } compound type from file into a struct { double x, y; } turned out to be extremely slow, about 100x slower than reading floats as float or doubles as double. It is fast for non-compound data sets (the separated compound / h5::multi data layout), but converting members of compound data types was so slow that I ended up implementing the conversion myself instead of letting HDF5 do it. Have you encountered similar issues? Maybe it’s fixed by now; that was my observation with HDF5 1.8.17. Just saying, there may be performance issues in areas where you don’t expect them…

     Werner

#6

Hello Werner,

“But actually I ran into troubles here […] extremely slow […]. Have you encountered similar issues?”

Yes. In response I wrote an independent bare-metal data pipeline based on BLAS Level 3 blocking, and released part of it when The HDF Group moved the private direct chunk read|write into the public API. Here are the documentation and presentation slides. [Many thanks to @gheber Gerd Heber for directing my attention to those API calls.] This mechanism has exceptional IO properties but requires the newest HDF5 C API, or back-porting this modification, to allow custom dapl-s to work properly.

Mixed-precision compounds like the one below are currently supported with compiler-assisted reflection (the ISC’19 slides are here), but you can’t use std::complex<T> unless you do a copy. Also, BLAS/LAPACK will not process this custom complex number format.

...
struct my_complex {
   double x;
   float y;
};
auto data = h5::read<std::vector<my_complex>>(fd, "dataset" [, ...]);
...

Complex numbers of higher dimensions: if you are interested, we could discuss this over a conference call.

best,
steve


#7

Sure. H5CPP looks interesting, as you address and solve various issues that I need solved too, so maybe I can utilize it in my applications. Currently I’m using a C library on top of HDF5 to specify a data layout within a broader context, handling entire topological schemes between geometric objects, where data types are just a minor component of defining fields on topological spaces and relationships between them. It seems H5CPP can interact well with the C API, so maybe it can be used for that. The main challenge, though, is more about reading data while covering many cases at runtime within the same application.

Possibly, consider meeting at the CGI20 conference https://www.cgs-network.org/cgi20/#features where I’ll be involved in organizing a workshop on Geometric Algebra (it’s not yet announced on their website, but that will come soon).

      Werner