HDF5 RFC: Adding support for 16-bit floating point and Complex number datatypes to HDF5


Thanks! I opened the referenced GitHub issue, and I'm happy to see this RFC. I read the PDF and skimmed this discussion, though I haven't watched the video.

While the RFC implements 16-bit floats and complex numbers, it doesn't appear to implement them in combination. I'm not sure about C standard compliance, but GCC at least lets me write _Float16 _Complex (example). C++23 has a 16-bit float type, and it seems to play nicely with the complex type, so I can write std::complex<std::float16_t>.

I also have a use for even weirder types, like struct { uint16_t r, i; } to store instrument data, which I don't expect to get a corresponding native HDF5 type. As with quaternions, I think it makes sense to draw the line at types that have support in language standards.

Overall, I'm quite happy with the RFC since it offers a fast code path for float16-to-float32 conversions, which was the biggest pain point for me. I'm also glad to see HDF5 types for the most common complex types.

P.S. I should also mention that, for the moment, I've worked around the float16 bottleneck by storing data in float32 and zeroing out the least significant mantissa bits. Then I write the dataset with a gzip compression filter and get pretty compact storage that is efficient to read.

While the RFC implements 16-bit floats and complex numbers, it doesn't appear to implement them in combination.

It should be fairly straightforward to support this once 16-bit float support is done, but it does add to the testing matrix and makes things a bit messy, since the _Float16 type isn't part of the core C standard and we have to use it conditionally in the library. I'm open to the idea, though.

Overall, I'm quite happy with the RFC since it offers a fast code path for float16-to-float32 conversions, which was the biggest pain point for me.

With these conversion paths now in place, the conversion time appears to be a bit less than half of what it was, but it's still around 8x slower than the other parts of your C example. Conversions on compound datatypes end up being slow due to repeated ID lookups; conversion between flat 16-bit and 32-bit floating-point types is fast, though. I'm hoping to optimize these ID lookups as part of implementing support for complex numbers.

Hello, I'm checking back in to ask about the status of this RFC. I've just come up with a use case for 16-bit floating point, so I'm now looking forward to both features.

Oh, I missed the announcement. Thanks! So float16 is there but not complex? Is there a schedule for adding complex support?

Hi @tobias,

I actually plan to have a PR out adding complex number support to the develop branch by the end of this week, or a few days after if I run into issues. We're currently discussing what the release of the feature will look like, since it will need to go into a major release of HDF5 due to changes in the datatype encoding version number. While not exactly a file format change, the issue is that if the feature went into a 1.14 release, there would be an awkward situation where users could accidentally create complex number datasets that can't be read by an earlier 1.14 release. The library version bounds "high" setting also wouldn't be able to prevent the application from creating an object that's unreadable with even older versions of the library. We plan to make the next major release of HDF5 as easy as possible to upgrade to from the 1.14 releases, with very little in the way of major changes outside of complex numbers.


That sounds great to me. I look forward to 1.15 then!

Amazing news! Could you outline what this support will look like? It sounds like you are implementing it as a new datatype. Will there be built-in facilities for converting to/from existing conventions (like the {.r, .i} compound type used by h5py)?

Hi @peter.hill,

Indeed, after a lot of internal discussion it made sense to implement support as a new datatype class. The suggestion above, to implement support for attributes on in-memory datatypes and keep representing complex numbers as compound datatypes with attributes, makes a lot of sense, but that approach would have been a fair bit more work and didn't quite align with the timeline and goals of implementing complex number support. I'm hoping to work on improving the performance of compound datatype conversions in the near future to address some of the concerns around the compound datatype approach. At that point, some specific custom conversion routines should help with conversions between complex number representations until we can look into the compound datatype approach further.

I'm currently working out some last issues surrounding the datatype version encoding change I mentioned previously. I've added macros mapping to predefined HDF5 datatypes for the three native C complex number types (float/double/long double _Complex), as well as six macros for complex number types of the IEEE float formats: F16LE/BE, F32LE/BE and F64LE/BE. Support has been added to h5dump, h5ls and h5diff/ph5diff, which currently print values in "a+bi" format; this can be expanded on later, after the main code is merged. Note this is just test data, so the values aren't very interesting, but here's an example:

HDF5 "tcomplex.h5" {
DATASET "/DatasetFloatComplex" {
   DATATYPE  H5T_CPLX_IEEE_F32LE
   DATASPACE  SIMPLE { ( 10, 10 ) / ( 10, 10 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      SIZE 800
      OFFSET 2048
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  -1+1i
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_LATE
   }
   DATA {
   (0,0): 10+0i, 1+1i, 2+2i, 3+3i, 4+4i, 5+5i, 6+6i, 7+7i, 8+8i, 9+9i,
   (1,0): 9+0i, 1.1+1.1i, 2.1+2.1i, 3.1+3.1i, 4.1+4.1i, 5.1+5.1i, 6.1+6.1i,
   (1,7): 7.1+7.1i, 8.1+8.1i, 9.1+9.1i,
   (2,0): 8+0i, 1.2+1.2i, 2.2+2.2i, 3.2+3.2i, 4.2+4.2i, 5.2+5.2i, 6.2+6.2i,
   (2,7): 7.2+7.2i, 8.2+8.2i, 9.2+9.2i,
   (3,0): 7+0i, 1.3+1.3i, 2.3+2.3i, 3.3+3.3i, 4.3+4.3i, 5.3+5.3i, 6.3+6.3i,
   (3,7): 7.3+7.3i, 8.3+8.3i, 9.3+9.3i,
   (4,0): 6+0i, 1.4+1.4i, 2.4+2.4i, 3.4+3.4i, 4.4+4.4i, 5.4+5.4i, 6.4+6.4i,
   (4,7): 7.4+7.4i, 8.4+8.4i, 9.4+9.4i,
   (5,0): 5+0i, 1.5+1.5i, 2.5+2.5i, 3.5+3.5i, 4.5+4.5i, 5.5+5.5i, 6.5+6.5i,
   (5,7): 7.5+7.5i, 8.5+8.5i, 9.5+9.5i,
   (6,0): 4+0i, 1.6+1.6i, 2.6+2.6i, 3.6+3.6i, 4.6+4.6i, 5.6+5.6i, 6.6+6.6i,
   (6,7): 7.6+7.6i, 8.6+8.6i, 9.6+9.6i,
   (7,0): 3+0i, 1.7+1.7i, 2.7+2.7i, 3.7+3.7i, 4.7+4.7i, 5.7+5.7i, 6.7+6.7i,
   (7,7): 7.7+7.7i, 8.7+8.7i, 9.7+9.7i,
   (8,0): 2+0i, 1.8+1.8i, 2.8+2.8i, 3.8+3.8i, 4.8+4.8i, 5.8+5.8i, 6.8+6.8i,
   (8,7): 7.8+7.8i, 8.8+8.8i, 9.8+9.8i,
   (9,0): 1+0i, 1.9+1.9i, 2.9+2.9i, 3.9+3.9i, 4.9+4.9i, 5.9+5.9i, 6.9+6.9i,
   (9,7): 7.9+7.9i, 8.9+8.9i, 9.9+9.9i
   }
   ATTRIBUTE "AttributeFloatComplex" {
      DATATYPE  H5T_CPLX_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
      DATA {
      (0,0): -1+1i
      }
   }
}
}

Datatype conversions between all the usual C types (int, long, float, long double, etc.) have been added, including _Float16 when support for it is available in the library (though note those conversions may be a bit slower, since there's currently no standard C float16 complex number type).

For conversions between existing conventions, I've implemented no-op conversions as long as the data follows these rules (which can also be expanded upon as needed):

  • An array datatype must consist of exactly two elements where each element is of the
    same floating-point datatype as the complex number datatype's base floating-point
    datatype.

  • A compound datatype must consist of two fields where each field is of the same
    floating-point datatype as the complex number datatype's base floating-point
    datatype. The compound datatype must not have any leading or trailing structure
    padding or any padding between its two fields. The fields must also have compatible
    names, must have compatible offsets within the datatype and must be in the order
    of "real" part → "imaginary" part, such that the compound datatype matches the
    following representation:

    H5T_COMPOUND {
        <float_type> "r(e)(a)(l)";                OFFSET 0
        <float_type> "i(m)(a)(g)(i)(n)(a)(r)(y)"; OFFSET SIZEOF("r(e)(a)(l)")
    }
    

    where "r(e)(a)(l)" means the field may be named any substring of "real", such as
    "r" or "re", and "i(m)(a)(g)(i)(n)(a)(r)(y)" means the field may be named any
    substring of "imaginary", such as "im" or "imag".

I've confirmed the conversions work as expected with test data, but I'm also looking for any real h5py-written data files, just to make sure I'm not overlooking anything. Let me know if you can point me to some!

While I don't expect much to change, note that this is all subject to revision during review, and also if something about the complex type support doesn't work well for an application or is awkward to use.


Really incredible stuff, thank you so much!

Here's a real-world file generated using h5py:

scotty_output.nc (436.7 KB)

The variable /analysis/H_1_Cardano is complex.

Thanks for the file! I verified that the data can be read directly into a double _Complex buffer (on my machine) through the no-op conversion path, and the output matches the data in the file. Here's the simple C program I used, as an example of what this looks like once the changes are merged:

read_scotty_output.c (790 Bytes)

which gives the output:

DATA: [
(1+0i), 
(0.953434+4.33681e-19i), 
(0.909362+8.67362e-19i), 
(0.867738+0i), 
(0.828514-1.73472e-18i), 
(0.791641+0i), 
(0.757062-6.93889e-18i), 
(0.724719+3.46945e-18i), 
(0.694546+0i), 
(0.666468+3.46945e-18i), 
(0.640409-3.46945e-18i), 
(0.616281+6.93889e-18i), 
(0.593994+0i), 
(0.573452-6.93889e-18i), 
(0.554554+6.93889e-18i), 
(0.537199+6.93889e-18i), 
(0.521284+0i), 
(0.506711+1.38778e-17i), 
(0.493378+1.38778e-17i), 
(0.481193+0i), 
(0.470062+2.77556e-17i), 
(0.459902+0i), 
(0.450631+0i), 
(0.442174+1.38778e-17i), 
(0.434462+1.38778e-17i), 
(0.427432+0i), 
(0.421025+4.16334e-17i), 
(0.415189+1.38778e-17i), 
(0.409875-1.38778e-17i), 
(0.405041+1.38778e-17i), 
(0.400647+0i), 
(0.396658+2.77556e-17i), 
(0.393042-4.16334e-17i), 
(0.38977+0i), 
(0.386818-2.77556e-17i), 
(0.384161-2.77556e-17i), 
(0.381781+0i), 
(0.379658-2.77556e-17i), 
(0.377775+2.77556e-17i), 
(0.37612+5.55112e-17i), 
(0.374679+0i), 
(0.373441+0i), 
(0.372396-2.77556e-17i), 
(0.371536+0i), 
(0.370853+0i), 
(0.37034+0i), 
(0.369992+0i), 
(0.369806-2.77556e-17i), 
(0.369776+0i), 
(0.369901+5.55112e-17i), 
(0.370023+2.77556e-17i), 
(0.370178+0i), 
(0.370607+2.77556e-17i), 
(0.371187-2.77556e-17i), 
(0.371918+0i), 
(0.372802-2.77556e-17i), 
(0.373841+0i), 
(0.375038+0i), 
(0.376397-2.77556e-17i), 
(0.377922-2.77556e-17i), 
(0.379619-5.55112e-17i), 
(0.381496+0i), 
(0.38356-2.77556e-17i), 
(0.38582+0i), 
(0.388287+0i), 
(0.390973-2.77556e-17i), 
(0.393891+0i), 
(0.397058+0i), 
(0.400492+0i), 
(0.404213+0i), 
(0.408243+0i), 
(0.412609+0i), 
(0.417338+0i), 
(0.422463+4.16334e-17i), 
(0.428021+0i), 
(0.434052-4.16334e-17i), 
(0.440602+1.38778e-17i), 
(0.447722+0i), 
(0.45547+1.38778e-17i), 
(0.46391+1.38778e-17i), 
(0.473114+2.77556e-17i), 
(0.483162+1.38778e-17i), 
(0.494143+0i), 
(0.506153+4.16334e-17i), 
(0.519302-1.38778e-17i), 
(0.533707+0i), 
(0.549494-2.77556e-17i), 
(0.566803+1.38778e-17i), 
(0.585777+0i), 
(0.606571-1.38778e-17i), 
(0.629344-2.08167e-17i), 
(0.654258-1.38778e-17i), 
(0.681478-6.93889e-18i), 
(0.711165+0i), 
(0.743478+0i), 
(0.778569+3.46945e-18i), 
(0.81658+1.73472e-18i), 
(0.857645-1.73472e-18i), 
(0.901887+0i), 
(0.949421+0i)
]