HDF5 RFC: Adding support for 16-bit floating point and Complex number datatypes to HDF5


Thanks! I opened the referenced GitHub issue, and I'm happy to see this RFC. I read the PDF and skimmed this discussion, though I haven't watched the video.

While the RFC implements 16-bit floats and complex numbers, it doesn't appear to implement them in combination. I'm not sure about C standard compliance, but GCC at least lets me write _Float16 _Complex (example). C++23 has a 16-bit float type, and it seems to play nicely with the complex type, so I can write std::complex<std::float16_t>.

I also have a use for even weirder types, like struct { uint16_t r, i; } to store instrument data, which I don't expect to get a corresponding native HDF5 type. As with quaternions, I think it makes sense to draw the line at types that have support in language standards.

Overall, I'm quite happy with the RFC since it offers a fast code path for float16-to-float32 conversions, which was the biggest pain point for me. I'm also glad to see HDF5 types for the most common complex types.

P.S. I should also mention that, for the moment, I've worked around the float16 bottleneck by storing data in float32 and zeroing out the least significant mantissa bits. Then I write the dataset with a gzip compression filter and get pretty compact storage that is efficient to read.

While the RFC implements 16-bit floats and complex numbers, it doesn't appear to implement them in combination.

It should be fairly straightforward to support this once 16-bit float support is done, but it does add to the testing matrix and makes things a bit messy, since the _Float16 type isn't part of the core C standard and we have to use it conditionally in the library. I'm open to the idea, though.

Overall, I'm quite happy with the RFC since it offers a fast code path for float16-to-float32 conversions, which was the biggest pain point for me.

With these conversion paths now in place, the conversion time appears to be a bit less than half of what it was, but it's still around 8x slower than the other parts of your C example. Conversions on compound datatypes end up being slow due to repeated ID lookups; conversion between flat 16-bit and 32-bit floating-point types is fast, though. I'm hoping to optimize these ID lookups as part of implementing support for complex numbers.

Hello, I'm checking back in to ask about the status of this RFC. I've just come up with a use case for 16-bit floating point, so I'm now looking forward to both features.

Oh, I missed the announcement. Thanks! So float16 is there but not complex? Is there a schedule for adding complex support?

Hi @tobias,

I actually plan to have a PR out adding complex number support to the develop branch by the end of this week, or a few days after if I run into issues. We're currently discussing what the release of the feature will look like, since it will need to go into a major release of HDF5 due to changes in the datatype encoding version number. While not exactly a file format change, the issue is that if the feature went into a 1.14 release, there would be an awkward situation where users could accidentally create complex number datasets that can't be read by an earlier 1.14 release. The library version bounds "high" setting also wouldn't be able to prevent the application from creating an object that's unreadable with even older versions of the library. We plan to make the next major release of HDF5 as easy as possible to upgrade to from the 1.14 releases, with very little in the way of major changes outside of complex numbers.


That sounds great to me. I look forward to 1.15 then!

Amazing news! Could you outline what this support will look like? It sounds like you are implementing it as a new datatype. Will there be built-in facilities for converting to/from existing conventions (like the {.r, .i} compound type used by h5py)?

Hi @peter.hill,

Indeed, after a lot of internal discussion it made sense to implement support as a new datatype class. The suggestion above, to implement support for attributes on in-memory datatypes and keep representing complex numbers as compound datatypes with attributes, makes a lot of sense, but that approach would have been a fair bit more work and didn't quite align with the timeline and goals of implementing complex number support. I'm hoping to work on improving the performance of compound datatype conversions in the near future to address some of the concerns around the compound datatype approach. At that point, some specific custom conversion routines should help with conversions between complex number representations until we can look into the compound datatype approach further.

I'm currently working out some last issues surrounding the datatype version encoding change I mentioned previously. I've added macros mapping to predefined HDF5 datatypes for the three native C complex number types (float/double/long double _Complex), as well as six macros for complex number types of the IEEE float formats: F16LE/BE, F32LE/BE and F64LE/BE. Support has been added to h5dump, h5ls and h5diff/ph5diff, which currently print values in "a+bi" format; this can be expanded on later, after the main code is merged. Note this is just test data, so the values aren't very interesting, but here's an example:

HDF5 "tcomplex.h5" {
DATASET "/DatasetFloatComplex" {
   DATATYPE  H5T_CPLX_IEEE_F32LE
   DATASPACE  SIMPLE { ( 10, 10 ) / ( 10, 10 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      SIZE 800
      OFFSET 2048
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  -1+1i
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_LATE
   }
   DATA {
   (0,0): 10+0i, 1+1i, 2+2i, 3+3i, 4+4i, 5+5i, 6+6i, 7+7i, 8+8i, 9+9i,
   (1,0): 9+0i, 1.1+1.1i, 2.1+2.1i, 3.1+3.1i, 4.1+4.1i, 5.1+5.1i, 6.1+6.1i,
   (1,7): 7.1+7.1i, 8.1+8.1i, 9.1+9.1i,
   (2,0): 8+0i, 1.2+1.2i, 2.2+2.2i, 3.2+3.2i, 4.2+4.2i, 5.2+5.2i, 6.2+6.2i,
   (2,7): 7.2+7.2i, 8.2+8.2i, 9.2+9.2i,
   (3,0): 7+0i, 1.3+1.3i, 2.3+2.3i, 3.3+3.3i, 4.3+4.3i, 5.3+5.3i, 6.3+6.3i,
   (3,7): 7.3+7.3i, 8.3+8.3i, 9.3+9.3i,
   (4,0): 6+0i, 1.4+1.4i, 2.4+2.4i, 3.4+3.4i, 4.4+4.4i, 5.4+5.4i, 6.4+6.4i,
   (4,7): 7.4+7.4i, 8.4+8.4i, 9.4+9.4i,
   (5,0): 5+0i, 1.5+1.5i, 2.5+2.5i, 3.5+3.5i, 4.5+4.5i, 5.5+5.5i, 6.5+6.5i,
   (5,7): 7.5+7.5i, 8.5+8.5i, 9.5+9.5i,
   (6,0): 4+0i, 1.6+1.6i, 2.6+2.6i, 3.6+3.6i, 4.6+4.6i, 5.6+5.6i, 6.6+6.6i,
   (6,7): 7.6+7.6i, 8.6+8.6i, 9.6+9.6i,
   (7,0): 3+0i, 1.7+1.7i, 2.7+2.7i, 3.7+3.7i, 4.7+4.7i, 5.7+5.7i, 6.7+6.7i,
   (7,7): 7.7+7.7i, 8.7+8.7i, 9.7+9.7i,
   (8,0): 2+0i, 1.8+1.8i, 2.8+2.8i, 3.8+3.8i, 4.8+4.8i, 5.8+5.8i, 6.8+6.8i,
   (8,7): 7.8+7.8i, 8.8+8.8i, 9.8+9.8i,
   (9,0): 1+0i, 1.9+1.9i, 2.9+2.9i, 3.9+3.9i, 4.9+4.9i, 5.9+5.9i, 6.9+6.9i,
   (9,7): 7.9+7.9i, 8.9+8.9i, 9.9+9.9i
   }
   ATTRIBUTE "AttributeFloatComplex" {
      DATATYPE  H5T_CPLX_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
      DATA {
      (0,0): -1+1i
      }
   }
}
}

Datatype conversions between all the usual C types (int, long, float, long double, etc.) have been added, including _Float16 when support for it is available in the library (though note those conversions may be a bit slower, since there's currently no standard C float16 complex number type).

For conversions between existing conventions, I've implemented no-op conversions as long as the data follows these rules (which can also be expanded upon as needed):

  • An array datatype must consist of exactly two elements where each element is of the
    same floating-point datatype as the complex number datatype's base floating-point
    datatype.

  • A compound datatype must consist of two fields where each field is of the same
    floating-point datatype as the complex number datatype's base floating-point
    datatype. The compound datatype must not have any leading or trailing structure
    padding or any padding between its two fields. The fields must also have compatible
    names, must have compatible offsets within the datatype and must be in the order
    of "real" part → "imaginary" part, such that the compound datatype matches the
    following representation:

    H5T_COMPOUND {
        <float_type> "r(e)(a)(l)";                OFFSET 0
        <float_type> "i(m)(a)(g)(i)(n)(a)(r)(y)"; OFFSET SIZEOF("r(e)(a)(l)")
    }
    

    where "r(e)(a)(l)" means the field may be named any substring of "real", such as
    "r" or "re", and "i(m)(a)(g)(i)(n)(a)(r)(y)" means the field may be named any
    substring of "imaginary", such as "im" or "imag".

I've confirmed the conversions work as expected with test data, but I'm also looking for any real h5py-written data files, just to make sure I'm not overlooking anything. Let me know if you can point me to some!

While I don't expect much to change, note that this is all subject to revision during review, and also if something about the complex type support doesn't work well for an application or is awkward to use.


Really incredible stuff, thank you so much!

Here's a real-world file generated using h5py:

scotty_output.nc (436.7 KB)

The variable /analysis/H_1_Cardano is complex.

Thanks for the file! I verified that the data can be read directly into a double _Complex buffer (on my machine) through the no-op conversion path, and the output matches the data in the file. Here's the simple C program I used, as an example of what this looks like once the changes are merged:

read_scotty_output.c (790 Bytes)

which gives the output:

DATA: [
(1+0i), 
(0.953434+4.33681e-19i), 
(0.909362+8.67362e-19i), 
(0.867738+0i), 
(0.828514-1.73472e-18i), 
(0.791641+0i), 
(0.757062-6.93889e-18i), 
(0.724719+3.46945e-18i), 
(0.694546+0i), 
(0.666468+3.46945e-18i), 
(0.640409-3.46945e-18i), 
(0.616281+6.93889e-18i), 
(0.593994+0i), 
(0.573452-6.93889e-18i), 
(0.554554+6.93889e-18i), 
(0.537199+6.93889e-18i), 
(0.521284+0i), 
(0.506711+1.38778e-17i), 
(0.493378+1.38778e-17i), 
(0.481193+0i), 
(0.470062+2.77556e-17i), 
(0.459902+0i), 
(0.450631+0i), 
(0.442174+1.38778e-17i), 
(0.434462+1.38778e-17i), 
(0.427432+0i), 
(0.421025+4.16334e-17i), 
(0.415189+1.38778e-17i), 
(0.409875-1.38778e-17i), 
(0.405041+1.38778e-17i), 
(0.400647+0i), 
(0.396658+2.77556e-17i), 
(0.393042-4.16334e-17i), 
(0.38977+0i), 
(0.386818-2.77556e-17i), 
(0.384161-2.77556e-17i), 
(0.381781+0i), 
(0.379658-2.77556e-17i), 
(0.377775+2.77556e-17i), 
(0.37612+5.55112e-17i), 
(0.374679+0i), 
(0.373441+0i), 
(0.372396-2.77556e-17i), 
(0.371536+0i), 
(0.370853+0i), 
(0.37034+0i), 
(0.369992+0i), 
(0.369806-2.77556e-17i), 
(0.369776+0i), 
(0.369901+5.55112e-17i), 
(0.370023+2.77556e-17i), 
(0.370178+0i), 
(0.370607+2.77556e-17i), 
(0.371187-2.77556e-17i), 
(0.371918+0i), 
(0.372802-2.77556e-17i), 
(0.373841+0i), 
(0.375038+0i), 
(0.376397-2.77556e-17i), 
(0.377922-2.77556e-17i), 
(0.379619-5.55112e-17i), 
(0.381496+0i), 
(0.38356-2.77556e-17i), 
(0.38582+0i), 
(0.388287+0i), 
(0.390973-2.77556e-17i), 
(0.393891+0i), 
(0.397058+0i), 
(0.400492+0i), 
(0.404213+0i), 
(0.408243+0i), 
(0.412609+0i), 
(0.417338+0i), 
(0.422463+4.16334e-17i), 
(0.428021+0i), 
(0.434052-4.16334e-17i), 
(0.440602+1.38778e-17i), 
(0.447722+0i), 
(0.45547+1.38778e-17i), 
(0.46391+1.38778e-17i), 
(0.473114+2.77556e-17i), 
(0.483162+1.38778e-17i), 
(0.494143+0i), 
(0.506153+4.16334e-17i), 
(0.519302-1.38778e-17i), 
(0.533707+0i), 
(0.549494-2.77556e-17i), 
(0.566803+1.38778e-17i), 
(0.585777+0i), 
(0.606571-1.38778e-17i), 
(0.629344-2.08167e-17i), 
(0.654258-1.38778e-17i), 
(0.681478-6.93889e-18i), 
(0.711165+0i), 
(0.743478+0i), 
(0.778569+3.46945e-18i), 
(0.81658+1.73472e-18i), 
(0.857645-1.73472e-18i), 
(0.901887+0i), 
(0.949421+0i)
]