HDF5 RFC: Adding support for 16-bit floating point and Complex number datatypes to HDF5

A complex (or in general, vector) data type would only be an array data type if it’s homogeneous. If you want to store the real part in double precision and the imaginary part in single precision, that is no longer the case. Stored as compound data types, HDF5 can automatically convert such a mixed data type into a pair of double-precision or single-precision values. This is a great, already existing feature, and with mixed color-space data types like 10 bits each for red, green, blue plus 2 bits for alpha, it becomes even more inhomogeneous. Re-implementing these data transformations via a new native datatype, with all those possible combinations, can that ever be effective?

The only issue seems to be that having a float x,y,z; compound type in a file and reading it as a double x,y,z; compound data type is currently very slow, while storing data as float[3] and reading it as double[3] is much faster. But is there a reason why this could not be optimized to the same performance? The library should be able to detect that float x,y,z; is in-memory compatible with float[3] and then provide the same performance, leaving the “slow”, generic conversion to the more exotic cases.
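
For illustration, the two descriptions being compared could be built like this with the H5T API (the struct name vec3_t is just a placeholder); both describe the same 12 bytes of packed floats in memory, so one would hope the conversion cost could be made the same:

#include "hdf5.h"

/* Placeholder struct name; both type descriptions below cover the same
 * 12 bytes of three packed floats in memory. */
typedef struct { float x, y, z; } vec3_t;

static void make_vec3_types(hid_t *compound_out, hid_t *array_out)
{
    /* Compound description: {float x, y, z;} */
    hid_t ctype = H5Tcreate(H5T_COMPOUND, sizeof(vec3_t));
    H5Tinsert(ctype, "x", HOFFSET(vec3_t, x), H5T_NATIVE_FLOAT);
    H5Tinsert(ctype, "y", HOFFSET(vec3_t, y), H5T_NATIVE_FLOAT);
    H5Tinsert(ctype, "z", HOFFSET(vec3_t, z), H5T_NATIVE_FLOAT);

    /* Array description: float[3] */
    hsize_t dims[1] = {3};
    hid_t atype = H5Tarray_create2(H5T_NATIVE_FLOAT, 1, dims);

    *compound_out = ctype;
    *array_out    = atype;
}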

As a compound data type, HDF5 files remain very well self-describing, provided a meaningful semantics is used to define the compound data types. A while ago I wrote a paper discussing how data types such as those used in Geometric Algebra (the generalization of complex numbers and quaternions to arbitrary dimensions) and tensors (as used in General Relativity) can be modelled via HDF5: http://sciviz.cct.lsu.edu/papers/2009/GraVisMa09.pdf (skip forward to section 5.2 if you want to skip the math / physics part). Using named data types with attributes proves quite useful there; the only technical difficulty is that they can only be used on data types bound to a specific file, rather than applied to in-memory data types. This makes I/O on such data types rather laborious - if this could be remedied such that in-memory data types can also hold attributes, it would address that issue well and save a lot of trouble. Attributes on data types are much more powerful than simple tagging (and they are already supported in HDF5 files).
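
As a rough sketch of that pattern - committing a compound type and hanging an attribute off it - with purely illustrative names and error checking omitted:

#include "hdf5.h"

typedef struct { double x, y; } point2d_t;   /* illustrative struct name */

static void commit_vector_type(hid_t file)
{
    hid_t tid = H5Tcreate(H5T_COMPOUND, sizeof(point2d_t));
    H5Tinsert(tid, "x", HOFFSET(point2d_t, x), H5T_NATIVE_DOUBLE);
    H5Tinsert(tid, "y", HOFFSET(point2d_t, y), H5T_NATIVE_DOUBLE);

    /* Commit (name) the datatype in the file; only committed datatypes
     * can currently carry attributes. */
    H5Tcommit2(file, "cartesian2d", tid, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Attach a small attribute holding optional semantic information. */
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t str   = H5Tcopy(H5T_C_S1);
    H5Tset_size(str, H5T_VARIABLE);
    hid_t attr  = H5Acreate2(tid, "coordinate_system", str, space,
                             H5P_DEFAULT, H5P_DEFAULT);
    const char *value = "cartesian";   /* illustrative attribute content */
    H5Awrite(attr, str, &value);

    H5Aclose(attr); H5Tclose(str); H5Sclose(space); H5Tclose(tid);
}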

Just to mention Geometric Algebra here: only with this framework would you have a complete set of operations on vector data types in arbitrary dimensions. Quaternions are a subset of Geometric Algebra in 3D; complex numbers are the 2D case. However, there are different “flavors” of Geometric Algebra; some fields in computer graphics use conformal geometric algebra in five dimensions to quite some success. It would be too much for HDF5 to support all flavors “natively”, but it would totally make sense for HDF5 to support a basis framework on which such flavors can build, with complex numbers just being the start of it, rather than an independent, special case. There are also alternatives to Geometric Algebra when it comes to generalizing complex numbers and quaternions (like octonions, septenions, …).

As for using “tags” - this sounds like a terrible idea to me, because then you end up having users parse the string that is stored in a tag. Tagging makes sense for opaque data types, but with a compound data type we already have the capability to “parse” a data type description; in fact, it is already “parsed” as a compound. All that is needed is some generally agreed-on convention to say something like “any data type that contains x,y,z components (of whatever type) is a cartesian vector” or “any data type that contains x,y components is a 2D cartesian vector”. That also allows - via compound types - reading a 3D cartesian vector as a 2D cartesian vector, even if it is stored as “z,x,y” or in whatever order. This is a great feature that HDF5 already provides. What may well be desirable (besides the mentioned performance optimizations) is the ability to define transformation “filters” that allow interpreting a {float x,y;} compound as a {float re,im;} structure (i.e., defining them as equivalent), or even performing actual coordinate transformations {float x,y;} → {float r, phi;}, as mentioned for complex numbers as well. The implementation of such transformations could go well into an add-on library, rather than into core HDF5, which should only handle the infrastructure for defining data types (possibly with attributes on transient types), equivalence relationships, and transformation rules.


I am very happy to see this RFC. I am interested in complex number support and am happy with the proposed approach.

My only comment is to consider having the HDF5 tools output complex numbers in x+yi or x+yj format, which is used in various places. However, the only objection I have to the proposed Complex { real: imag: } format is its verbosity; I am very much bikeshedding, and will happily accept this format if it leads to a speedy adoption of the RFC.

I can at least answer a few of these questions right away, while leaving others open-ended for now and keeping them under consideration:

  • I wonder about supporting full quad precision? I have seen more of that lately in HPC/CSE than I recall seeing in the past.

I see no issue in general with supporting quad precision types in HDF5; it mostly comes down to how well supported they are across platforms and compilers. If there are a variety of platform/compiler combinations that don’t have good support, it becomes a bit more painful to support in the library, especially when it comes to the type conversion code.

  • Would it make sense to make these more exotic/advanced types optional? I mean, is there anything to be gained for average users (common cases) where they do not have to deal with these types in their builds? Probably not but it doesn’t hurt to ask

Since the library keeps around global H5T_t structures in memory for each of the native and standard types available, as well as the library IDs for those types, the main benefit of the types being optional would seem to be less pressure on memory and available library IDs, though I’d expect that to be minimal (sizeof(H5T_t) = 104, currently). However, I could certainly see us just making templates available for creating/using some of these types as an alternative, likely through the HL interface.

  • Given the common ways others have already tried to support complex types, would it make sense to offer some HL functionality that makes it easy for them to convert (on read or on write) between these common cases and the new native types? Maybe some extra “filters” for this? I dunno

I’d certainly like to make it as easy as possible to convert from existing data into these types (and vice-versa) if we go this way with them. The thought is to provide datatype conversion paths that will accept a compound type in any of the forms used in the nc-complex doc mentioned in this thread, and directly convert between that and the C99 types, since they should be memory compatible.

  • Does it make sense for HDF5 to impose a basis (such as real/imag) in the file or does it make sense for HDF5 to allow whichever bases make sense and something in the file indicates which the basis is?

When writing the RFC, I was mostly sticking exactly to the C standard on complex types (with the ordering being real → imag) to give the best opportunity for efficient datatype conversions between the on-disk and in-memory formats. But as the discussion in this thread seems to be moving more toward thoughts around general vector types, I do think the ability to specify ordering for the vector components is likely going to be needed for those types. I’m still of the inclination to have HDF5’s support for complex numbers map directly to the C types for portability reasons, but this thread is giving me a lot of things to think about around future support for different datatypes.

  • Will you support any primitive (int, float) type as the base type for complex numbers? For example, can you have complex numbers with 16 bit floats for the components or long doubles?

I’m imagining a new H5T API for creating complex numbers, say H5Tcomplex_create, that takes a base datatype ID as a parameter. For simplicity, the three main types of float, double and long double would likely be the only supported types at first, but this should allow for expansion in the future. For portability reasons it’s easier to support just these three at the moment, but I know GCC supports other base types, for example. I haven’t researched this for other compilers yet.
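
To make the idea concrete, usage of such a routine might look roughly like the following; the routine name and exact behavior are not final, so this is purely illustrative:

/* Hypothetical usage of the imagined H5Tcomplex_create routine; the name
 * and semantics are still under discussion. */
hid_t cplx_f32 = H5Tcomplex_create(H5T_NATIVE_FLOAT);   /* float _Complex  */
hid_t cplx_f64 = H5Tcomplex_create(H5T_NATIVE_DOUBLE);  /* double _Complex */

/* ... create datasets/attributes with these type IDs as usual ... */

H5Tclose(cplx_f64);
H5Tclose(cplx_f32);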

The rest of your questions are certainly among the things that need to be considered for this functionality, though again I’ve only been thinking of complex numbers in terms of the C standard for wide portability across platforms and languages. There’s definitely a broader discussion to be had here around general vector types.


A complex (or in general, vector) data type would only be an array data type if it’s homogeneous.

Indeed, I should say that the intent has been for initial complex number support in HDF5 to map directly to the C standard for portability reasons, in which case the type would be homogeneous and both components of the complex type would be stored with the same precision. Though, based on your insights and comments from @miller86 as well, I see that there’s a larger discussion to be had around creating a basis for more general vector types.

Re-implementing these data transformations via a new native datatype, with all those possible combinations, can that ever be effective?

I’d think not, and would like to avoid running up against the combinatorial explosion there; but I also wasn’t (based on my previous comment) thinking of supporting vector values in a more general fashion. The only reasonable way to properly support general vector types would seem to be with compound datatypes rather than a new native type. The reason why complex numbers are being considered as a special case here stems mostly from their direct support in the C standard. If they are implemented as a native type, conversion routines would be written for each of the more interesting combinations, but an HDF5 user could also always define their own conversion routine if they have a specific case they want handled efficiently. Granted, it is of course much more convenient if HDF5 can do this automatically.
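
For reference, a user-defined conversion with the existing API might look roughly like the sketch below; it converts a packed {float re, im;} compound to a {double re, im;} compound, ignores buf_stride and background buffers for simplicity, and assumes the usual packed, in-place conversion buffer handed to conversion functions:

#include <string.h>
#include "hdf5.h"

typedef struct { float  re, im; } fpair_t;   /* illustrative source layout      */
typedef struct { double re, im; } dpair_t;   /* illustrative destination layout */

/* Convert packed fpair_t elements to dpair_t in place. Because the
 * destination is larger than the source, walk backwards so unread source
 * elements are not overwritten. */
static herr_t
conv_fpair_to_dpair(hid_t src_id, hid_t dst_id, H5T_cdata_t *cdata,
                    size_t nelmts, size_t buf_stride, size_t bkg_stride,
                    void *buf, void *bkg, hid_t dxpl_id)
{
    (void)src_id; (void)dst_id; (void)buf_stride; (void)bkg_stride;
    (void)bkg; (void)dxpl_id;

    switch (cdata->command) {
        case H5T_CONV_INIT:   /* could verify the source/destination layouts here */
        case H5T_CONV_FREE:   /* nothing to clean up in this sketch */
            break;
        case H5T_CONV_CONV: {
            unsigned char *base = (unsigned char *)buf;
            for (size_t i = nelmts; i > 0; i--) {
                fpair_t src;
                dpair_t dst;
                memcpy(&src, base + (i - 1) * sizeof(fpair_t), sizeof(src));
                dst.re = src.re;
                dst.im = src.im;
                memcpy(base + (i - 1) * sizeof(dpair_t), &dst, sizeof(dst));
            }
            break;
        }
        default:
            return -1;
    }
    return 0;
}

/* Registered for a specific (src, dst) compound type pair:
 *   H5Tregister(H5T_PERS_HARD, "fpair->dpair", src_tid, dst_tid,
 *               conv_fpair_to_dpair);
 */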

The only issue seems to be that having a float x,y,z; compound type in a file and reading it as a double x,y,z; compound data type is currently very slow, while storing data as float[3] and reading it as double[3] is much faster. But is there a reason why this could not be optimized to the same performance? The library should be able to detect that float x,y,z; is in-memory compatible with float[3] and then provide the same performance, leaving the “slow”, generic conversion to the more exotic cases.

Part of this work would involve fixing up the library to recognize some of these types of situations, regardless of the final on-disk format for complex number types, since there is of course an advantage to being able to directly convert from a float x, y; compound form to the C99 float _Complex type that’s just the same as float[2]. Converting from float x, y, z; to float[3] and converting from float x, y, z; to double[3] are just extensions of that, and could likely fall within the scope of this work.
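
The memory compatibility can be checked directly in C, since C99 specifies that a complex type has the same representation and alignment as an array of two elements of the corresponding real type (the struct here assumes the usual no-padding layout for two floats):

#include <complex.h>

typedef struct { float x, y; } fxy_t;   /* no padding expected for two floats */

_Static_assert(sizeof(fxy_t)          == sizeof(float[2]), "compound == float[2]");
_Static_assert(sizeof(float _Complex) == sizeof(float[2]), "C99 complex == float[2]");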

As a compound data type, HDF5 files remain very well self-describing, provided a meaningful semantics is used to define the compound data types. A while ago I wrote a paper discussing how data types such as those used in Geometric Algebra (the generalization of complex numbers and quaternions to arbitrary dimensions) and tensors (as used in General Relativity) can be modelled via HDF5: http://sciviz.cct.lsu.edu/papers/2009/GraVisMa09.pdf (skip forward to section 5.2 if you want to skip the math / physics part). Using named data types with attributes proves quite useful there; the only technical difficulty is that they can only be used on data types bound to a specific file, rather than applied to in-memory data types. This makes I/O on such data types rather laborious - if this could be remedied such that in-memory data types can also hold attributes, it would address that issue well and save a lot of trouble. Attributes on data types are much more powerful than simple tagging (and they are already supported in HDF5 files).

Thanks for the link to the paper! It’s always interesting reading about how people are using HDF5. Of course it seems that the issue lies in “provided a meaningful semantics is used to define the compound data types”. More discussion below comparing the attribute approach and possible alternatives…

Just to mention Geometric Algebra here: only with this framework would you have a complete set of operations on vector data types in arbitrary dimensions. Quaternions are a subset of Geometric Algebra in 3D; complex numbers are the 2D case. However, there are different “flavors” of Geometric Algebra; some fields in computer graphics use conformal geometric algebra in five dimensions to quite some success. It would be too much for HDF5 to support all flavors “natively”, but it would totally make sense for HDF5 to support a basis framework on which such flavors can build, with complex numbers just being the start of it, rather than an independent, special case. There are also alternatives to Geometric Algebra when it comes to generalizing complex numbers and quaternions (like octonions, septenions, …).

I must admit that I was missing the background on these higher order types, but it makes perfect sense that we want to consider building the framework for supporting these. From an API perspective, it seems like the H5Tcomplex_create API routine mentioned above could be a special case of a more general one for higher order types, but those are just my initial musings on the topic.

As for using “tags” - this sounds like a terrible idea to me, because then you end up having users parse the string that is stored in a tag.

Forgive me if this is a naive interpretation, but isn’t this similar to the approach discussed in your paper, in that one would need to parse information from the attributes attached to the named datatype used for data objects? If that’s the case, one could argue that there’s no need to add an alternative approach when yours suffices, but I’d say that it seems more natural to get information about a datatype directly from that type (e.g., through H5T API routines), rather than needing to store the type in the file and retrieve information about it through the H5A API. Yes, one would need to parse the tag associated with a compound type, but the idea would be that HDF5 could have a standard set of tag strings for various standardized data formats.

I can see how that doesn’t exactly generalize well compared to your approach with attributes though. Perhaps if there are common bits of information that you’re currently storing in attributes attached to named datatypes (such as the attribute shown in your paper that contains the dimensions of the data), these can be accounted for with an H5T API to make retrieving that information a little more straightforward? There would likely need to be some file format changes or adaptations so that this information could be stored in the file with a datatype structure, but I think that could also make it reasonable for that information to be associated with in-memory datatypes. Again though, it would probably be a bit less flexible than the attribute approach.

All that is needed is some generally agreed-on convention to say something like “any data type that contains x,y,z components (of whatever type) is a cartesian vector” or “any data type that contains x,y components is a 2D cartesian vector”.

Agreed that we need some sort of convention for representing a standardized type through use of a compound datatype, which is what I was (perhaps poorly) attempting to work towards with compound type tags. Though it may have no basis in reality, my concern around choosing “a compound type with members of this type named in this way” as a convention is that we may risk stepping on the toes of other standardized type formats in the future if they have similarly named compound members. The tag approach just avoids placing importance on the naming of compound type members.

What may well be desirable (besides the mentioned performance optimizations) is the ability to define transformation “filters” that allow interpreting a {float x,y;} compound as a {float re,im;} structure (i.e., defining them as equivalent), or even performing actual coordinate transformations {float x,y;} → {float r, phi;}, as mentioned for complex numbers as well. The implementation of such transformations could go well into an add-on library, rather than into core HDF5, which should only handle the infrastructure for defining data types (possibly with attributes on transient types), equivalence relationships, and transformation rules.

I think moving the implementations for these transformations elsewhere, either into a separate library or perhaps in HDF5’s high-level library, is an interesting idea worth exploring and could allow for much more specialized implementations where possible.

Having said all that, thank you for the great points for consideration!

My only comment is to consider having the HDF5 tools output complex numbers in x+yi or x+yj format, which is used in various places. However, the only objection I have to the proposed Complex { real: imag: } format is its verbosity; I am very much bikeshedding, and will happily accept this format if it leads to a speedy adoption of the RFC.

Sure, we definitely aren’t settled on any particular format currently, so if people feel this is a better and more useful representation then I believe we’d be happy to use it. I mostly just want to see if there are alternate forms people may wish to see other than x+yi.

Sorry for the double post, but I wanted it to be in this thread, too.

Audio only podcast version here.

Just to provide a bit of background on the higher order types, particularly the extension of complex numbers to arbitrary dimensions: for complex numbers, you have members like {re,im}. If you generalize this to a complex number of complex numbers, it becomes a four-component compound like {re.re, re.im, im.re, im.im}. That naming scheme does not make it easy to understand what those components actually mean. Interpreting those four components as a quaternion, they would be named something like { scalar, i,j,k }, where “scalar” corresponds to the real part and i,j,k are the three “imaginary” components.

In the framework of Geometric Algebra we would name those components like this:

2D: scalar, x^y → complex number
3D: scalar, x^y, y^z, z^x → quaternion
4D: scalar, x^y, y^z, z^x, x^t, y^t, z^t → bi-quaternion
5D: …

which extends to arbitrary dimensions, and those vector types are subsets of the most general so-called multivector, which has 2^n components in its full form:

2D: scalar, x, y, x^y
3D: scalar, x,y,z, x^y, y^z, z^x, x^y^z
4D: scalar, x,y,z,t, x^y, y^z, z^x, x^t, y^t, z^t, x^y^z, x^y^t, y^z^t, z^x^t, x^y^z^t
5D: …

There is quite some work on using geometric algebra in 5D, e.g. https://geometricalgebra.org/ .

This is merely a demonstration of a generalized framework, in which both quaternions and complex numbers are just special, low-dimensional cases. Of course, there can still be different conventions on how to name the basis, e.g. some prefer to use e_1, e_2, e_3 instead of x,y,z for even more generalization beyond Euclidean space. Also, there are other generalizations besides Geometric Algebra.

The cool thing about using such a scheme with compound types in HDF5 is that we could always read a lower-dimensional subset automatically from a higher-dimensional dataset. For instance, if a dataset is stored as a 3D quaternion, but one application only wants to read the 2D subset {scalar, x^y} out of the full {scalar, x^y, y^z, z^x} dataset, then this already works right away with HDF5 as it is, with no changes needed to the library (beyond the performance issues). It would certainly be cool if the envisioned “future” complex numbers allow the same functionality.
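
A minimal sketch of such a subset read (the dataset handle, member names and element count are illustrative; error checking omitted):

#include <stdlib.h>
#include "hdf5.h"

/* Memory layout for just the 2D subset {scalar, x^y}. */
typedef struct { double scalar, xy; } rotor2d_t;

static rotor2d_t *read_2d_subset(hid_t dset, hsize_t npoints)
{
    hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(rotor2d_t));
    H5Tinsert(mtype, "scalar", HOFFSET(rotor2d_t, scalar), H5T_NATIVE_DOUBLE);
    H5Tinsert(mtype, "x^y",    HOFFSET(rotor2d_t, xy),     H5T_NATIVE_DOUBLE);

    /* The file type may be the full {scalar, x^y, y^z, z^x} quaternion;
     * HDF5 matches compound members by name and skips the rest. */
    rotor2d_t *data = malloc((size_t)npoints * sizeof(*data));
    H5Dread(dset, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Tclose(mtype);
    return data;
}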

The essential information in the approach from the paper is not in the attributes, but in the names of the components. So one just needs to know whether there is an “x” and a “y” component in the data type to find out that this dataset contains a 2D vector. This approach does not require any parsing, only checking the properties of the named datatype with the HDF5 API and comparing the names of the components. So this is done entirely through the H5T API.
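
For example, identifying a 2D cartesian vector this way takes only a few H5T introspection calls (a minimal sketch):

#include <string.h>
#include "hdf5.h"

/* Returns 1 if the compound type has members named "x" and "y";
 * no tag strings and no parsing involved. */
static int is_cartesian_2d(hid_t tid)
{
    if (H5Tget_class(tid) != H5T_COMPOUND)
        return 0;

    int found_x = 0, found_y = 0;
    int nmembers = H5Tget_nmembers(tid);
    for (int i = 0; i < nmembers; i++) {
        char *name = H5Tget_member_name(tid, i);
        if (strcmp(name, "x") == 0) found_x = 1;
        if (strcmp(name, "y") == 0) found_y = 1;
        H5free_memory(name);
    }
    return found_x && found_y;
}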

Attributes on the named datatypes via the H5A API carry optional information - such as a reference to the coordinate system related to those x,y coordinates, or its origin, for example. This information is not essential for finding out what type that field is; it is only needed later, for instance to determine how to perform coordinate transformations (or to record physical units). So, this also does not require any string parsing.

Also, the benefit of not having tags, or a fixed list of strings, is that any {x,y,z} data type can be identified as an {x,y} data type subset automatically. Reading a semantically compatible subset comes “for free”. If one really wants to distinguish x,y,z from x,y, then one can still read and compare the number of components.

There are properties such as “grade”, “covariance”, “contravariance” and the like that come with general tensor-like quantities. These may not always be required to read and operate on a dataset, so they can be “optional” in this sense and placed into attributes. It is fine not to read them if they are not needed. But definitely, it would be great if attributes could also be stored on transient, in-memory data types! Currently it is an inconvenient limitation that only named, on-disk data types can store attributes. However, such an extension of the HDF5 library might not even require any changes to the H5T API; it would just mean that things work that currently do not (i.e., attributes on transient data types).

Hm, I would just favor that - placing importance on the naming of compound type members - rather than changing the HDF5 API and the file format. It should be acceptable to introduce a convention that e.g. compound member names starting with _ or H5__ are reserved, similar to how the C standard reserves identifiers that start with _. Or, maybe at most, introduce a bool flag in the H5T API telling whether a data type is an “HDF5 data type” versus a user-defined data type?

And support for the myriad coordinate transformations out there would be great for the geosciences! But definitely, you can’t have them all implemented in the core library; those should rather go through a plugin system similar to the compression filters.

I think this could be handled with a command line option --imaginary-unit. The default would be i but that could be changed to j with --imaginary-unit j.

Aleksandar


Thanks! I opened the referenced github issue, and I’m happy to see this RFC. I read the PDF and skimmed this discussion, though I haven’t watched the video.

While the RFC implements 16-bit floats and complex numbers, it doesn’t appear to implement them in combination. I’m not sure about C standard compliance, but GCC at least lets me write _Float16 _Complex (example). C++23 has a 16-bit float type, and it seems to play nice with the complex type, so I can write std::complex<std::float16_t>.

I also have use for even weirder types like struct { uint16_t r, i; } to store instrument data that I don’t expect to have a corresponding native HDF5 type. Like with quaternions, I think it makes sense to draw the line at types that have support in language standards.

Overall, I’m quite happy with the RFC since it offers a fast code path for float16-to-float32 conversions, which was the biggest pain point for me. I’m also glad to see HDF5 types for the most common complex types.

P.S. I should also mention that, for the moment, I’ve worked around the float16 bottleneck by storing data in float32 and zeroing out the least significant mantissa bits. Then I write the dataset with a gzip compression filter and get pretty compact storage that is efficient to read.
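
For reference, the bit-zeroing part of that workaround might look roughly like this (keeping 10 explicit mantissa bits loosely mimics half precision; the helper name is made up):

#include <stdint.h>
#include <string.h>

/* Zero the low mantissa bits of a float so gzip can compress the data
 * nearly as well as a true 16-bit float would; a float has 23 explicit
 * mantissa bits, so keep_bits = 10 clears the lowest 13 bits. */
static float truncate_mantissa(float v, unsigned keep_bits)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof(bits));
    bits &= ~((UINT32_C(1) << (23 - keep_bits)) - 1);   /* clear low bits */
    memcpy(&v, &bits, sizeof(v));
    return v;
}

Writing the truncated values through a chunked dataset with H5Pset_deflate set on the dataset creation property list then gives the compact but fast-to-read storage described above.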

While the RFC implements 16-bit floats and complex numbers, it doesn’t appear to implement them in combination.

It should be fairly straightforward to support this after support for 16-bit floats is done, but it does add to the testing matrix and makes things a bit messy since the _Float16 type isn’t part of the main C standard, so we have to use the type conditionally in the library. I’m open to the idea though.

Overall, I’m quite happy with the RFC since it offers a fast code path for float16-to-float32 conversions, which was the biggest pain point for me.

With these conversion paths now in place, the conversion time appears to be a bit less than half of what it was, but it’s still around 8x slower than the other parts of your C example because conversions on compound datatypes end up being slow due to repeated ID lookups; conversion between flat 16-bit and 32-bit floating point types is fast though. I’m hoping to be able to optimize these ID lookups as part of implementing support for the complex numbers.

Hello, I’m checking back in to ask what the status of this RFC is. I’ve just come up with a use case for 16-bit floating point, so I’m now looking forward to both features.

Oh I missed the announcement. Thanks! So float16 is there but not complex? Is there a schedule for adding complex support?

Hi @tobias,

I actually plan to have a PR out for adding complex number support to the develop branch by the end of this week, or maybe a few days after if I run into issues. We’re currently discussing what the release of the feature will look like since it will need to go into a major release of HDF5 due to changes in the datatype encoding version number. While not exactly a file format change, the issue is that if the feature goes into a 1.14 release, there would be an awkward situation where users could accidentally create complex number datasets that can’t be read by a previous release of 1.14. The library version bounds “high” setting also wouldn’t be able to prevent the application from creating an object that’s unreadable with even older versions of the library. We plan to try to make the next major release of HDF5 as easy as possible to upgrade to from the 1.14 releases, with very little in the way of major changes outside of complex numbers.


That sounds great to me. I look forward to 1.15 then!

Amazing news! Could you outline what this support will look like? It sounds like you are implementing it as a new datatype. Will there be built-in facilities for converting to/from existing conventions (like the {.r, .i} compound type used by h5py)?

Hi @peter.hill,

Indeed it made sense after a lot of internal discussion to implement support as a new datatype class. While the suggestion above to implement support for attributes on in-memory datatypes and keep representing complex numbers as compound datatypes with attributes makes a lot of sense, I believe that approach would have been a decent bit more work than this approach and it didn’t quite align with the timeline and goals of implementing support for complex numbers. I’m hoping to work on improving the performance of compound datatype conversions in the near future to address some of the concerns around the compound datatype approach. At that point, some specific custom conversion routines should be able to help with conversions between complex number representations until we can maybe look into the compound datatype approach more in the future.

I’m currently working out some last issues surrounding the datatype version encoding change I mentioned previously. I’ve added macros mapping to predefined HDF5 datatypes for the 3 native C complex number types (float/double/long double _Complex), as well as 6 macros for complex number types over IEEE float formats - F16LE/BE, F32LE/BE and F64LE/BE. Support has been added to h5dump, h5ls and h5diff/ph5diff, and they currently print values using the “a+bi” format, but this can be expanded on later after the main code is merged. Note this is just test data, so the values aren’t very interesting, but here’s an example:

HDF5 "tcomplex.h5" {
DATASET "/DatasetFloatComplex" {
   DATATYPE  H5T_CPLX_IEEE_F32LE
   DATASPACE  SIMPLE { ( 10, 10 ) / ( 10, 10 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      SIZE 800
      OFFSET 2048
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  -1+1i
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_LATE
   }
   DATA {
   (0,0): 10+0i, 1+1i, 2+2i, 3+3i, 4+4i, 5+5i, 6+6i, 7+7i, 8+8i, 9+9i,
   (1,0): 9+0i, 1.1+1.1i, 2.1+2.1i, 3.1+3.1i, 4.1+4.1i, 5.1+5.1i, 6.1+6.1i,
   (1,7): 7.1+7.1i, 8.1+8.1i, 9.1+9.1i,
   (2,0): 8+0i, 1.2+1.2i, 2.2+2.2i, 3.2+3.2i, 4.2+4.2i, 5.2+5.2i, 6.2+6.2i,
   (2,7): 7.2+7.2i, 8.2+8.2i, 9.2+9.2i,
   (3,0): 7+0i, 1.3+1.3i, 2.3+2.3i, 3.3+3.3i, 4.3+4.3i, 5.3+5.3i, 6.3+6.3i,
   (3,7): 7.3+7.3i, 8.3+8.3i, 9.3+9.3i,
   (4,0): 6+0i, 1.4+1.4i, 2.4+2.4i, 3.4+3.4i, 4.4+4.4i, 5.4+5.4i, 6.4+6.4i,
   (4,7): 7.4+7.4i, 8.4+8.4i, 9.4+9.4i,
   (5,0): 5+0i, 1.5+1.5i, 2.5+2.5i, 3.5+3.5i, 4.5+4.5i, 5.5+5.5i, 6.5+6.5i,
   (5,7): 7.5+7.5i, 8.5+8.5i, 9.5+9.5i,
   (6,0): 4+0i, 1.6+1.6i, 2.6+2.6i, 3.6+3.6i, 4.6+4.6i, 5.6+5.6i, 6.6+6.6i,
   (6,7): 7.6+7.6i, 8.6+8.6i, 9.6+9.6i,
   (7,0): 3+0i, 1.7+1.7i, 2.7+2.7i, 3.7+3.7i, 4.7+4.7i, 5.7+5.7i, 6.7+6.7i,
   (7,7): 7.7+7.7i, 8.7+8.7i, 9.7+9.7i,
   (8,0): 2+0i, 1.8+1.8i, 2.8+2.8i, 3.8+3.8i, 4.8+4.8i, 5.8+5.8i, 6.8+6.8i,
   (8,7): 7.8+7.8i, 8.8+8.8i, 9.8+9.8i,
   (9,0): 1+0i, 1.9+1.9i, 2.9+2.9i, 3.9+3.9i, 4.9+4.9i, 5.9+5.9i, 6.9+6.9i,
   (9,7): 7.9+7.9i, 8.9+8.9i, 9.9+9.9i
   }
   ATTRIBUTE "AttributeFloatComplex" {
      DATATYPE  H5T_CPLX_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 1 ) / ( 1, 1 ) }
      DATA {
      (0,0): -1+1i
      }
   }
}
}

Datatype conversions between all the usual C types (int, long, float, long double, etc.) have been added, including _Float16 when support for it is available in the library (though note conversions may be a bit slower since there’s no standard C float16 complex number type currently).

For the conversions between existing conventions, I’ve implemented no-op conversions as long as the data follows these rules (which can also be expanded upon as needed):

  • An array datatype must consist of exactly two elements where each element is of the
    same floating-point datatype as the complex number datatype’s base floating-point
    datatype.

  • A compound datatype must consist of two fields where each field is of the same
    floating-point datatype as the complex number datatype’s base floating-point
    datatype. The compound datatype must not have any leading or trailing structure
    padding or any padding between its two fields. The fields must also have compatible
    names, must have compatible offsets within the datatype and must be in the order
    of “real” part → “imaginary” part, such that the compound datatype matches the
    following representation:

    H5T_COMPOUND {
        <float_type> "r(e)(a)(l)";                OFFSET 0
        <float_type> "i(m)(a)(g)(i)(n)(a)(r)(y)"; OFFSET SIZEOF("r(e)(a)(l)")
    }
    

    where “r(e)(a)(l)” means the field may be named any substring of “real”, such as
    “r” or “re”, and “i(m)(a)(g)(i)(n)(a)(r)(y)” means the field may be named any
    substring of “imaginary”, such as “im” or “imag”. A minimal sketch of a compound
    type matching these rules follows after this list.
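
For illustration, a compound type built like the following sketch should match those rules and take the no-op path; the member names “r” and “i” are the ones h5py uses, while the struct name is made up:

#include "hdf5.h"

typedef struct { float r, i; } h5py_complex_t;   /* illustrative name */

/* Packed two-float compound, fields in real -> imaginary order, with
 * names that are substrings of "real"/"imaginary". */
static hid_t make_h5py_style_complex(void)
{
    hid_t tid = H5Tcreate(H5T_COMPOUND, sizeof(h5py_complex_t));
    H5Tinsert(tid, "r", HOFFSET(h5py_complex_t, r), H5T_NATIVE_FLOAT);
    H5Tinsert(tid, "i", HOFFSET(h5py_complex_t, i), H5T_NATIVE_FLOAT);
    return tid;
}

An array type of two floats created with H5Tarray_create2 would similarly satisfy the first rule.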

I’ve confirmed the conversions work as expected with test data, but I’m also looking for any real h5py-written data files just to make sure I’m not overlooking anything. Let me know if you can point me to some!

While I don’t expect much will change, note that this is all subject to change with review and also if there’s something about support for complex types that doesn’t work well for an application or is awkward to use.


Really incredible stuff, thank you so much!

Here’s a real world file generated using h5py:

scotty_output.nc (436.7 KB)

The variable /analysis/H_1_Cardano is complex.

Thanks for the file! I verified that the data can be read directly into a double _Complex buffer (on my machine) through the no-op conversion path and the output matches the data in the file. Here’s the simple C program I used for an example of what this looks like after the changes are merged:

read_scotty_output.c (790 Bytes)

which gives the output:

DATA: [
(1+0i), 
(0.953434+4.33681e-19i), 
(0.909362+8.67362e-19i), 
(0.867738+0i), 
(0.828514-1.73472e-18i), 
(0.791641+0i), 
(0.757062-6.93889e-18i), 
(0.724719+3.46945e-18i), 
(0.694546+0i), 
(0.666468+3.46945e-18i), 
(0.640409-3.46945e-18i), 
(0.616281+6.93889e-18i), 
(0.593994+0i), 
(0.573452-6.93889e-18i), 
(0.554554+6.93889e-18i), 
(0.537199+6.93889e-18i), 
(0.521284+0i), 
(0.506711+1.38778e-17i), 
(0.493378+1.38778e-17i), 
(0.481193+0i), 
(0.470062+2.77556e-17i), 
(0.459902+0i), 
(0.450631+0i), 
(0.442174+1.38778e-17i), 
(0.434462+1.38778e-17i), 
(0.427432+0i), 
(0.421025+4.16334e-17i), 
(0.415189+1.38778e-17i), 
(0.409875-1.38778e-17i), 
(0.405041+1.38778e-17i), 
(0.400647+0i), 
(0.396658+2.77556e-17i), 
(0.393042-4.16334e-17i), 
(0.38977+0i), 
(0.386818-2.77556e-17i), 
(0.384161-2.77556e-17i), 
(0.381781+0i), 
(0.379658-2.77556e-17i), 
(0.377775+2.77556e-17i), 
(0.37612+5.55112e-17i), 
(0.374679+0i), 
(0.373441+0i), 
(0.372396-2.77556e-17i), 
(0.371536+0i), 
(0.370853+0i), 
(0.37034+0i), 
(0.369992+0i), 
(0.369806-2.77556e-17i), 
(0.369776+0i), 
(0.369901+5.55112e-17i), 
(0.370023+2.77556e-17i), 
(0.370178+0i), 
(0.370607+2.77556e-17i), 
(0.371187-2.77556e-17i), 
(0.371918+0i), 
(0.372802-2.77556e-17i), 
(0.373841+0i), 
(0.375038+0i), 
(0.376397-2.77556e-17i), 
(0.377922-2.77556e-17i), 
(0.379619-5.55112e-17i), 
(0.381496+0i), 
(0.38356-2.77556e-17i), 
(0.38582+0i), 
(0.388287+0i), 
(0.390973-2.77556e-17i), 
(0.393891+0i), 
(0.397058+0i), 
(0.400492+0i), 
(0.404213+0i), 
(0.408243+0i), 
(0.412609+0i), 
(0.417338+0i), 
(0.422463+4.16334e-17i), 
(0.428021+0i), 
(0.434052-4.16334e-17i), 
(0.440602+1.38778e-17i), 
(0.447722+0i), 
(0.45547+1.38778e-17i), 
(0.46391+1.38778e-17i), 
(0.473114+2.77556e-17i), 
(0.483162+1.38778e-17i), 
(0.494143+0i), 
(0.506153+4.16334e-17i), 
(0.519302-1.38778e-17i), 
(0.533707+0i), 
(0.549494-2.77556e-17i), 
(0.566803+1.38778e-17i), 
(0.585777+0i), 
(0.606571-1.38778e-17i), 
(0.629344-2.08167e-17i), 
(0.654258-1.38778e-17i), 
(0.681478-6.93889e-18i), 
(0.711165+0i), 
(0.743478+0i), 
(0.778569+3.46945e-18i), 
(0.81658+1.73472e-18i), 
(0.857645-1.73472e-18i), 
(0.901887+0i), 
(0.949421+0i)
]