HDF5 RFC: Adding support for 16-bit floating point and Complex number datatypes to HDF5

The HDF Group is now considering adding native support for the following datatypes in the HDF5 library:

  • Half-precision, 16-bit floating-point (float16)
  • C99-compatible complex numbers
  • A true Boolean datatype (so users don’t have to create their own datatype or use integer datatypes to store Boolean values)
  • The bfloat16 floating-point format from Google Brain

We have created an RFC detailing what support for these new datatypes would look like and are actively soliciting feedback to ensure that this functionality will be useful to HDF5 users.

RFC__Adding_support_for_16_bit_floating_point_and_Complex_number_datatypes_to_HDF5.pdf (166.4 KB)

We also discussed a few of the finer points around supporting these new datatypes in a recent HDF5 Working Group meeting.

We would appreciate any feedback the community has to offer on the topic, especially concerning the on-disk binary format for these datatypes and any data interoperability concerns that you may have.


I am not sure if this should be part of the RFC, but if complex number datatypes are supported, you may need to think about how they should be supported in the data transform function. I can imagine that with complex numbers even more complicated transforms might be possible. In our library we make use of the data transform function to apply a unit conversion to the data during reading or writing.
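
For reference, this is roughly how such a transform is used today (a minimal sketch; the unit-conversion expression and function name are just placeholders). The expression language currently only knows the single variable "x", which is what makes the complex case interesting.

    #include "hdf5.h"

    /* Rough sketch: apply a unit conversion (here Celsius -> Fahrenheit)
     * while reading, via the existing data-transform property. */
    static herr_t read_with_unit_conversion(hid_t dset_id, double *buf)
    {
        hid_t  dxpl   = H5Pcreate(H5P_DATASET_XFER);
        herr_t status = H5Pset_data_transform(dxpl, "x*(9.0/5.0) + 32");
        if (status >= 0)
            status = H5Dread(dset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, buf);
        H5Pclose(dxpl);
        return status;
    }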

This is really great, thanks! The complex number stuff all looks very sensible to me. I’m very glad you’re already thinking about utilities for supporting existing conventions in applications (I’ve written a bit about such conventions, although mostly thinking about netCDF). I guess there’s not really any hope of files using the new types to be backwards compatible with previous versions of HDF5? Or will it be possible to read data stored as H5T_NATIVE_DOUBLE_COMPLEX using a compound type with e.g. HDF5 1.12?

There’s some mention in the RFC about there being some overhead in doing such a conversion – this is just for HDF5’s own data structures, right? The data itself can be cast directly into a struct { double r, i; } or double[2] with zero overhead.
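
For concreteness, the zero-overhead reinterpretation I mean looks like this in plain C99 (just an illustration, nothing HDF5-specific):

    #include <complex.h>
    #include <stdio.h>

    /* Per C99, double _Complex has the same representation as double[2],
     * real part first, so the buffer can simply be reinterpreted. */
    int main(void)
    {
        double _Complex z = 3.0 + 4.0 * I;
        double *parts = (double *)&z;   /* parts[0] = real, parts[1] = imaginary */
        printf("re=%g im=%g\n", parts[0], parts[1]);
        return 0;
    }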

There was also some discussion in the video about the in-memory native representation of complex according to the C standard, but from reading the RFC it looks like this has been resolved? I haven’t checked the actual standard, but cppreference.com says:

Each complex type has the same object representation and alignment requirements as an array of two elements of the corresponding real type (float for float complex, double for double complex, long double for long double complex). The first element of the array holds the real part, and the second element of the array holds the imaginary component.

which is pretty explicit. The C++ standard basically says std::complex is binary compatible with C, while the Fortran standard defines complex as “ordered pairs of real values”. So the on-disk representation in the RFC is compatible with all three languages (as well as Numpy for Python).

Some applications use a complex dimension (mostly in netCDF rather than HDF5 directly); this layout is also binary compatible, it’s just that indexing and memory spaces have some overhead when converting.

For _Bool, will arrays be stored as packed structures?

This is a great point! Indeed, I think the possible expressions for data transforms will need to be expanded upon to account for complex number types. While it may be a bit out of scope with regards to the initial implementation for these types, I think it’s worth at least discussing what needs to be added to those expressions (probably in terms of new variables besides just ‘x’) and at least opening a feature request in HDF5 to track this.

I vote for z. (complex variable) G.

libhdf5 1.12 is now deprecated, so these new datatypes will not be added to it. However, what you are asking is, I think, whether any libhdf5 with this new datatype will be able to read and convert complex data from HDF5 files created with previous libhdf5 versions. I think it should be able to if there is a conversion defined for the specific complex number storage convention.

Take care,
Aleksandar

I’ve written a bit about such conventions, although mostly thinking about netCDF

Thanks for your work! I used your page, among several other resources, as a reference when writing this RFC.

I guess there’s not really any hope of files using the new types to be backwards compatible with previous versions of HDF5? Or will it be possible to read data stored as H5T_NATIVE_DOUBLE_COMPLEX using a compound type with e.g. HDF5 1.12?

Certainly things here would have been much simpler if we’d had support for these types long ago… For backward-compatible files, the best approach will probably still be to create files using the compound datatype approach, since the type will be understandable by older versions of the library. Pending further discussions around this, it’s best to operate under the assumption that these will be new datatypes that shouldn’t be assumed to be backward-compatible.
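
For concreteness, the compound approach referred to here is a minimal sketch along these lines (the member names "r"/"i" are just one common convention, not something mandated):

    #include "hdf5.h"

    typedef struct { double r, i; } complex_t;

    /* A {r, i} compound whose layout matches double _Complex; any existing
     * HDF5 version can read data written with this datatype. */
    static hid_t make_complex_compound(void)
    {
        hid_t tid = H5Tcreate(H5T_COMPOUND, sizeof(complex_t));
        H5Tinsert(tid, "r", HOFFSET(complex_t, r), H5T_NATIVE_DOUBLE);
        H5Tinsert(tid, "i", HOFFSET(complex_t, i), H5T_NATIVE_DOUBLE);
        return tid;
    }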

… overhead in doing such a conversion – this is just for HDF5’s own data structures, right?

Correct, there is some internal work that will need to be done related to these conversions, but the idea is that this data should be directly castable into struct { double r, i; } or double[2], as needed, without any type of conversion overhead on the application side.

There was also some discussion in the video about the in-memory native representation of complex according to the C standard, but from reading the RFC it looks like this has been resolved?

Yes, a look into the C standard revealed the same thing to me as what you discovered, so I don’t think there are any concerns around that any more. As you also mentioned, it seems pretty much all the other languages involved follow suit, so binary compatibility shouldn’t be an issue.

For _Bool, will arrays be stored as packed structures?

In this case, does “arrays” refer to an HDF5 array datatype consisting of _Bool values as the base type? The intention would be that those values would be stored in 1 bit each linearly, so to speak.
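
For context, the sort of thing users have to do today, and which a true Boolean datatype would make unnecessary, looks roughly like this (the FALSE/TRUE enum convention, which I believe is what h5py uses):

    #include "hdf5.h"
    #include <stdint.h>

    /* Current workaround: a two-member enum over an 8-bit integer. A native
     * Boolean datatype stored 1 bit per value would replace this. */
    static hid_t make_bool_enum(void)
    {
        hid_t  tid = H5Tenum_create(H5T_NATIVE_INT8);
        int8_t f = 0, t = 1;
        H5Tenum_insert(tid, "FALSE", &f);
        H5Tenum_insert(tid, "TRUE", &t);
        return tid;
    }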


Certainly things here would have been much simpler if we’d had support for these types long ago… For backward-compatible files, the best approach will probably still be to create files using the compound datatype approach, since the type will be understandable by older versions of the library. Pending further discussions around this, it’s best to operate under the assumption that these will be new datatypes that shouldn’t be assumed to be backward-compatible.

Out of curiosity, how plausible is it to make the new complex datatypes backwards compatible? Is there much overhead in the file, for example, for storing the data as compatible with a compound type? Certainly this would be more work, but it’s probably worth just understanding what the costs are.

Essentially, since HDF5 is a self-describing file format, we would want to create a new datatype class (say H5T_COMPLEX) to describe the new types, as well as a new on-disk datatype structure that contains all the information necessary for interpreting the datatype and data of that datatype. The new on-disk datatype structure is what ends up being backward incompatible here: older versions of the library would see a datatype of the newly-added class “11” (see https://github.com/HDFGroup/hdf5/blob/develop/src/H5Tpublic.h#L30-L45), would have no idea how to interpret it, and would simply fail to decode the datatype when an HDF5 operation involves some object with that datatype. Without actually settling on the compound datatype format as the standard for complex numbers, the library can’t really store complex numbers with an associated on-disk datatype structure that’s directly compatible with the compound type format. The on-disk datatype structure would need to have a class value and form that older versions of the library can understand, with the closest representations being H5T_COMPOUND or H5T_ARRAY, at which point complex numbers are just a compound or array datatype.
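
For reference, an abbreviated (paraphrased, not verbatim) view of that class enum, with the hypothetical new class marked:

    /* Abbreviated from H5Tpublic.h; see the link above for the full list. */
    typedef enum H5T_class_t {
        H5T_NO_CLASS = -1,
        H5T_INTEGER  = 0,
        H5T_FLOAT    = 1,
        /* ... */
        H5T_COMPOUND = 6,
        /* ... */
        H5T_VLEN     = 9,
        H5T_ARRAY    = 10,
        /* hypothetical: H5T_COMPLEX = 11 -- unknown to older libraries */
        H5T_NCLASSES
    } H5T_class_t;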

While it’s inconvenient due to this compatibility issue, creating a new datatype class at the very least:

  • Makes HDF5 files self-describing with regards to complex number data
  • Along with the above (and as I’m sure you’re aware from your very useful nc-complex docs), makes interpretation of data across different parties, data ecosystems, etc. easier, since one shouldn’t have to rely on understanding that some file data, in the context of that file, represents “a complex number” and not “a compound member of two arbitrary double values”, for example
  • Makes complex numbers a first-class citizen datatype

This is all to say that while we could store complex numbers as actually just compound datatypes on disk, it doesn’t seem like the best approach for self-describing HDF5 files going forward.


It sounds problematic to introduce new, incompatible datatype classes for new data types that could be handled as compound data types. For new binary layouts such as float16 or bool this is more natural, but for semantically compound data types this looks like a bottomless pit. The same interest that exists for complex numbers would also naturally arise for vectors, like coordinates with x,y,z components, or colors with r,g,b and r,g,b,a components. Why not spend the effort instead on optimizing the handling of compound data types such that they provide the same performance as arrays, and provide definitions of the “most-used” compound data types in an add-on library shipped with HDF5, so that people can use the same predefined, recommended standard way? There is more complexity to come with supporting things like GL_RGB10_A2: 10 bits each for RGB, 2 for Alpha ( Image Format - OpenGL Wiki ), and it would be natural for HDF5 to read all such mixed strange formats as r,g,b compound types. So supporting complex numbers is just the start of a bigger picture, including quaternions, or geometric algebra in general, or any vector (geometric, colorimetric) or tensor types. Introducing a new class for each of those without backwards compatibility to compound types seems like an infinite effort.


for semantically compound data types this looks like a bottomless pit

Taking the standard’s wording on complex types at face value, it seems like they would be more like an array datatype, but that’s probably just arguing semantics and would be even worse from a compatibility perspective. At least to me, complex numbers seem like such a fundamental type of data that they’re worthy of a separate datatype class, but it’s unfortunate that that would come with these compatibility issues. However, making data self-describing is the critical problem that this new datatype class would be trying to solve and I think it’s something very important that we have to think about before falling back to just using a compound datatype, if that ends up being the case. One could also still keep using the compound type approach if they need their files to be readable by earlier versions of the library, but of course this doesn’t do much for the datatype fragmentation situation.

Why not spend the effort instead on optimizing the handling of compound data types such that they provide the same performance as arrays

One part of this work would be to provide new datatype conversion routines in HDF5 that should be able to seamlessly translate between the different compound type approaches and the in-memory representations (e.g., float _Complex), so at least some parts of this concern should end up being addressed.

provide definitions of the “most-used” compound data types in an add-on library shipped with HDF5, so that people can use the same predefined, recommended standard way?

This got me thinking a bit about having (assuming that we don’t go with a compound datatype as the format) a “complex_compat” sort of datatype that uses the most common compound type approach, mostly as a convenience for people wanting to stick to that format for compatibility.

So supporting complex numbers is just the start of a bigger picture, including quaternions, or geometric algebra in general, or any vector (geometric, colorimetric) or tensor types. Introducing a new class for each of those without backwards compatibility to compound types seems like an infinite effort.

Agreed. And I think at least for these types, what HDF5 really needs is something like a “tagged compound type”, similar to how opaque datatypes can have tags set on them with H5Tset_tag so that one can describe the data. However, I’m not sure that that can be done currently without changing the file format, so in a sense it’s a variant on the problem. It might also be an approach that’s applicable to complex numbers, but it doesn’t help the situation with older versions of the library much either.
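
For reference, this is what tagging looks like today for opaque types; the idea above would be an analogous tag on compound types (the tag string below is just an example):

    #include "hdf5.h"
    #include <stddef.h>

    /* Opaque types can already carry a free-form tag describing the data. */
    static hid_t make_tagged_opaque(size_t nbytes)
    {
        hid_t tid = H5Tcreate(H5T_OPAQUE, nbytes);
        H5Tset_tag(tid, "my-app:blob-v1");
        return tid;
    }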

Hi All,

I want to add some of my own perspectives to this dialog. I have yet to read everything already mentioned here, and I see there is a really great treatment of how various different scientific data formats have addressed things like complex data by @peter.hill. Thanks for sharing that. Very insightful!

I have played with half-precision in another C++ project and found this interesting header file, https://github.com/markcmiller86/hello-numerical-world/blob/main/Half.H that emulates it.

  • I wonder about supporting full quad precision? I have seen more of that lately in HPC/CSE than I recall seeing in the past.
  • Would it make sense to make these more exotic/advanced types optional? I mean, is there anything to be gained for average users (common cases) where they do not have to deal with these types in their builds? Probably not but it doesn’t hurt to ask
  • Given the common ways others have already tried to support complex types, would it make sense to offer some HL functionality that makes it easy for them to convert (on read or on write) between these common cases and the new native types? Maybe some extra “filters” for this? I dunno
  • Complex types are kinda interesting because they are vector valued (2-tuple). So, this raises some interesting questions not typically associated with int or float types…what is the order of the vector components (real, imag or imag, real) and what is the basis of the vector components (real/imag or mag/phase)? Using endianness as a metaphor, an HDF5 file does NOT require or impose an endianness in the file (e.g. there is no HUB format for endianness). Either of the two endiannesses is allowed and the file stores additional information indicating which. If we extend this to vector types, then we are confronted with some similar questions. Does it make sense for HDF5 to impose a basis (such as real/imag) in the file or does it make sense for HDF5 to allow whichever bases make sense and something in the file indicates which the basis is? I notice another responder mentioned a similar issue arises with values intended to represent colors. This is another vector valued quantity which brings with it notions of order of the vector components and the basis (HLS, CIY, CMYK, RGB, etc.).
  • Have you considered quaternions? It might be helpful to take a look at them from the perspective of understanding the “generalization territory” you may find yourself in. Is anyone doing stuff with quaternion-valued data in HDF5 (I asked the pro version of ChatGPT; it didn’t indicate so)? If so, how have they done it?
  • Will you support any primitive (int, float) type as the base type for complex numbers? For example, can you have complex numbers with 16 bit floats for the components or long doubles?
  • Electrical Engineering community is likely more interested in complex numbers as mag/phase than real/imag. That doesn’t necessarily mean mag/phase should be supported in the file but it might mean a whole class of users winds up getting hit with conversion (performance and accuracy) losses if the file doesn’t support it.
  • What about other tensor valued quantities such as we see in structural and mechanical engineering applications? They have stress tensors, etc. to deal with and I suspect those communities have used the same hacks (struct, additional array dimension, multiple associated datasets) to handle these. What might all these cases have in common that would be useful to perhaps refactor into a commonly useful feature?

Sorry, all I have is questions. I think this is a great addition to HDF5 and I enjoy the broader conversations about how to deal with these more complicated issues.


A complex (or in general, vector) data type would only be an array data type if it’s homogeneous. If you want to store the real part in double precision and the imaginary part in single precision, that is no longer the case. As compound data types, HDF5 would automatically convert such a mixed data type into a pair of double precision or single precision values. This is a great already existing feature, and with those mixed color-space data types like 10 bits for red, green, blue plus 2 bits for alpha, it becomes even more inhomogeneous. Re-implementing these data transformations via a new native datatype, with all those possible combinations, can that ever be effective?

The only issue seems that having a float x,y,z; compound type in a file and reading it as double x,y,z; compound data type is currently very slow, and storing data as float[3] and reading as double[3] is much faster. But is there a reason why this could not be optimized for the same performance? The library should be able to detect that float x,y,z; is in-memory compatible with float[3] and then provide the same performance, leaving the “slow”, generic conversion to the more exotic cases.

As a compound data type, HDF5 files remain very well self-describing, provided a meaningful semantics is used to define the compound data types. A while ago I wrote a paper discussing how data types such as those used in Geometric Algebra (the generalization of complex numbers and quaternions to arbitrary dimensions) and tensors (as used in General Relativity) can be modelled via HDF5: http://sciviz.cct.lsu.edu/papers/2009/GraVisMa09.pdf (skip forward to section 5.2 if you want to skip the math / physics part). Using named data types with attributes proves quite useful there; the only technical difficulty is that they can only be used on data types bound to a specific file, rather than applied to in-memory data types. This makes I/O on such data types rather laborious - if this part could be remedied such that in-memory data types can also hold attributes, it would address that issue well and solve lots of troubles. Attributes on data types are much more powerful than simple tagging (and they are already supported in HDF5 files).
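
To illustrate the pattern and its limitation (a rough sketch; the type, member names and attribute are just examples along the lines of the paper):

    #include "hdf5.h"

    typedef struct { float x, y, z; } vec3_t;

    /* The type must be committed (named) in a file before it can carry
     * attributes - exactly the limitation discussed above, since transient
     * in-memory types cannot hold attributes. */
    static void commit_vector_type(hid_t file_id)
    {
        hid_t tid = H5Tcreate(H5T_COMPOUND, sizeof(vec3_t));
        H5Tinsert(tid, "x", HOFFSET(vec3_t, x), H5T_NATIVE_FLOAT);
        H5Tinsert(tid, "y", HOFFSET(vec3_t, y), H5T_NATIVE_FLOAT);
        H5Tinsert(tid, "z", HOFFSET(vec3_t, z), H5T_NATIVE_FLOAT);

        H5Tcommit2(file_id, "cartesian_vector", tid,
                   H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Attach an attribute to the committed datatype itself. */
        hid_t sid = H5Screate(H5S_SCALAR);
        hid_t aid = H5Acreate2(tid, "coordinate_system", H5T_NATIVE_INT,
                               sid, H5P_DEFAULT, H5P_DEFAULT);
        int euclidean = 0;
        H5Awrite(aid, H5T_NATIVE_INT, &euclidean);

        H5Aclose(aid);
        H5Sclose(sid);
        H5Tclose(tid);
    }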

Just to mention Geometric Algebra here: only with this framework do you have a complete set of operations on vector data types in arbitrary dimensions. Quaternions are a subset of Geometric Algebra in 3D; complex numbers are the 2D case. However, there are different “flavors” of Geometric Algebra; some fields in computer graphics use conformal geometric algebra in five dimensions to quite some success. It would be too much for HDF5 to support it “natively” in all flavors, but it would totally make sense for HDF5 to support a basis framework on which such flavors can build, with complex numbers just being the start of it, but not as an independent, special case. There are also alternatives to Geometric Algebra when it comes to generalizing complex numbers and quaternions (like octonions, septenions, …).

As for using “tags” - this sounds like a terrible idea to me, because then you end up having users parse the string that is stored in a tag. Tagging makes sense for opaque data types, but with a compound data type we already have the capability to “parse” a data type description; actually it is already “parsed” as a compound. All that is needed is some generally agreed-on convention to say something like “any data type that contains x,y,z components (of whatever type) is a cartesian vector” or “any data type that contains x,y components is a 2D cartesian vector”. That also allows - via compound types - reading a 3D cartesian vector as a 2D cartesian vector, even if it is stored as “z,x,y” or whatever order. This is a great feature that HDF5 already provides. What may well be desirable (beside the mentioned performance optimizations) is the ability to define transformation “filters” that allow interpreting a {float x,y;} compound as a {float re,im;} structure (i.e., define them as equivalent), or even more, to perform actual coordinate transformations {float x,y;} → {float r, phi;}, as mentioned for complex numbers as well. The implementation of such transformations could go well into an add-on library, rather than being core HDF5, which should only handle the infrastructure for defining data types (possibly with attributes on transient types), equivalence relationships and support for transformation rules.


I am very happy to see this RFC. I am interested in complex number support and am happy with the proposed approach.

My only comment is to suggest that the HDF5 tools output complex numbers in x+yi or x+yj format, which is used in various places. However, the only objection I have to the proposed Complex { real: imag: } format is its verbosity; I am very much bikeshedding, and will happily accept this format if it leads to a speedy adoption of the RFC.

I can at least answer a few of these questions right away, while leaving others open-ended for now and keeping them under consideration:

  • I wonder about supporting full quad precision? I have seen more of that lately in HPC/CSE than I recall seeing in the past.

I see no issue in general with supporting quad precision types in HDF5; it mostly comes down to how well supported they are across platforms and compilers. If there are a variety of platform/compiler combinations that don’t have good support, it becomes a bit more painful to support in the library, especially when it comes to the type conversion code.

  • Would it make sense to make these more exotic/advanced types optional? I mean, is there anything to be gained for average users (common cases) where they do not have to deal with these types in their builds? Probably not but it doesn’t hurt to ask

Since the library keeps around global H5T_t structures in memory for each of the native and standard types available, as well as the library IDs for those types, the main benefit of the types being optional would seem to be less pressure on memory and available library IDs, though I’d expect that to be minimal (sizeof(H5T_t) = 104, currently). However, I could certainly see us just making templates available for creating/using some of these types as an alternative, likely through the HL interface.

  • Given the common ways others have already tried to support complex types, would it make sense to offer some HL functionality that makes it easy for them to convert (on read or on write) between these common cases and the new native types? Maybe some extra “filters” for this? I dunno

I’d certainly like to make it as easy as possible to convert from existing data into these types (and vice-versa) if we go this way with them. The thought is to provide datatype conversion paths that will accept a compound type in any of the forms used in the nc-complex doc mentioned in this thread, and directly convert between that and the C99 types, since they should be memory compatible.

  • Does it make sense for HDF5 to impose a basis (such as real/imag) in the file or does it make sense for HDF5 to allow whichever bases make sense and something in the file indicates which the basis is?

When writing the RFC, I was mostly sticking exactly to the C standard on complex types (with the ordering being real → imag) to give the best opportunity for efficient datatype conversions between the on-disk and in-memory formats. But as the discussion in this thread seems to be moving more toward thoughts around general vector types, I do think the ability to specify ordering for the vector components is likely going to be needed for those types. I’m still of the inclination to have HDF5’s support for complex numbers map directly to the C types for portability reasons, but this thread is giving me a lot of things to think about around future support for different datatypes.

  • Will you support any primitive (int, float) type as the base type for complex numbers? For example, can you have complex numbers with 16 bit floats for the components or long doubles?

I’m imagining a new H5T API for creating complex numbers, say H5Tcomplex_create, that takes a base datatype ID as a parameter. For simplicity, the three main types of float, double and long double would likely be the only supported base types at first, but this should allow for expansion in the future. For portability reasons it’s easier to support just these three at the moment, but I know GCC supports other base types, for example. I haven’t researched this for other compilers yet.
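
Purely as a hypothetical sketch of what usage might look like (neither the name nor the signature is settled):

    /* H5Tcomplex_create does not exist yet; this is only the shape of the
     * API being imagined above. */
    hid_t ctype = H5Tcomplex_create(H5T_NATIVE_FLOAT);   /* ~ float _Complex  */
    hid_t ztype = H5Tcomplex_create(H5T_NATIVE_DOUBLE);  /* ~ double _Complex */
    /* ... pass ctype/ztype to H5Dcreate2 / H5Dwrite like any other type ... */
    H5Tclose(ctype);
    H5Tclose(ztype);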

The rest of your questions are certainly among the things that need to be considered for this functionality, though again I’ve only been thinking of complex numbers in terms of the C standard for wide portability across platforms and languages. There’s definitely a broader discussion to be had here around general vector types.


A complex (or in general, vector) data type would only be an array data type if it’s homogeneous.

Indeed, I should say that the intent has been for initial complex number support in HDF5 to map directly to the C standard for portability reasons, in which case the type would be homogeneous and both components of the complex type would be stored with the same precision. Though, based on your insights and comments from @miller86 as well, I see that there’s a larger discussion to be had around creating a basis for more general vector types.

Re-implementing these data transformations via a new native datatype, with all those possible combinations, can that ever be effective?

I’d think not and would like to not run up against the combinatorial explosion there, but I also wasn’t (based on my previous comment) thinking of support for vector values in a more general fashion. The only reasonable way to properly support general vector types would seem to be with compound datatypes rather than a new native type. The reason why complex numbers are being considered as a special case here stems mostly from their direct support in the C standard. If they are implemented as a native type, conversion routines would be written for each of the more interesting combinations, but an HDF5 user could also always define their own conversion routine if they have a specific case they want handled efficiently. Granted, it is of course much more convenient if HDF5 can do this automatically.
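
For reference, the escape hatch I'm referring to already exists in the form of user-registered soft conversions; a rough sketch (the conversion here is deliberately trivial, treating the two layouts as bit-identical, and the name is just an example):

    #include "hdf5.h"

    /* Trivial user-defined conversion: the source {float r, i;} compound and
     * the destination float[2] array are assumed to be laid out identically,
     * so the buffer is left untouched. A real routine would rearrange or
     * convert elements in the H5T_CONV_CONV case. */
    static herr_t conv_cmpd_to_arr(hid_t src_id, hid_t dst_id, H5T_cdata_t *cdata,
                                   size_t nelmts, size_t buf_stride,
                                   size_t bkg_stride, void *buf, void *bkg,
                                   hid_t dxpl)
    {
        switch (cdata->command) {
            case H5T_CONV_INIT: /* verify here that src/dst really are compatible */
            case H5T_CONV_CONV:
            case H5T_CONV_FREE:
                return 0;
            default:
                return -1;
        }
    }

    /* Registered once, e.g. at startup:
     *   H5Tregister(H5T_PERS_SOFT, "cmpd2arr", src_tid, dst_tid, conv_cmpd_to_arr);
     */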

The only issue seems that having a float x,y,z; compound type in a file and reading it as double x,y,z; compound data type is currently very slow, and storing data as float[3] and reading as double[3] is much faster. But is there a reason why this could not be optimized for the same performance? The library should be able to detect that float x,y,z; is in-memory compatible with float[3] and then provide the same performance, leaving the “slow”, generic conversion to the more exotic cases.

Part of this work would involve fixing up the library to recognize some of these types of situations, regardless of the final on-disk format for complex number types, since there is of course an advantage to being able to directly convert from a float x, y; compound form to the C99 float _Complex type that’s just the same as float[2]. Converting from float x, y, z; to float[3] and converting from float x, y, z; to double[3] are just extensions on that, and could likely fall within the scope of this work.

As a compound data type, HDF5 files remain very well self-describing, provided a meaningful semantics is used to define the compound data types. A while ago I wrote a paper discussing how data types such as those used in Geometric Algebra (the generalization of complex numbers and quaternions to arbitrary dimensions) and tensors (as used in General Relativity) can be modelled via HDF5: http://sciviz.cct.lsu.edu/papers/2009/GraVisMa09.pdf (skip forward to section 5.2 if you want to skip the math / physics part). Using named data types with attributes proves quite useful there; the only technical difficulty is that they can only be used on data types bound to a specific file, rather than applied to in-memory data types. This makes I/O on such data types rather laborious - if this part could be remedied such that in-memory data types can also hold attributes, it would address that issue well and solve lots of troubles. Attributes on data types are much more powerful than simple tagging (and they are already supported in HDF5 files).

Thanks for the link to the paper! It’s always interesting reading about how people are using HDF5. Of course it seems that the issue lies in “provided a meaningful semantics is used to define the compound data types”. More discussion below between the attribute approach and possible alternatives…

Just to mention Geometric Algebra here: only with this framework do you have a complete set of operations on vector data types in arbitrary dimensions. Quaternions are a subset of Geometric Algebra in 3D; complex numbers are the 2D case. However, there are different “flavors” of Geometric Algebra; some fields in computer graphics use conformal geometric algebra in five dimensions to quite some success. It would be too much for HDF5 to support it “natively” in all flavors, but it would totally make sense for HDF5 to support a basis framework on which such flavors can build, with complex numbers just being the start of it, but not as an independent, special case. There are also alternatives to Geometric Algebra when it comes to generalizing complex numbers and quaternions (like octonions, septenions, …).

I must admit that I was missing the background on these higher order types, but it makes perfect sense that we want to consider building the framework for supporting these. From an API perspective, it seems like the H5Tcomplex_create API routine mentioned above could be a special case of a more general one for higher order types, but those are just my initial musings on the topic.

As for using “tags” - this sounds like a terrible idea to me, because then you end up having users parse the string that is stored in a tag.

Forgive me if this is a naive interpretation, but isn’t this similar to the approach discussed in your paper, in that one would need to parse information from the attributes attached to the named datatype used for data objects? If that’s the case, one could argue that there’s no need to add an alternative approach when yours suffices, but I’d say that it seems more natural to get information about a datatype directly from that type (e.g., through H5T API routines), rather than needing to store the type in the file and retrieve information about it through the H5A API. Yes, one would need to parse the tag associated with a compound type, but the idea would be that HDF5 could have a standard set of tag strings for various standardized data formats.

I can see how that doesn’t exactly generalize well compared to your approach with attributes though. Perhaps if there are common bits of information that you’re currently storing in attributes attached to named datatypes (such as the attribute shown in your paper that contains the dimensions of the data), these can be accounted for with an H5T API to make retrieving that information a little more straightforward? There would likely need to be some file format changes or adaptations so that this information could be stored in the file with a datatype structure, but I think that could also make it reasonable for that information to be associated with in-memory datatypes. Again though, it would probably be a bit less flexible than the attribute approach.

All that is needed is some generally agreed-on convention to say something like “any data type that contains x,y,z components (of whatever type) is a cartesian vector” or “any data type that contains x,y components is a 2D cartesian vector”.

Agreed that we need some sort of convention for representing a standardized type through use of a compound datatype, which is what I was (perhaps poorly) attempting to work towards with compound type tags. Though it may have no basis in reality, my concern around choosing “a compound type with members of this type named in this way” as a convention is that we may risk stepping on the toes of other standardized type formats in the future if they would have similarly named compound members. The tag approach just avoids placing importance on the naming of compound type members.

What may well be desirable (beside the mentioned performance optimizations) is the ability to define transformation “filters” that allow interpreting a {float x,y;} compound as a {float re,im;} structure (i.e., define them as equivalent), or even more, to perform actual coordinate transformations {float x,y;} → {float r, phi;}, as mentioned for complex numbers as well. The implementation of such transformations could go well into an add-on library, rather than being core HDF5, which should only handle the infrastructure for defining data types (possibly with attributes on transient types), equivalence relationships and support for transformation rules.

I think moving the implementations for these transformations elsewhere, either into a separate library or perhaps in HDF5’s high-level library, is an interesting idea worth exploring and could allow for much more specialized implementations where possible.

Having said all that, thank you for the great points for consideration!

My only comment is to suggest that the HDF5 tools output complex numbers in x+yi or x+yj format, which is used in various places. However, the only objection I have to the proposed Complex { real: imag: } format is its verbosity; I am very much bikeshedding, and will happily accept this format if it leads to a speedy adoption of the RFC.

Sure, we definitely aren’t settled on any particular format currently, so if people feel this is a better and more useful representation then I believe we’d be happy to use it. I mostly just want to see if there are alternate forms people may wish to see other than x+yi.

Sorry for the double post, but I wanted it to be in this thread, too.

Audio-only podcast version here.

Just to provide a bit of background on the higher order types, particularly the extension of complex numbers to arbitrary dimensions: for complex numbers, you have members like {re,im}. If you generalize this to a complex number of complex numbers, it becomes a four-component compound like {re.re, re.im, im.re, im.im}. That naming scheme does not make it easy to understand what those components actually mean. Interpreting those four components as quaternions, they would be named something like { scalar, i, j, k }, with “scalar” corresponding to the real part, and i,j,k the three “imaginary” components.

In the framework of Geometric Algebra we would name those components like this:

2D: scalar, x^y → complex number
3D: scalar, x^y, y^z, z^x → quaternion
4D: scalar, x^y, y^z, z^x, x^t, y^t, z^t → bi-quaternion
5D: …

which extends to arbitrary dimensions, and those vector types are subsets of the most general so-called multivector, which has 2^n components in its full form:

2D: scalar, x, y, x^y
3D: scalar, x,y,z, x^y, y^z, z^x, x^y^z
4D: scalar, x,y,z,t, x^y, y^z, z^x, x^t, y^t, z^t, x^y^z, x^y^t, y^z^t, z^x^t, x^y^z^t
5D: …

There is quite some work on using geometric algebra in 5D, e.g. https://geometricalgebra.org/ .

This is merely a demonstration of a generalized framework, in which both quaternions and complex numbers are just special, low-dimensional cases. Of course, there can still be different conventions on how to name the basis, e.g. some prefer to use e_1, e_2, e_3 instead of x,y,z for even more generalization beyond Euclidean space. Also, there are other generalizations besides Geometric Algebra.

The cool thing about using such a scheme with compound types in HDF5 is that we could always read a lower-dimensional subset automatically from a higher-dimensional dataset. For instance, if a dataset is stored as a 3D quaternion, but one application only wants to read the 2D subset {scalar, x^y} out of the full {scalar, x^y, y^z, z^x} dataset, then this already works right away with HDF5 as it is, no changes needed to the library (beyond performance issues). It would certainly be cool if the envisioned “future” complex numbers allowed the same functionality.
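
A minimal sketch of what I mean (the member names follow the naming scheme above and are only illustrative):

    #include "hdf5.h"

    typedef struct { double scalar, xy; } rotor2d_t;

    /* The file dataset holds a {scalar, x^y, y^z, z^x} compound, but the
     * memory type declares only the members we want; HDF5 matches members
     * by name and fills just those. */
    static herr_t read_2d_subset(hid_t dset_id, rotor2d_t *buf)
    {
        hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(rotor2d_t));
        H5Tinsert(mtype, "scalar", HOFFSET(rotor2d_t, scalar), H5T_NATIVE_DOUBLE);
        H5Tinsert(mtype, "x^y",    HOFFSET(rotor2d_t, xy),     H5T_NATIVE_DOUBLE);

        herr_t status = H5Dread(dset_id, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
        H5Tclose(mtype);
        return status;
    }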

The essential information in the approach from the paper is not in the attributes, but in the names of the components. So one just needs to know whether there is an “x” and a “y” component in the data type to find out that this dataset contains a 2D vector. This approach does not require any parsing, only checking the properties of the named datatype with the HDF5 API and comparing the names of the components. So this is done entirely through the H5T API.
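
A rough sketch of such a check through the H5T API (names follow the convention above):

    #include "hdf5.h"
    #include <stdbool.h>
    #include <string.h>

    /* Under this convention, a compound with "x" and "y" members is
     * (at least) a 2D cartesian vector - no string parsing involved,
     * only member-name comparison. */
    static bool is_cartesian_2d(hid_t cmpd_tid)
    {
        bool has_x = false, has_y = false;
        int  n = H5Tget_nmembers(cmpd_tid);
        for (int i = 0; i < n; i++) {
            char *name = H5Tget_member_name(cmpd_tid, (unsigned)i);
            if (strcmp(name, "x") == 0) has_x = true;
            if (strcmp(name, "y") == 0) has_y = true;
            H5free_memory(name);
        }
        return has_x && has_y;
    }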

Attributes on the named datatypes via the H5A API are optional information - such as a reference to the coordinate system that is related to those x,y coordinates, and storing the origin of that one, for example. This information is not essential for finding out what type that field is, but only information required later to determine how to perform coordinate transformations, for instance (or physical units). So this also does not require any string parsing.

Also, the benefit of not having tags, or a fixed list of strings, is that any {x,y,z} data type can be identified as a {x,y} data type subset automatically. Reading a semantically compatible subset comes “for free”. If one really wants to distinguish x,y,z from x,y then one can still also read and compare the number of components.

There are properties such as “grade”, “co-variance”, “contra-variance” and such that come with general tensor-like quantities. These may not always be required to read and operate on a dataset. So those can be “optional” in this sense and placed into attributes. It is fine to not read them if they are not needed. But definitely, it would be great if attributes could also be stored on transient, in-memory data types! Currently it is an inconvenient limitation that only named on-disk data types can store attributes. However, such an extension of the HDF5 library would not even require any changes to the H5T API; it would just mean that things work that currently do not (i.e., attributes on transient data types).

Hm, I would just favor that - placing importance on the naming of compound type members - rather than changing the HDF5 API and the file format. It should be acceptable to introduce a convention that e.g. compound member names starting with _ or H5__ are reserved, similar to the C standard also reserving identifiers that start with _. Or, maybe at most, introduce a bool tag in the H5T API telling whether a data type is an “HDF5 data type” versus a user-defined data type?

And support for the myriad coordinate transformations out there would be great for the geosciences area! But definitely, you can’t have them all implemented in the core library; those should rather go through a plugin system similar to the compression filters.

I think this could be handled with a command line option --imaginary-unit. The default would be i but that could be changed to j with --imaginary-unit j.

Aleksandar