Naming of HDF5 datatypes: looking for feedback

The HDF Group is currently in the process of adding support for several new predefined datatypes to make working with specific types of data a little more convenient. This process has brought up some points that have been discussed internally, but we also wish to gather any feedback that the HDF5 community might have to help shape development. We are looking to get any feedback by Friday, April 25th so that we can continue to move forward with development. Thanks for any feedback you may have!

  1. Standardizing naming of HDF5 predefined datatypes

Predefined datatypes in HDF5 have typically followed a naming convention of H5T_<Architecture or standard>_<Specifics>. For example, H5T_IEEE_F32LE denotes a 32-bit little-endian IEEE format floating point datatype. With some of the new datatypes that we’d like to add support for, this naming convention may not give enough information to make a datatype’s composition clear, or may make it awkward to come up with a good name. We’d like to propose standardizing on a convention of H5T_<Type class>_<Optional architecture or standard>_<Specifics> going forward. See “Add predefined datatypes for bfloat16 data” by jhendersonHDF (Pull Request #5402, HDFGroup/hdf5 on GitHub) for more context around this issue. While this is a minor change, the intent is to account for datatypes that don’t fall into a particular standard while also making it clear up front what type of data the datatype represents (by adding the type class information). To retain compatibility, the names of existing HDF5 datatypes would not be changed, but new names for existing datatypes which match the new convention could potentially be added.
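As a rough illustration of how names would be composed under the proposed convention, here is a minimal Python sketch. The helper `predefined_name` is hypothetical and not part of HDF5, and names it produces such as H5T_FLOAT_IEEE_F32LE and H5T_FLOAT_BFLOAT16 are only examples of the pattern, not names the library is known to define.

```python
def predefined_name(type_class, specifics, standard=None):
    """Compose H5T_<Type class>_<Optional architecture or standard>_<Specifics>.

    Hypothetical helper for illustration only; it does not exist in HDF5.
    """
    parts = ["H5T", type_class]
    if standard:  # the architecture/standard component is optional
        parts.append(standard)
    parts.append(specifics)
    return "_".join(part.upper() for part in parts)


# A possible new-style alias for the existing H5T_IEEE_F32LE (assumption).
print(predefined_name("FLOAT", "F32LE", standard="IEEE"))  # H5T_FLOAT_IEEE_F32LE
# A type that follows no particular standard simply omits that component.
print(predefined_name("FLOAT", "BFLOAT16"))                # H5T_FLOAT_BFLOAT16
```

The point of the sketch is only that the type class always leads, so a reader can tell what kind of data a name describes even when the remaining components are unfamiliar.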

  2. Importance of big-endian predefined datatypes

Historically, both little- and big-endian variants of various HDF5 predefined datatypes have been added to the library. Doing so adds a maintenance burden, however slight, and also slightly increases the runtime memory usage of HDF5, because these datatypes and their associated IDs are kept resident while the library is in use. We are now considering whether it makes sense to introduce only little-endian variants of predefined datatypes going forward. Big-endian variants would then either be the application’s responsibility, by converting a little-endian datatype into a big-endian one with H5Tset_order, or could be created as needed by something like high-level library routines rather than always being kept in memory.
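To make the "derive big-endian on demand" idea concrete, here is a small Python model of it. This is not HDF5 code: `FloatType`, `F32LE`, `with_order` and `encode` are hypothetical names used only for illustration. In the C library the analogous step would presumably be copying a predefined little-endian type with H5Tcopy and calling H5Tset_order on the copy.

```python
import struct
from dataclasses import dataclass, replace

# struct format characters for IEEE half, single and double precision.
_FORMAT_BY_SIZE = {2: "e", 4: "f", 8: "d"}


@dataclass(frozen=True)
class FloatType:
    """Toy stand-in for a datatype descriptor (hypothetical, not an HDF5 type)."""
    size: int    # size in bytes
    order: str   # "little" or "big"


# Only the little-endian variant is kept "resident" in this model.
F32LE = FloatType(size=4, order="little")


def with_order(dtype, order):
    """Return a copy of dtype with a different byte order, created on demand."""
    if order not in ("little", "big"):
        raise ValueError(f"unknown byte order: {order!r}")
    return replace(dtype, order=order)


def encode(dtype, value):
    """Serialize value according to the descriptor's size and byte order."""
    prefix = "<" if dtype.order == "little" else ">"
    return struct.pack(prefix + _FORMAT_BY_SIZE[dtype.size], value)


f32be = with_order(F32LE, "big")
print(encode(F32LE, 1.0).hex())  # 0000803f
print(encode(f32be, 1.0).hex())  # 3f800000
```

The trade-off under discussion is visible here: the big-endian descriptor costs the application one extra call, but nothing for it has to exist until someone asks for it.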

I suggest some foundational changes to the datatype system, in support of your “more convenient” request above.

  • Don’t reinvent the wheel. Avoid a complicated systematic scheme that tries to solve all problems in advance. Go for industry standard or evolving standard simple base names. Examples from the PR #5402 discussion:

    H5T_BFLOAT16
    H5T_FP8_E4M3
    H5T_FP8_E5M2

  • Stop overloading type names with endian variants. Use the single base type name everywhere. Handle the elemental storage details with defaults, function call properties, and storage headers in the physical file. For example, a default of “native” will handle both the memory interface and the physical file storage, on both big- and little-endian systems. Only in unusual cases, engage special properties to do something different and explicit.
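A minimal Python sketch of the "one base name, byte order as a property with a native default" idea follows. The function `pack_f32` and its `order` parameter are hypothetical and only model the behavior; they say nothing about how HDF5 property lists would actually expose it.

```python
import struct
import sys

# Map the hypothetical "order" property onto struct's byte-order prefixes.
_PREFIX = {"native": "=", "little": "<", "big": ">"}


def pack_f32(value, order="native"):
    """Encode a 32-bit float; byte order is a property, not part of the name."""
    return struct.pack(_PREFIX[order] + "f", value)


# The default follows the host, so the common case needs no endian choice.
assert pack_f32(1.5) == pack_f32(1.5, order=sys.byteorder)

# Only the unusual case names a byte order explicitly.
print(pack_f32(1.5, order="big").hex())  # 3fc00000
```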

I believe this new scheme can be implemented while preserving all existing type names and API functionality. Property list extensions would be needed. A format extension to the physical datatype header is likely, but I have not looked into that.

@dave.allured Thanks for your ideas. I’m sure others will have some discussion on this, but a few things to start off with:

Avoid a complicated systematic scheme that tries to solve all problems in advance.

I certainly don’t think the naming convention has to be particularly complicated here, but I feel fairly strongly about at least including the type class in the name. I’m of the opinion that a quick glance at a type name should immediately give you an understanding of what type of data the datatype represents, without necessarily having to refer to the library’s documentation or some other resource for the type. I think the examples mentioned don’t really illustrate this since I’d expect ‘FP8’ or ‘BFLOAT’ to be intuitive for many, but I can foresee names for ‘non-standard’ types that may not give you a good intuition about the type of data right away. The primary motivator here was that the library previously included ‘IEEE’ in most of the predefined floating point datatype names, because that’s what was generally supported. However, most of the new floating point types we want to add support for aren’t IEEE types and some don’t even follow the IEEE convention, so I wanted a naming convention which clarifies that the different sets of types (and potential new, less well-named types in the future) are both floating-point datatypes, while retaining the information that some of the datatypes represent data according to the IEEE convention.

When considering a potential future predefined datatype for a complex number consisting of two bfloat16 components, it would seem natural to me to pick a name such as H5T_COMPLEX_BFLOAT16 (which is basically similar to what I chose for a complex number of float16, H5T_COMPLEX_IEEE_F16LE). This conveniently falls into my suggestion of adding the type class into the name, at which point I’d say I prefer the symmetry of H5T_FLOAT_BFLOAT16, but I could certainly see why some might consider that redundant. It’s really just my preference of trying to make the datatypes ‘look the same’.

For example, a default of “native” will handle both the memory interface and the physical file storage, on both big- and little-endian systems.

We’ve been discussing this and I believe it’s likely the direction we’ll move in, but until we’ve come to a decision I’m trying to determine the best way to introduce some of these new datatypes in the interim. Adding big-endian variants as usual is simply more convenient from an application perspective, but they might end up being mostly useless in the end. On the other hand, not adding big-endian variants means that the single datatype added will either be little-endian by default, which slightly decreases convenience and usability if you need a big-endian type, or will be set to native endianness, which could be a surprising change, though this might also be the time to start that convention.