This discussion is interesting. My impression is that the motivation and the suggested standard lean towards making it programmatically easy to “convert & replot” in a chosen unit (e.g. foot rather than metre, or hours rather than seconds, without the full KOQ or application-specific information to distinguish Bq from Hz), but not towards making it easy to “document” an experimental recording.
I’m preparing to start saving data in HDF5 for our lab, and I will try to stay close to the NeXus standard developed for x-ray & neutron facilities. Since my own research field doesn’t have a tradition of data formats interchangeable between labs, I’m not going to make a large effort to follow every possibility in NeXus, but it describes a quite reasonable layout of the HDF5 data structures that enables generic programs to produce diagrams/plots by associating axes with signals. When the quantity I want to save already has a standardized name, I can of course use that rather than inventing my own.
When it comes to units, as already pointed out in this thread, NeXus simply uses a string attribute “units” on the dataset, which typically just contains the human-readable unit abbreviation. http://download.nexusformat.org/doc/html/datarules.html#nexus-data-units Strictly speaking they refer to Unidata UDunits, which seems really complex to fully support on the reading side (it allows unusual things like arbitrary offsets or scalings), but which is also capable of the unit conversions that I see as the aim of this thread. I don’t have experience of what existing NeXus-compatible tools do on the reading side; if they’re not written in C, I would actually doubt that they support all features of UDunits, but in practice that’s certainly fine.
A first issue is that if you name an attribute “units”, it becomes impossible to support both NeXus and the new, more machine-convertible standard on the same dataset. So I would suggest finding another name, like “base_units” or “unit_base” (base because you didn’t want it to contain the scaling prefixes; the scaling numerator and denominator attributes then do the scaling).
I can understand the idea of developing a new standard that limits the complexity of full UDunits support on the reading side, but I would like to mention some issues with saving raw data under the current proposal.
As a general rule I want to be able to save the data in the units in which it was produced or calibrated, not convert it directly to base SI units with floating-point rounding. Requiring an integer numerator and denominator would make it impossible to use degrees (for exact angles when rotating a mechanical component or an image, e.g. 45 or 90 degrees), molar concentrations in “mol/L”, the unit “eV” that is commonly used in photon, accelerator and surface science (the energy gained by one elementary charge across 1 V of electric potential), or the “°C” that my thermometer shows. If you allow floating-point scaling factors, they can all be handled approximately by any reader, and a domain-specific reader can, with some tolerance like 10*eps(scaling), recognize e.g. that the unit is “degrees” rather than an arbitrary rescaling of radians. If the user wishes to use the data in degrees, the reader then does not need to multiply by any inexact conversion factor.
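The tolerance check described above could look something like the following minimal sketch. The table `KNOWN_SCALES`, the function name `recognise_unit`, the spelling of the base-unit strings and the default of 10 ULPs are all my own assumptions for illustration, not part of the proposal. `math.ulp` needs Python 3.9 or later.

```python
import math

# Hypothetical table: (base_units string, unit name) -> floating-point scale.
# The exact text format of base_units is assumed here.
KNOWN_SCALES = {
    ("rad", "deg"): math.pi / 180,            # degrees -> radians
    ("kg m2 s-2", "eV"): 1.602176634e-19,     # electronvolt -> joule
}

def recognise_unit(base_units, scale, tol_ulps=10):
    """Return the name of a known unit whose scale matches the stored
    scale within tol_ulps units in the last place, or None."""
    for (base, name), known in KNOWN_SCALES.items():
        if base == base_units and abs(scale - known) <= tol_ulps * math.ulp(known):
            return name
    return None
```

A reader that gets `"deg"` back can hand the stored numbers to the user unchanged, so an angle saved as 45 stays exactly 45.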
So, for my purposes it would make sense to both:
- use a very liberal, free-form NeXus/UDunits-style “units” attribute to write the actual units in a human-readable way, like “km/ns”, “°C”, “foot/minute”, “deg.eV^-1” or “deg eV-1”, “Bq” and “kiB” (for 1024 bytes; I’m not sure whether UDunits allows it).
- use something like the proposed standard for interchange/conversions, reducing the unit to a description in terms of SI “base_units” (the text form with powers, as proposed, is nicely human-readable compared with an array of 7 powers in a standardized order) plus a scaling numerator and denominator (allowing floating-point numbers for units that need them, but not for units like the “inch” that have an exact ratio).
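To make the two-attribute idea concrete, here is a rough sketch using a plain dictionary in place of the HDF5 attributes of one dataset. Only “units” is the NeXus name; `base_units`, `scale_numerator`, `scale_denominator` and the helper `to_base` are hypothetical names that I picked for the example.

```python
from fractions import Fraction

# Attributes as they might sit on one dataset.
speed_attrs = {
    "units": "foot/minute",       # free-form, human-readable, NeXus style
    "base_units": "m s-1",        # SI base units only, no prefixes (assumed format)
    "scale_numerator": 127,       # 1 foot/minute = 0.3048/60 m/s = 127/25000 m/s
    "scale_denominator": 25000,
}

def to_base(value, attrs):
    """Convert a stored value to SI base units using the scale attributes."""
    num = attrs["scale_numerator"]
    den = attrs["scale_denominator"]
    if isinstance(num, int) and isinstance(den, int):
        return value * Fraction(num, den)   # exact ratio, no rounding for integer input
    return value * (num / den)              # floating-point scale, e.g. degrees

print(to_base(250, speed_attrs))  # 250 foot/minute expressed in m/s
```

A NeXus-only reader would simply ignore the three extra attributes, while a conversion tool would ignore “units”.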
I don’t have any strong opinion between the two approaches, i.e. whether rad and sr should be allowed as names in the list of base_units or whether “m1 m-1” and “m2 m-2” should encode them there (they do have some physical justification and are not simply a dimensionless amount). For units of (countable, dimensionless) amount, e.g. “mol”, “dozen”, “B” and “bit”, it would be nice if there were a clear solution too. To me it would make sense to let their base_unit be 1 or an empty string, and then have the numerator 12 for a dozen or 6.02214076×10^23 for mol. I’m not sure if we want to use scale 1/8 for bit or scale 8 for byte, but it would be a failure if the new standard did not allow automatic conversion between Mbit/s and MB/s.
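As a small illustration of the Mbit/s versus MB/s case, the sketch below treats both as dimensionless counts per second and arbitrarily gives the bit scale 1, so the byte gets scale 8. The tables and the function `convert_rate` are hypothetical, and the opposite choice (byte as 1, bit as 1/8) would work just as well.

```python
from fractions import Fraction

# Hypothetical scales for countable units; their base unit is the dimensionless "1".
COUNT_SCALES = {
    "bit": Fraction(1),
    "B": Fraction(8),     # 1 byte = 8 bit
}
SI_PREFIXES = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9}

def rate_scale(prefix, unit):
    """Scale of e.g. ("M", "bit") per second relative to 1 bit per second."""
    return SI_PREFIXES[prefix] * COUNT_SCALES[unit]

def convert_rate(value, src, dst):
    """Convert a rate between two (prefix, unit) pairs sharing the same base."""
    return value * rate_scale(*src) / rate_scale(*dst)

# 100 Mbit/s expressed in MB/s
print(convert_rate(100, ("M", "bit"), ("M", "B")))
```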
For an additional headache, you may or may not want to think about logarithmic units like dB, astrophysical luminosity and the Richter magnitude scale. It would require some rigour in the standard and in implementing programs to distinguish dB of intensity (power) from dB of amplitude (field), with V or mV or an audio reference level… I guess it would dimension-wise be possible to convert audio power onto the Richter scale, just as radioactive decay in Bq could be expressed in 1/s or (formally incorrect) Hz. For all such complicated domain-specific cases, there can always be the almost free-form “units” attribute, if you ensure that your specification uses another attribute name.
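The power/field distinction matters because the same dB value maps to different linear ratios. A minimal sketch, with a hypothetical function name and a `kind` argument that a standard would have to encode somehow:

```python
def db_to_ratio(level_db, kind):
    """Convert a level in dB to a linear ratio relative to its reference."""
    if kind == "power":
        return 10 ** (level_db / 10)   # dB of intensity (power quantity)
    if kind == "field":
        return 10 ** (level_db / 20)   # dB of amplitude (field quantity)
    raise ValueError(f"unknown kind: {kind!r}")
```

So 20 dB is a factor of 100 in power but only a factor of 10 in amplitude, and a reader that does not know which one was meant cannot convert the data correctly.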