Machine-readable units format?

This discussion is interesting, and my impression is that the motivation and the suggested standard lean towards making it programmatically easy to “convert & replot” in a chosen unit (e.g. foot rather than metre, or hours rather than seconds, without the full KOQ (kind of quantity) or application-specific information needed to distinguish Bq from Hz), but not towards making it easy to “document” an experimental recording.

I’m preparing to start saving data in HDF5 for our lab, and I will try to stay close to the NeXus standard developed for x-ray & neutron facilities. Since my own research field has no tradition of data formats interchangeable between labs, I’m not going to make a large effort to follow every possibility NeXus offers, but it describes a quite reasonable layout of the HDF5 data structures, one that enables generic programs to produce diagrams/plots by associating axes with signals. When the quantity I want to save already has a standardized name, I can of course use that rather than inventing my own.

When it comes to units, as already pointed out in this thread, NeXus simply uses a string attribute “units” on the dataset, which typically just contains the human-readable unit abbreviation (see http://download.nexusformat.org/doc/html/datarules.html#nexus-data-units). Strictly speaking they refer to Unidata UDunits, which seems really complex to fully support on the reading side (it allows unusual things like arbitrary offsets or scalings), but which is also capable of doing the unit conversions of data that I see as the aim of this thread. I don’t have experience of what existing NeXus-compatible tools do on the reading side. If they’re not written in C, I would actually doubt that they support all features of UDunits, but in practice that’s certainly fine.

A first issue is that if you name the new attribute “units”, it becomes impossible to support both NeXus and the new, more machine-convertible standard on the same dataset. So I would suggest finding another name, like “base_units” or “unit_base” (“base” because you didn’t want it to contain the scaling prefixes; the scaling numerator and denominator attributes then do the scaling).

I can understand the idea of developing a new standard that limits the complexity of full UDunits support on the reading side, but I would like to mention some issues for saving raw data with the current proposal.
As a general rule I want to be able to save the data in the units in which it was produced or calibrated, not convert directly to base SI units with floating point rounding. The idea of requiring an integer numerator and denominator would make it impossible to use degrees (for exact angles of rotating a mechanical component or an image, e.g. 45 or 90 degrees), molar concentrations in “mol/L”, the unit “eV” that is commonly used in photon, accelerator and surface science (the energy gained by one elementary charge across 1 V of electric potential), or the “°C” that my thermometer shows. If you allow floating point scaling factors, they can all be handled approximately by any reader. A domain-specific reader can, with some tolerance like 10*eps(scaling), recognize e.g. that the unit is “degrees” rather than an arbitrary rescaling of radians, and if the user wishes to use the data in degrees it then does not need to multiply by any inexact conversion factor.
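To make the tolerance idea concrete, here is a minimal Python sketch. Everything named in it is an assumption for illustration: the table `KNOWN_SCALES`, the function `recognise_unit`, the use of “rad” as a base-unit string, and the reading of 10*eps(scaling) as ten relative machine epsilons.

```python
import math
import sys

# Hypothetical lookup table: (base_units, unit name) -> scale factor to SI.
KNOWN_SCALES = {
    ("rad", "deg"): math.pi / 180,         # degree -> radian, irrational factor
    ("kg m2 s-2", "eV"): 1.602176634e-19,  # electronvolt -> joule
}

def recognise_unit(base_units, scale, n_eps=10):
    """Return a known unit name if `scale` matches it within n_eps relative epsilons."""
    for (base, name), reference in KNOWN_SCALES.items():
        tolerance = n_eps * sys.float_info.epsilon * abs(reference)
        if base == base_units and abs(scale - reference) <= tolerance:
            return name
    return None  # unknown scale: the reader falls back to multiplying by `scale`
```

A reader that gets “deg” back can hand the stored 45 or 90 to the user unchanged, while a reader that gets `None` simply multiplies by the stored scale and accepts the rounding.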

So, for my purposes it would make sense to both

  • use a very liberal/free-form NeXus/UDunits attribute “units” to write the actual units in a human-readable way, like “km/ns”, “°C”, “foot/minute”, “deg.eV^-1” or “deg eV-1”, “Bq” and “kiB” (for 1024 bytes; not sure if that is allowed by UDunits).
  • use something like the proposed standard for interchange/conversions, reducing it down to a description in terms of SI “base_units” (the text form with powers, as proposed, is nicely human-readable compared with an array of 7 powers in a standardized order) plus a scaling numerator and denominator (allowing floating point numbers for units that need them, but not for units like the “inch” that have an exact ratio).
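As a rough sketch of how the two attributes could sit side by side, the Python below annotates a dataset stored in “foot/minute” and parses the base-unit string. The attribute names `units_scale_numerator` and `units_scale_denominator`, and the helpers `parse_base_units` and `to_si`, are assumptions of this example, not settled parts of the proposal.

```python
from fractions import Fraction
import re

def parse_base_units(text):
    """Parse a string such as "kg m2 s-2" into {"kg": 1, "m": 2, "s": -2}."""
    powers = {}
    for token in text.split():
        match = re.fullmatch(r"([A-Za-z]+)(-?\d+)?", token)
        if match is None:
            raise ValueError(f"cannot parse unit token {token!r}")
        symbol, exponent = match.group(1), int(match.group(2) or 1)
        powers[symbol] = powers.get(symbol, 0) + exponent
    # Drop symbols whose powers cancel, e.g. "m1 m-1".
    return {s: p for s, p in powers.items() if p != 0}

# Hypothetical attributes for a dataset stored in foot/minute.
attrs = {
    "units": "foot/minute",            # free-form, human-readable (NeXus style)
    "base_units": "m s-1",             # SI base units, no prefixes
    "units_scale_numerator": 127,      # 1 ft/min = 0.3048 m / 60 s = 127/25000 m/s
    "units_scale_denominator": 25000,
}

def to_si(value, attrs):
    """Scale a stored value to SI base units using the exact rational factor."""
    scale = Fraction(attrs["units_scale_numerator"], attrs["units_scale_denominator"])
    return value * scale
```

A NeXus-aware tool would only look at `units`, while a conversion tool would only need `base_units` and the two scale attributes, so neither has to understand the other’s convention.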

I don’t have any strong opinion between the two approaches for whether rad and sr should be allowed as names in the list of base_units or whether “m1 m-1” and “m2 m-2” should encode them there (they do have some physical justification and are not simply a dimensionless amount). For units of (countable, dimensionless) amount, e.g. “mol”, “dozen”, “B” and “bit”, it would be nice if there was a clear solution too. To me it would make sense to let their base_unit be 1 or an empty string, and then have the numerator 12 for a dozen or 6.02214076×10^23 for mol. I’m not sure if we want to use scale 1/8 for bit or scale 8 for byte, but it would be a failure if the new standard did not allow automatic conversion between Mbit/s and MB/s.
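A small Python sketch of the Mbit/s to MB/s case, under the assumption that counts carry no base unit and that the scale is expressed relative to one bit (the other choice, scale 1/8 for bit, would work the same way). The dictionaries and the `convert` helper are hypothetical.

```python
from fractions import Fraction

# Hypothetical annotations: the count itself is dimensionless, so only "s-1" remains.
MBIT_PER_S = {"base_units": "s-1", "scale": Fraction(10**6)}       # 1 Mbit/s = 1e6 bit/s
MBYTE_PER_S = {"base_units": "s-1", "scale": Fraction(8 * 10**6)}  # 1 MB/s = 8e6 bit/s

def convert(value, src, dst):
    """Convert between two annotations that share the same base units."""
    if src["base_units"] != dst["base_units"]:
        raise ValueError("incompatible base units")
    return value * src["scale"] / dst["scale"]

# 100 Mbit/s expressed in MB/s, computed exactly.
rate_in_mbyte_per_s = convert(100, MBIT_PER_S, MBYTE_PER_S)
```

Note that with base units alone a data rate is indistinguishable from Hz or Bq here, which is the same ambiguity mentioned at the top of this post.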

For more headache, you may or may not want to think about logarithmic units like dB, astrophysical luminosity and the Richter magnitude scale. It would require some rigour in the standard and implementing programs to distinguish dB of intensity (power) or dB of amplitude (field), with V or mV or an audio reference level… I guess it would dimension-wise be possible to convert audio power onto the Richter scale, just as radioactive decay in Bq could be expressed in 1/s or (formally incorrect) Hz. For all such complicated domain-specific cases, there can always be the almost free-form “units” attribute, if you ensure that your specification uses another attribute name.

Thanks Erik; you make a good point that it would be useful to avoid conflicting with existing attributes used to describe units, so people can conveniently use both.

For irrational scalings like degrees, my thinking is that tools could recognise scale factors of 180/pi with a tolerance threshold (e.g. the precision available with 16-bit numerator & denominator) as referring to that unit. This gives a form of graceful degradation: tools that recognise that can handle them exactly as degrees (for instance), and simpler tools that don’t will still do something sensible with them, but with a small loss of precision.
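As an illustration of that graceful degradation, the Python sketch below finds a rational approximation whose numerator and denominator both fit in 16 bits and checks it against a tolerance. It assumes the stored factor maps degrees to radians (pi/180); the reciprocal direction works the same way. The function name `is_degrees` and the tolerance of 1/65535 are choices made for this example.

```python
from fractions import Fraction
import math

exact = math.pi / 180  # radians per degree, as a double

# Closest fraction to `exact` whose denominator fits in an unsigned 16-bit integer.
approx = Fraction(exact).limit_denominator(65535)

# Relative precision lost by a simple tool that just multiplies by the fraction.
rel_err = abs(float(approx) - exact) / exact

def is_degrees(numerator, denominator, rel_tol=1 / 65535):
    """Treat any scale close enough to pi/180 as meaning 'degrees'."""
    return math.isclose(numerator / denominator, math.pi / 180, rel_tol=rel_tol)
```

A tool that calls `is_degrees` can treat the data as exact degrees, and one that does not will still convert to radians with a relative error of `rel_err`.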

For logarithmic scales, my plan is not to go there for now. 🙂

To remind everyone where this topic got to, I’ve now suggested two slightly different schemes:

  1. Based around SI base units, so a measurement in kJ would be annotated with units="kg m2 s-2" (or base_units as Erik suggests) and units_scale_numerator=1000 (the original proposal).
  2. With a specified list of quantity names and a scale factor explicitly to the SI units for that quantity. So kJ would be annotated quantity="energy" and si_unit_scale_numerator=1000.
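To compare the two schemes side by side, here is a short Python sketch that reduces either annotation to the same (base units, scale) pair. The denominator attribute names and the `QUANTITY_BASE_UNITS` table are assumptions; a real standard would have to fix the list of quantity names.

```python
from fractions import Fraction

# Scheme 1: SI base units plus a rational scale.
scheme1 = {"units": "kg m2 s-2", "units_scale_numerator": 1000}

# Scheme 2: a named quantity plus a scale to that quantity's SI unit.
scheme2 = {"quantity": "energy", "si_unit_scale_numerator": 1000}

# Hypothetical table from quantity name to SI base units.
QUANTITY_BASE_UNITS = {"energy": "kg m2 s-2", "frequency": "s-1", "activity": "s-1"}

def normalise(attrs):
    """Return (base_units, scale) for either annotation scheme."""
    if "quantity" in attrs:
        base = QUANTITY_BASE_UNITS[attrs["quantity"]]
        num = attrs.get("si_unit_scale_numerator", 1)
        den = attrs.get("si_unit_scale_denominator", 1)  # assumed attribute name
    else:
        base = attrs["units"]
        num = attrs.get("units_scale_numerator", 1)
        den = attrs.get("units_scale_denominator", 1)    # assumed attribute name
    return base, Fraction(num, den)
```

Both annotations for kJ normalise to the same pair, but only scheme 2 keeps “frequency” and “activity” apart, since both reduce to “s-1” under scheme 1.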

I am interested in looking at it, and in cooperating on a good, simple, accepted solution. However, I can’t speak for others. What you are looking at is a long acceptance procedure for a nice-to-have component (a unified way to tag datasets), with a possibility of rejection.

From an HDF5 perspective, I believe that units’ natural “anchor place” is the HDF5 datatype. Unit information would be stored and discoverable through appropriate metadata. As you can have user-defined datatypes, you can have user-defined units as part of the same package. Ditto for datatype conversions, which would become unit-aware, i.e., the HDF5 library would refuse to assign/convert the element 1.0 [m] to an element in a dataset with element type H5T_IEEE_F64LE:[s], which are non-convertible datatypes. A user-defined soft conversion could overwrite this behavior, but then it’d be clear who to blame when things go wrong. This would also open the door to integration with compiler-assisted “unit sanity enforcement.” I don’t know if it would meet your definition of ‘simple.’ It certainly would be effective and robust.
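To illustrate the behavior only, here is a toy Python model of a unit-aware conversion that refuses mismatched units unless a user-supplied override exists. None of this is HDF5 library API: `UnitType`, `convert` and `soft_conversions` are hypothetical names for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnitType:
    """Hypothetical pairing of a storage type with a unit; not an HDF5 API."""
    storage: str  # e.g. "H5T_IEEE_F64LE"
    unit: str     # e.g. "m" or "s"

def convert(value, src, dst, soft_conversions=None):
    """Convert `value` from `src` to `dst`, refusing mismatched units by default."""
    if src.unit == dst.unit:
        return value
    handler = (soft_conversions or {}).get((src.unit, dst.unit))
    if handler is None:
        raise TypeError(f"cannot convert [{src.unit}] to [{dst.unit}]")
    # A user-registered override: whoever registered it owns the result.
    return handler(value)

metres = UnitType("H5T_IEEE_F64LE", "m")
seconds = UnitType("H5T_IEEE_F64LE", "s")
```

Calling `convert(1.0, metres, seconds)` raises `TypeError`, while passing a `soft_conversions` entry for `("m", "s")` makes the same call succeed using the user’s own rule.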

We had a proposal to that effect a few years ago, but it didn’t find a lot of sympathy with reviewers. If anybody wants to join and try again, let me know!

G.
