Machine-readable units format?

Thanks for the proposal. It will certainly be good to have a standard.
I can say a bit more about how units are used in casacore (a (radio-)astronomical C++ library with a Python interface).

Casacore has quantities (values with units) and measures (quantities with a reference frame defining e.g. the zero point of time, or whether sky coordinates are equatorial, galactic, etc.). Note that some coordinate frames (e.g. azimuth/elevation) may depend on other data such as time or position on Earth. In a table schema only constant frames are used.

It can only handle multiplicative units, so it cannot convert e.g. Celsius to Kelvin or Fahrenheit. Units are stored as strings, which are parsed to determine the conversion factor. Note that this takes little time, as it is usually done only once to access an entire column in a data set.
The data are stored in a tabular way, and each column can have a unit and frame defined in the table schema. When writing or reading the data, the unit and/or frame can be given and the data will be converted as needed. I must say that usually the data are read/written directly, and thus already have the correct unit.
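
To make the multiplicative scheme concrete, here is a minimal Python sketch of what such string parsing amounts to. This is not casacore's actual parser; the prefix and unit tables are illustrative only:

    from fractions import Fraction

    # Illustrative tables; real parsers (casacore, UDUNITS) cover far more.
    PREFIXES = {"": Fraction(1), "k": Fraction(1000), "m": Fraction(1, 1000)}
    BASE_UNITS = {"m", "s", "g"}

    def factor(unit):
        """Conversion factor from `unit` to its unprefixed base unit."""
        for prefix, scale in PREFIXES.items():
            if unit.startswith(prefix) and unit[len(prefix):] in BASE_UNITS:
                return scale
        raise ValueError("unknown unit: " + unit)

    # Parse once, then apply one factor to a whole column of values.
    print(float(factor("km") / factor("m")))  # 1000.0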

The table query language (http://casacore.github.io/casacore-notes/199.html) can handle units and frames and converts as needed (also useful as desk calculator :-).

I wonder if it should be considered to also support coordinate frames in HDF5. This can get quite complex though, much more than units.

A few other remarks:

  • Note that bytes use prefixes such as k for 1000 and Ki for 1024.

  • I also like g much more than kg as the base unit.

  • I don't see any problem in rad-deg conversion with an irrational factor. Inch-m basically has the same problem. It is more a question of how precise you want to be.

  • I wonder how to define the units of fields in a compound data type. By having a vector value in the unit attribute?

  • Will the actual unit be clear to the user when seeing a string, numerator and denominator? When seeing m, 1, 1000, people might doubt whether it is mm or km, because it is unclear whether the values define the conversion from the stored unit to the actual unit or from the actual unit to the stored unit (see the sketch after this list).
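
To illustrate with the hypothetical triple units="m", scale_numerator=1, scale_denominator=1000 from the bullet above, the two readings differ by a factor of a million:

    raw = 5.0  # a stored value

    # Reading 1: stored * 1/1000 gives metres, so the data are in mm.
    metres_if_mm = raw * 1 / 1000   # 0.005 m

    # Reading 2: metres * 1/1000 gives the stored value, so the data are in km.
    metres_if_km = raw * 1000 / 1   # 5000.0 m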

Thomas Kluyver noreply@forum.hdfgroup.org writes:

I've put together some ideas. I'm going to dump it into this post for
now, so this will be a long message, but if people are interested,
I'll put it in a git repository so we can work further on it.

First of all, thanks for this proposal which I think is important and
already very well defined in this first draft.

Since the discussion seems to take up many issues that have been
discussed countless times in the context of handling units in
computations, let me add a bit of background information to avoid
unproductive repetition.

First of all, the proposal is based on the SI unit system. It is
important to realize that SI was designed to standardize measurements
and their communication. It was NOT designed for automated dimensional
analysis, nor for assigning units to computational results. If you want
to describe units for physical measurements, SI is fine. If you want to
describe units in software, it isn't. That's a design decision to make
for any unit labeling system.

Second, a complete specification for a physical quantity consists of the
unit plus the kind-of-quantity (KOQ; all of this is defined by the
standard ISO 80000-1, which is quite complex). KOQ distinguishes between
quantities that are dimensionally equivalent and yet incommensurable. An
example is radioactivity vs. frequency. Both are time⁻¹, but that's all
they have in common.

SI has no systematic support for KOQ, but its derived units are
combinations of dimension plus KOQ that cover the most frequent needs
in measurement practice. For example, "Newton" is unit = m kg/s² and
KOQ = force. This also removes some frequent ambiguities. For
example, Bq for radioactivity vs. Hz for frequency.

However, introducing derived units into computations is a recipe for
enduring headaches. There is no known general algorithm for deciding if
a quantity of unit m kg/s² has KOQ force or not. In a specific context,
such as celestial mechanics, it is possible to encode such decisions as
a small set of rules, but not in general for every possible application
context.

SI also introduces a few pseudo-dimensions that frequently cause
trouble: rad, sr, and mol. They should logically be derived
dimensionless units, since all they do is associate a specific KOQ with
a number. But SI doesn't consider "dimensionless" a unit, and therefore
adds these three units as base units. Their role is to have frequent nasty
conversion factors covered by the unit system (rad, sr), or to permit precise
measurements on vastly different scales (mol).

The current proposal leans heavily towards computational management of
units, so I'd propose to leave KOQ completely out, including in its
weaker form of SI derived units. No rad, no sr. Nor bytes or dots. The
one exception is mol, which never causes trouble in computations because
it is based on a conversion factor that is measured rather than
defined. Not having it would make computational chemistry very painful.

Finally, a word on non-SI units with scale factors that are not powers
of ten (inches etc.). I strongly recommend storing these scale factors
as rationals, not floats. Otherwise, it becomes difficult for software
to recognize inches as such: to what precision should the scale factor
be equal to 2.54 times some power of ten for the unit to be considered
inches?
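
A sketch of the point in Python: with rationals the test is exact, while a float forces an arbitrary tolerance choice.

    from fractions import Fraction

    # Rational scale factors make "is this inches?" an exact test,
    # since Fraction normalises to lowest terms (1 in = 0.0254 m exactly):
    assert Fraction(254, 10000) == Fraction(127, 5000)

    # A float scale factor forces a tolerance question instead:
    def is_inches(scale, tol=1e-12):
        return abs(scale - 0.0254) <= tol * 0.0254  # which tol is right?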

Looking forward to seeing this implemented,
Konrad.

David Brooks noreply@forum.hdfgroup.org writes:

Having a standard way to store units in HDF5 is vital and it's great
to see this initiative! Even if not for the first version, may I
suggest that thought be given to extensibility so that researchers are
not deterred from storing units with their data simply because their
favourite units are not in a limited set of concrete units.

This argument is valid for storing the results of measurements. The unit
says something about the measurement that has been performed, and therefore
forcing a conversion to some standard unit implies a loss of
information.

Maybe it would be a good idea to have such a scheme as well, but
distinct from the computational one.

Konrad.

Hi Konrad,
Thanks for your in-depth comments!

Thomas, do you confirm that you are more focused on units and their conversion, rather than kind-of-quantity (KOQ)? It seems that KOQ encoding might be the target of a separate, additional specification; very often KOQ is already evident from context, but the measurement units may vary. Having a broadly recognized way of encoding just measurement units is already of huge value, and seemingly doesn't prohibit later introduction of a standard KOQ format.

Also :+1: that despite focusing on machine readability, you are using human-sensible strings ("s") instead of something robotic like "UO:0000010".

Best wishes,
Andrey Paramonov

Thanks all for the interesting discussion.

I'm not firmly against including some way to record kind-of-quantity along with the units. But it's not clear to me at the moment what the computer can do with that information. Maybe this is short-sighted, but my mental model is that even if the computer can track the units, a human has to know what the quantities mean.

This is the kind of problem I want to fix: imagine I'm generating HDF5 files which record durations in microseconds. You have some code that reads and uses that data. Now I decide to switch to nanoseconds for more precision. Without interoperable units, I somehow have to alert you that the scale has changed, and you have to understand the change, know which files use the new scale, and modify your code. Interoperable units allow well-written code to handle this smoothly. But even if the file describes the kind-of-quantity, I can't switch from durations to lengths without breaking what you are doing.
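
For example, a reader can convert by the declared scale instead of assuming one (a sketch using h5py, with the attribute names discussed in this thread - units, scale_numerator, scale_denominator - which are not final):

    import h5py
    from fractions import Fraction

    def read_durations_in_seconds(path, dataset):
        """Return durations in seconds, whether the file stores us or ns."""
        with h5py.File(path, "r") as f:
            d = f[dataset]
            units = d.attrs["units"]
            if isinstance(units, bytes):
                units = units.decode()
            assert units == "s"  # base unit is the second
            scale = Fraction(int(d.attrs["scale_numerator"]),
                             int(d.attrs["scale_denominator"]))
            return d[...] * float(scale)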

I'm also keen to keep the way of specifying units quite straightforward, at least for version 1.0. I work at a facility where data flows through a framework which can handle units, but is then recorded in HDF5 files without the units. And of course everyone is busy with 1000 other things. There's a much better chance of getting buy-in for a simple scheme which is easily added to what we already do, than a richer scheme which requires more time to understand. Even I find the word 'ontology' off-putting, and I discuss specifications in my spare time. :wink:

I've put the draft spec - unchanged for now - into a git repository at https://github.com/takluyver/hdf5-units . I'm not trying to imply it's any more final, but rather to make it ready to track changes. We can also use GitHub issues for more specific discussions, if people like. But let's keep the 'big picture' discussion here for now.

Turning to a couple of specific points:

I'm all for simplicity, but to clarify: would you indicate angular units in some other way? Or are you proposing that the specification shouldn't cover angular measurements? If the latter, is that a 'come back to it after 1.0' omission, an 'angles don't need units' omission, or 'this is so problematic we shouldn't try'?

This is indeed what I'm going for with the scale numerator & denominator attributes. Do you think we should say these have to be integers, so things like inches can only be expressed as rational fractions?

Coming from a dynamic language, it seems easy to define these attributes as integer or floating-point type - Python will handle either correctly with the same code. But for code in statically typed languages, I imagine this kind of flexibility makes life harder. So maybe it's better to be more restrictive about what they are. But I'm still inclined to leave the integer size flexible; some units may need 128-bit scaling factors, and I don't think it's helpful to require that they are always 128-bit.

Good point. I haven't used compound types so I didn't think about them. I think having all the corresponding attributes be vectors so there's a unit for each field would work. Nested compound types would either be flattened for the list, or just unsupported.
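
A sketch of that vector idea with h5py (field order fixed by the dtype; attribute names as above, unit strings hypothetical):

    import h5py
    import numpy as np

    particle = np.dtype([("mass", "f8"), ("x", "f8"), ("vx", "f8")])

    with h5py.File("particles.h5", "w") as f:
        d = f.create_dataset("particles", shape=(100,), dtype=particle)
        # One entry per field, in dtype field order:
        d.attrs["units"] = ["kg", "m", "m/s"]
        d.attrs["scale_numerator"] = [1, 1, 1]
        d.attrs["scale_denominator"] = [1, 1, 1]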

I agree, it probably won't always be clear, unfortunately. But I don't see any easy way to avoid that by picking better names, and I think scaling with numeric attributes has advantages over e.g. putting all the scaling information in the string. But perhaps my priorities aren't everyone's. What's your preferred option?

Good idea to put things on GitHub. That said, I am not sure how to contribute to the discussion there. Create an issue? Fork and submit a PR? I dunno.

The KOQ treatise was discouraging, but I guess informative. Fine, remove all sense of KOQ. It sure simplifies things. As I was pondering, it occurred to me that another useful thing to know about a quantity is whether it is intensive or extensive. So, having a notion of quantity does seem to send us down a rabbit hole. Too bad, though, because in my neck of the scientific software woods (visualization tools), having the ability to exchange KOQ-like knowledge between producers and consumers really improves data self-description.

But, some other thoughts about the proposed specification come to mind.

  1. Is there any intention that the string representation of the units informs data-processing tools how the units should be properly "rendered" (case, sub/superscript, italics, abbreviations vs. long form)?
  2. Why multiple attributes? Why not encode all necessary information into a string of some kind (a JSON string, maybe)? If rationals will be used for scales, that simplifies string encoding a lot (see the sketch after this list).
  3. A reference to the current version of the standard in use was suggested as an attribute on the root. Sounds like a good idea. But if you're going to do that – and again, I think it is a good idea – why not simply define and catalog all units in an extensible database (maybe just a markdown or rst file on GitHub, or a Google spreadsheet) and then use the id of a particular unit within that database as the specification of units within an HDF5 file? That would mean scaling and other attributes of units can be handled by extending that database. It would also mean the HDF5 coding involved to attach units to a dataset can be made maximally simple, even simpler than one string and two integer attributes.
  4. There was a question about how to handle datasets of compound types where each member may require different units. For example, suppose you have a dataset of particles where each particle has mass, position, velocity and acceleration. If an HDF5 application is already creating compound types for such datasets, it can do so for the "unit" attribute of such datasets as well. So, for example, the unit attribute for the hypothetical particle dataset would be a compound type having 4 members, each with a name matching an associated member of the compound dataset.
  5. Is there any expectation/intention that, maybe someday, this would support unit conversions, potentially as HDF5 filters? If so, do we want to develop some example use cases for proof-of-concept purposes?
  6. Along the same lines as 5, it might be useful to have working HDF5 file examples that demonstrate and document the specification, kept up to date as it evolves.
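
On point 2, a single-string encoding might look like this (hypothetical, shown only to weigh against separate attributes; with integer rationals the JSON stays trivial):

    import json

    unit_attr = json.dumps({"units": "m",
                            "scale_numerator": 1000,
                            "scale_denominator": 1})
    # One attribute to read back, at the cost of a JSON parser everywhere:
    print(json.loads(unit_attr)["scale_numerator"])  # 1000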

Thomas Kluyver noreply@forum.hdfgroup.org writes:

Turning to a couple of specific points:

I'm all for simplicity, but to clarify: would you indicate angular units in some other way? Or are you proposing that the specification shouldn't cover angular measurements? If the latter, is that a 'come back to it after 1.0' omission, an 'angles don't need units' omission, or 'this is so problematic we shouldn't try'?

The simplest solution for angles is: rad everywhere. Don't even talk
about it. Angles are numbers, period. Whether this is sufficient may
vary between application domains. I have seen a single program using
degrees in the 30 years that I have been doing computational science,
but perhaps degrees are more widespread elsewhere.

If you do need to support degrees, the most coherent way is to treat
them as numbers with a scale factor. Which shouldn't be a problem in the
scheme you propose, except that you need to put π into the scale factor.
A simple though somewhat hacky solution is to introduce π as a unit.
A perhaps cleaner way is to allow π as a number literal, but then the
encoding of the scale factors becomes messy.

Another idea: remove the distinction between units and numbers. Two
string-valued fields "scale_numerator" and "scale_denominator" contain
space-separated factors, each of which is a unit name, an integer, or
π. That also removes the ambiguity of interpreting the scale factor. The
encoded quantity is the numerical value times the factors in the
numerator divided by the factors in the denominator. Such a scheme would
even accommodate pseudo-units such as bytes without any problem.
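
A sketch of evaluating such factor lists in Python (the unit-symbol set is illustrative; "pi" is accepted as a plain-ASCII spelling of π):

    import math

    UNIT_SYMBOLS = {"m", "kg", "s", "A", "K", "mol", "byte"}  # illustrative

    def eval_factors(spec):
        """Split space-separated factors into (number, unit symbols)."""
        number, units = 1.0, []
        for token in spec.split():
            if token in ("pi", "π"):
                number *= math.pi
            elif token.lstrip("-").isdigit():
                number *= int(token)
            elif token in UNIT_SYMBOLS:
                units.append(token)
            else:
                raise ValueError("unknown factor: " + token)
        return number, units

    # Degrees: scale_numerator = "pi", scale_denominator = "180"
    num, _ = eval_factors("pi")
    den, _ = eval_factors("180")
    print(45 * num / den)  # 45 degrees = 0.7853981... rad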

This is indeed what I'm going for with the scale numerator &
denominator attributes. Do you think we should say these have to be
integers, so things like inches can only be expressed as rational
fractions?

Yes. Floats are a pain that needs to be justified, either by performance
or by a very large range of potential values. I don't see either need in
this case.

Coming from a dynamic language, it seems easy to define these
attributes as integer or floating-point type - Python will handle
either correctly with the same code. But for code in statically typed
languages, I imagine this kind of flexibility makes life harder.

Not to mention that float syntax is not 100% uniform across languages.
For example, some allow "2.e2", whereas others don't.

Konrad.

I'm in favour of having units for angles. Saying that angles should always be stored as radians is the same as saying that distances should always be stored as meters or times as seconds.

Two more issues.

  1. I was thinking that groups can be used for data hierarchies where attributes define some metadata of the group, so it should be possible to define units for attributes as well. It means there can be many unit attributes. Maybe the unit attribute name should contain the name of the data attribute or data field it describes, e.g. unit_data_name. Fields in compound data types could be named like unit_compound.field.

  2. Can it be that a vector of values is used where the units are different? E.g., longitude, latitude, altitude to define a position on Earth as deg, deg, m. Or a 2-dim array [n,3] to define a series of such positions. Probably this should not be allowed, but it should be made clear (a possible workaround is sketched after this list).
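
If mixed units in one array end up disallowed, one workaround (a sketch, not something the draft prescribes) is one dataset per component, each with its own scalar unit:

    import h5py
    import numpy as np

    with h5py.File("positions.h5", "w") as f:
        for name, unit in (("longitude", "deg"), ("latitude", "deg"),
                           ("altitude", "m")):
            d = f.create_dataset(name, data=np.zeros(10))
            d.attrs["units"] = unit  # one unambiguous unit per dataset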

Hi Thomas,

Are you aware that there is a units standard in the NeXus format?

http://download.nexusformat.org/doc/html/design.html#attributes

This is based on Unidata's UDUNITS package.

Regards,

Peter

Hi Thomas,

in openPMD, we try to avoid string parsing and identification for units. Instead, we store the dimensionality (ISQ) of a data set and an additional numeric conversion factor to an absolute unit system:

-> unitDimension (powers of the base quantities) and unitSI (numeric conversion factor)
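
In HDF5 terms the two attributes look roughly like this (h5py sketch; the dimension order follows the openPMD convention: length, mass, time, current, temperature, amount of substance, luminous intensity):

    import h5py
    import numpy as np

    with h5py.File("example.h5", "w") as f:
        d = f.create_dataset("velocity", data=np.array([1.0, 2.0, 3.0]))
        # Velocity = length^1 * time^-1:
        d.attrs["unitDimension"] = np.array([1., 0., -1., 0., 0., 0., 0.])
        # Multiply stored values by unitSI to get SI (here the data are km/s):
        d.attrs["unitSI"] = 1000.0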

A C++ and Python implementation can be seen in
https://github.com/openPMD/openPMD-api
https://openpmd-api.readthedocs.io/en/latest/usage/firstwrite.html

and openPMD-viewer:
https://github.com/openPMD/openPMD-viewer/

among others (project list: https://github.com/openPMD/openPMD-projects).

We also have an openPMD data reader implemented in yt-project.

Best,
Axel

Disallowing highly convenient storage modalities merely for the sake of simplifying the units specification would more than likely result in no one using the proposed scheme.

Yes and no. The common point is that you have to decide where
flexibility matters and where it doesn't. Both too little and too much
flexibility make a system unpleasant to work with. The best compromise
may not be the same for every application domain - that's always a
difficulty with broad-range technology such as HDF5.

However, angles are also special (different from time and length) in
that there is one "natural" unit, rad, which stems from the fact that
angles can be defined as ratios between two lengths. In computations (as
opposed to measurements), degrees are always a pain to use.

Konrad.

However, angles are also special (different from time and length) in
that there is one "natural" unit, rad, which stems from the fact that
angles can be defined as ratios between two lengths. In computations (as
opposed to measurements), degrees are always a pain to use.

A major usage of HDF5 in my area (X-ray user facilities) is storing experimental measurements, and in many cases the beamline staff and users prefer to think/work in degrees. Saying we should not have the ability to support that reality because it is not useful on the simulation side of things seems short-sighted.

Tom

OK, I'll update the spec to scale with integers only.

I'm inclined not to handle this, at least for the first version of the specification. It sounds like it could get complex quickly (how do you match up dimensions in the units description with dimensions in the dataset?). Good call on making it clear, though - I'll add a note.

I did have a brief look at this.

I think the priorities are different for designing unit-handling software (such as UDUNITS, unyt or casacore), where you want maximum expressive power, lots of available units, and the ability to add new units. To design an interchange format, you want the minimum flexibility that is practical, so different tools can implement it.

When the NeXus documentation says that units must be compatible with UDUNITS, rather than specifying the format itself, it means that to parse units from NeXus you must use UDUNITS, or reimplement it. And since it doesn't specify a version of UDUNITS, what is a valid NeXus file might change when a new version comes out.

Thanks, I hadn't seen that effort. I like the overall idea, which is pretty similar to what I'm suggesting, but I think having a string is valuable to make it human-inspectable without knowing the specification. The dimensions don't have a natural order, as far as I'm aware, so (0., 1., -2., -1., 0., 0., 0.) only has meaning with the specification.

I'd agree with this. Radians might be mathematically natural, but I still had to go on Wikipedia to double-check my memory of them for this discussion. Also, if you're expecting angles at common fractions of a full circle, it's much easier to see that in degrees. E.g. in crystallography, many unit cells involve angles of 90 or 120 degrees - it's much easier to see that 89.992 is close to 90 than that 1.570656 is close to π/2.

Is there a practical downside to allowing rad and sr as symbols, even if they're categorically different to all the other base units, which are ultimately arbitrary quantities? I don't mind compromising theoretical purity a bit if it helps deal with messy human reality. But if this has real potential to cause confusion or difficulty in the future, maybe we need to elaborate the scheme to handle these better.

I'm still undecided on whether to have degrees as a separate symbol, or as a scale factor from radians. If we went with the latter, we could specify a small scale range around 2π/360 which tools may interpret as meaning degrees (when applied to radians), to avoid losing precision doing unnecessary conversions.
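
Such a test might look like this (a sketch; the tolerance window is an arbitrary choice here):

    import math

    def looks_like_degrees(numerator, denominator, rel_tol=1e-6):
        """Does numerator/denominator approximate 2π/360 = π/180?

        Only meaningful when the units string says radians."""
        return math.isclose(numerator / denominator, math.pi / 180,
                            rel_tol=rel_tol)

    print(looks_like_degrees(17453293, 10**9))  # True: π/180 ≈ 0.0174532925...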

One possible evolution of the idea is to replace the units attribute with quantity, which would be one of a specified (longer) list of words such as length, angle or voltage, and rename the scaling attributes to be explicitly scaling relative to the SI unit (inspired by openPMD). So kHz might be:

quantity="frequency"
units_si_scale_numerator=1000

On the plus side, this removes any need to parse a string - you just check it against a finite set of specific values. And it makes it even more explicit that the base units are SI.

On the other hand, it makes it less clear what the actual units are - for instance, not everyone will know that the kilogram is the SI base unit of mass. And it means more complexity if we do ever want to allow non-SI unit symbols (e.g. radians and degrees, or kelvin and celsius).

I currently think the drawbacks of this variation outweigh the benefits, but I'd be interested to know what other people think.

peter.chang:

Are you aware that there is a units standard in the NeXus format?

I did have a brief look at this.

I think the priorities are different for designing unit-handling software (such as UDUNITS, unyt or casacore), where you want maximum expressive power, lots of available units, and the ability to add new units. To design an interchange format, you want the minimum flexibility that is practical, so different tools can implement it.

When the NeXus documentation says that units must be compatible with UDUNITS, rather than specifying the format itself, it means that to parse units from NeXus you must use UDUNITS, or reimplement it. And since it doesn't specify a version of UDUNITS, what is a valid NeXus file might change when a new version comes out.

I think the UDUNITS syntax is reasonably stable and minimal (https://www.unidata.ucar.edu/software/udunits/udunits-current/doc/udunits/udunits2lib.html#Syntax) and on the Java side, we use https://unitsofmeasurement.github.io/ to parse and deal with units.

Peter

Hi Thomas!

A couple of additional thoughts:

  1. Offset metadata is a useful way to apply a constant correction (e.g. detector lag) to an already recorded dataset. Having offset as part of the specification is not only useful to represent Kelvin, but also avoids the ambiguity between a + k*x and k*(a + x) when offset and unit scaling are applied simultaneously (see the sketch after this list).
    On the other hand, it's likely impossible to cover all forms of calibration. Maybe offset is just not special enough; if so, the specification should mention the implied/recommended order of measurement-unit scaling vs. calibration.

  2. Your specification covers time-duration measurements, but not measurements at moments in time: daily temperature measurements, stock prices and the like. Many agree that ISO 8601 is the most robust way to record date-time data, rather than an offset from a zero point (epoch).
    What about supporting ISO 8601 date-time series?
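
For concreteness, the ordering ambiguity in point 1 (a sketch; the Fahrenheit constants are just the familiar textbook values):

    raw = 25.0                  # stored value, in degrees Celsius
    scale, offset = 1.0, 273.15

    a = offset + scale * raw    # offset + k*x: 298.15 K, correct
    b = scale * (offset + raw)  # k*(offset + x): also 298.15, only because scale == 1

    # With Fahrenheit (scale 5/9) the two orders disagree:
    f = 77.0
    print(5 / 9 * (f - 32) + 273.15)  # 298.15 K, the right answer
    print(5 / 9 * f - 32 + 273.15)    # 283.93 K, wrong order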

Thank you for your effort,
Andrey Paramonov

If this falls out of the specification anyway, that's fine, but I'm not sure it's a reason to specify offsets as part of a units specification. If you're confident in your correction you could rewrite the data. Or you could have a correction attribute without it being part of any standard. Incorrect data still has units.

I think that's a separate question, and I don't intend to include it in this. There are definitely use cases for numeric timestamps (duration since an epoch) as well - they take less space and are more efficient in many kinds of calculation.

Hi Thomas!

thomas1 writes:

paramon:

Offset metadata is a useful way to apply a constant correction (e.g.
detector lag) to an already recorded dataset.

If this falls out of the specification anyway, that's fine, but I'm not
sure it's a reason to specify offsets as part of a units specification.
If you're confident in your correction you could rewrite the data. Or
you could have a correction attribute without it being part of any
standard. Incorrect data still has units.

My point was that if scaling and offset belong to different
specifications, it may confuse users: is it scale*(offset + x) or
offset + scale*x?

Please note that the question doesn't naturally arise if it's allowed to
specify units such as eV (electron-volts) directly instead of scaling
from SI units. In the case of e.g. eV, it's unambiguous that the offset,
if specified, is also in eV.

What about supporting ISO 8601 date-time series?

I think that's a separate question, and I don't intend to include it in
this.

Is it correct that timestamp datasets should then have no units metadata?

Best wishes,
Andrey Paramonov