Machine-readable units format?


#1

Has there been any effort to standardise a format for units information stored in HDF5? E.g. if I have a dataset of numbers in metres per second, is there a convention for how to record that in the file?

Modules for computing with units abound in Python (Pint is one popular one I’m aware of), and it would be good to be able to store and retrieve these in a standard, interoperable format. This wouldn’t necessarily need any new APIs, it could just be conventions and specially named attributes.

I see that the introductory documentation has an example with a string attribute “Meters per second”, but this obviously isn’t easily machine readable. Another thread here suggests defining a datatype for each unit, but this doesn’t compose well (e.g. if you’ve defined metres and seconds, there’s no obvious way to combine them into metres per second).


#2

From a C++ perspective, which isn’t really portable to any other language, if you wish to compute with units, look at boost::units. This doesn’t solve the save/restore issues though.


#3

Thanks Ray; it’s the save/restore bit that I’m interested in, however.


#4

Not that I know of.
While I understand your point about the importance of reading and writing data consistently, this task requires sustained cooperative work. Let’s look at another, no less important unit: date-time. I’ve been studying, with Gerd Heber, Howard Hinnant’s excellent implementation for modern C++.

HDF5 provides a flexible framework for types and storage, but it seems that a high-level ‘standard’ is required to interpret the data at object level. It doesn’t help that the CAPI is already complex, as it facilitates fast binary block/stream access on a wide variety of systems. Adding a detailed implementation of SI units would only increase this complexity.
For example, when you save data in one measurement unit and the client requests it in a different unit, you trigger an implicit conversion that decreases throughput without the user knowing.

OTOH it would be nice to have a layer on top of the CAPI for SI units, but then what about the imperial system, which is widely used in the USA? Python is a popular choice in data science, but R has a very long history and also a large user base. Julia is a relatively new player, yet gaining popularity.

As for C++, I am the person to blame for all mistakes in the design of H5CPP, a fast, easy-to-use persistence layer for modern C++. If anyone is interested in working on a Python/R/Julia/CAPI implementation of this idea, I am open to matching it on the C++ side.

best, steven


#5

I don’t think it’s necessary to add a lot of extra complexity, or to hurt performance by doing implicit conversion when the data is read.

It’s already possible to attach human-readable units to a dataset as an attribute. I’m simply thinking of defining a set of conventions for this information, strict enough that it could be reliably understood by systems like Pint or boost::units. I’d leave actually using that information up to the application code.

If nobody has done this already, would people be interested if I put together a draft set of ideas for discussion?


#6

I am interested in looking at it, and in cooperating on a good, accepted, simple solution. However, I can’t speak for others. What you are looking at is a long acceptance procedure for a nice-to-have component, a unified way to tag datasets, with a possibility of rejection.

My interest is to handle date-time: with all the necessary details to address the issue cross platform with ease and performance.

Do you think this is possible to do without the boost library? AFAIK boost has its own ways of doing things, and is far from lean.


#7

Hi Thomas!

It sounds like you are looking for something like


http://www.ontobee.org/ontology/UO
or
http://www.qudt.org/release2/qudt-catalog.html

For example, mzML makes use of UO vocabulary. And there is mzML over HDF5, which yields kinda positive answer to your initial sentence :wink:

Best wishes,
Andrey Paramonov


#8

From my perspective, date-times are a similar but separate challenge. I’d be interested to figure something out for those as well, but they’re not my focus at present. I see in the docs that there is/was an H5T_TIME class, but it’s not supported and not guaranteed to be portable across platforms. I imagine it would be instructive to figure out what went wrong with that first.

I’m not planning to implement anything at first, just define how to store the information. It would certainly be possible to read and write the units without involving boost. How easy it would be to use them for calculations in C++ without boost, I have no idea, but hopefully the concepts will be compatible with any unit libraries out there.

Thanks for the links, I’ll have a read.


#9

The astronomy-based casacore library (github.com/casacore/casacore) has been around for a long time and has supported units and measures (i.e. value, unit and coordinate frame) since its beginning. Conversions are fully supported.
Where applicable, the data are stored with units (as a standardised string) and possibly with measure info.

It has an SQL-like query language supporting units (e.g., select from table where RA > 45 deg).

Cheers,

Ger


#10

I’ve put together some ideas. I’m going to dump it into this post for now, so this will be a long message, but if people are interested, I’ll put it in a git repository so we can work further on it.


Draft HDF5 units specification

This is not an official standard, and is not endorsed by The HDF Group.

Goals

Data stored in HDF5 files often refers to measurements of physical quantities, such as distance or energy. Such measurements are only meaningful if their units are known, and HDF5 attributes provide an obvious place to store units associated with data.

Units are often seen as something for humans to understand and keep track of, but there are libraries for multiple programming languages to handle quantities with units, such as Boost.Units for C++ and Pint for Python. This specification aims to define a format to record units in HDF5 files, so they can be passed into such libraries without a manual step.

The proposed format is meant to be inspectable by humans with no special software tools, but it doesn’t store units in a particularly user-friendly presentable format. Software that presents units in a user interface will most likely want to process them into a more readable form - some suggestions are given below.

This way of specifying units is loosely inspired by the format for specifying units in the casacore library for astronomy, but it aims to reduce the number of alternative ways to specify the same units. Among other things, this means that derived units such as newtons are not represented directly, but rather as a combination of other units like kilograms, metres and seconds. Likewise, metric prefixes are folded into the scaling system.

I haven’t tried to accommodate every imaginable unit. The focus is on consistently expressing many common units. However, the fractional scaling system can accurately represent even units like inches and ounces, now that these are defined in terms of SI units.

Attributes

Information about units is stored in three attributes, units, units_scale_numerator and units_scale_denominator. These attributes may be attached to any numeric dataset (not to e.g. strings).

units is a variable-length UTF-8 string, containing a space-separated list of fields. Each field consists of a unit symbol and an optional integer power (one or more decimal digits, optionally preceded by an ASCII hyphen-minus to negate it). Omitting the power part is equivalent to it being 1.

The valid unit symbols are the seven SI base units, plus two dimensionless units (radians and steradians) whose meaning can’t be constructed from other units:

  • m (metre)
  • kg (kilogram)
  • s (second)
  • A (ampere)
  • K (kelvin)
  • mol (mole)
  • cd (candela)
  • rad (radian)
  • sr (steradian)

Unit symbols are case-sensitive, so M is not valid, for instance. They always use symbols, which are standardised, rather than names, which may vary in different languages.

Thus measurements in joules would be annotated with units 'kg m2 s-2'.
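
To make the grammar above concrete, here is a minimal parsing sketch using only the Python standard library. The function name parse_units is hypothetical, and two choices are my own assumptions rather than part of the draft: repeated symbols have their powers summed, and an empty string is rejected.

```python
import re

# The nine symbols allowed by the draft above.
VALID_SYMBOLS = {"m", "kg", "s", "A", "K", "mol", "cd", "rad", "sr"}

# A field is a symbol followed by an optional integer power, e.g. "m2" or "s-2".
FIELD_RE = re.compile(r"([A-Za-z]+)(-?[0-9]+)?")


def parse_units(text):
    """Parse a draft-spec units string into a {symbol: power} dict."""
    powers = {}
    for field in text.split(" "):
        match = FIELD_RE.fullmatch(field)
        if match is None:
            raise ValueError(f"malformed field: {field!r}")
        symbol, power = match.group(1), match.group(2)
        if symbol not in VALID_SYMBOLS:
            raise ValueError(f"unknown unit symbol: {symbol!r}")
        # An omitted power means 1; summing repeated symbols is an assumption.
        powers[symbol] = powers.get(symbol, 0) + (int(power) if power else 1)
    return powers


joule = parse_units("kg m2 s-2")
print(joule)  # {'kg': 1, 'm': 2, 's': -2}
```

Strings such as 'M', 'km' or 'm/s' raise ValueError here, which matches the intent of keeping the grammar as small as possible.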

A numeric dataset may also have one or both of the attributes units_scale_numerator and units_scale_denominator. Either must be a floating point or integer number. The numbers in the dataset are multiplied by units_scale_numerator and divided by units_scale_denominator to correspond to the specified units. If either attribute is not present, it is taken to be 1.

Thus, measurements in kilometres may be expressed with units='m' and units_scale_numerator=1000.
Measurements in millimetres may be similarly expressed with units_scale_denominator=1000. Millimetres could also use units_scale_numerator=0.001, but the decimal value 0.001 cannot be stored exactly as a binary floating point number, so using the denominator is preferred.

Likewise, customary units are described using the scale system. For example, measurements in inches (25.4 mm) could be recorded with units='m', units_scale_numerator=254 and units_scale_denominator=10000.
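
As a rough illustration of the scaling rule, the sketch below combines the two attributes into one exact factor using Python's fractions module. The helper name unit_scale is hypothetical, and a plain dict stands in for a dataset's attributes, so no HDF5 library is involved.

```python
from fractions import Fraction


def unit_scale(attrs):
    """Return the exact factor that converts stored numbers to the base units.

    `attrs` is any mapping of attribute names to values; a plain dict stands
    in here for the attributes of an HDF5 dataset (an assumption).
    """
    numerator = attrs.get("units_scale_numerator", 1)      # missing means 1
    denominator = attrs.get("units_scale_denominator", 1)  # missing means 1
    return Fraction(numerator) / Fraction(denominator)


inch = {"units": "m", "units_scale_numerator": 254, "units_scale_denominator": 10000}
millimetre = {"units": "m", "units_scale_denominator": 1000}

print(unit_scale(inch))            # 127/5000, i.e. exactly 0.0254 m
print(unit_scale(millimetre) * 3)  # 3/1000

# 0.001 as a binary float is not exactly one thousandth:
print(Fraction(0.001) == Fraction(1, 1000))  # False
```

The last line is the reason the draft prefers units_scale_denominator=1000 over units_scale_numerator=0.001.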

Advertising compliance

The units attribute is widely used without this specification. Data following this scheme can specify the attribute units_scheme="https://url-to-be-determined#1.0" to declare itself compliant.

This attribute can be used on a dataset with units, or on a group. Using it on a group declares that all datasets accessible from that group comply with this specification, including those under subgroups and those accessed through soft and external links. Using it on the root group of a file marks the entire file compliant.

Marking a group as compliant does not necessarily mean that all (or even any) datasets have units. It means that no datasets use the attributes specified here in conflicting ways.

Declaring compliance this way is encouraged but not required. Software reading data of known provenance may expect compliance even without the marker.

The final part is a version number. This is version 1.0 of the specification. (N.B. not final - may still change without increasing version number) Changes such as adding optional attributes or new unit symbols will increase the minor part of the version number, so any data valid under version 1.0 would also be valid in 1.1. More dramatic changes would increase the major part.
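
A reader might check the marker along these lines. This is only a sketch: the function names are hypothetical, the URL is the placeholder from the text above, and the rule "same major version, minor no newer than the reader" is my reading of the versioning paragraph rather than something the draft states outright.

```python
def scheme_version(units_scheme):
    """Split a units_scheme value into (base_url, major, minor)."""
    base_url, sep, version = units_scheme.rpartition("#")
    if not sep:
        raise ValueError("units_scheme has no '#version' part")
    major, minor = (int(part) for part in version.split("."))
    return base_url, major, minor


def reader_understands(units_scheme, reader_major=1, reader_minor=0):
    """True if a reader for reader_major.reader_minor fully understands the data.

    Assumption: minor releases only add things, so a reader copes with any
    data from the same major version whose minor version is not newer.
    """
    _, major, minor = scheme_version(units_scheme)
    return major == reader_major and minor <= reader_minor


print(scheme_version("https://url-to-be-determined#1.0"))
# ('https://url-to-be-determined', 1, 0)
print(reader_understands("https://url-to-be-determined#1.0", 1, 1))  # True
print(reader_understands("https://url-to-be-determined#2.0", 1, 1))  # False
```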

Display recommendations

This scheme doesn’t represent most quantities in an idiomatic way for human understanding. Software which shows units to human users will probably want to interpret the units and present a more human-readable form. In particular:

  • Units scaled by powers of 1000 might be represented using metric prefixes, such as km or ns. Masses need special handling to avoid things like ‘kkg’.
  • Other specific units may be recognised by scaling, such as hours or litres.
  • Where units have more familiar names than a combination of the base units, the software may recognise and translate them, such as presenting V (volts) instead of ‘kg m2 s-3 A-1’.

Software designed for a particular domain may recognise additional units familiar within that domain, such as angstroms.
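
The first recommendation could be sketched as below for the simplest case, a single base unit with power 1. The function name display_unit and the short prefix table are hypothetical; a real viewer would need to cover powers, compound units and named derived units as well.

```python
from fractions import Fraction

# A few metric prefixes, keyed by their exact scale factor.
PREFIXES = {
    Fraction(10**9): "G", Fraction(10**6): "M", Fraction(1000): "k",
    Fraction(1): "",
    Fraction(1, 1000): "m", Fraction(1, 10**6): "µ", Fraction(1, 10**9): "n",
}


def display_unit(symbol, scale=Fraction(1)):
    """Render one base unit with an exact scale as a prefixed symbol.

    Returns None when no prefix matches, so the caller can fall back to
    showing the raw scale and units string.
    """
    if symbol == "kg":
        # Prefixes attach to the gram, so express the scale relative to grams.
        symbol, scale = "g", scale * 1000
    prefix = PREFIXES.get(scale)
    return None if prefix is None else prefix + symbol


print(display_unit("m", Fraction(1000)))        # km
print(display_unit("s", Fraction(1, 10**9)))    # ns
print(display_unit("kg"))                       # kg
print(display_unit("kg", Fraction(1, 10**6)))   # mg
print(display_unit("m", Fraction(254, 10000)))  # None
```

Converting kilograms to grams before the lookup is what avoids output like ‘kkg’.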

Possible additions

  • Temperature in Celsius & Fahrenheit can’t be expressed by scaling Kelvin. We could add these as special exceptions, or allow specifying an offset from 0.
  • If we allow specifying a zero-point for time (an epoch), durations could be used to represent timestamps (e.g. nanoseconds since the Unix epoch).
  • The relationship between radians and degrees involves scaling by an irrational number (a fraction of pi), so it cannot be precisely expressed in the fractional scheme defined above. If this is a concern, degrees could be added as a separate unit.
  • What about units such as bytes which refer to discrete countable things? Are these units at all? Arguably bytes per metre makes sense, but so does dots per inch, and we don’t consider the dot a unit.
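
To show what the offset idea in the first bullet might look like, here is a sketch with a hypothetical offset parameter (standing in for some future attribute, not part of the draft). It assumes the offset is expressed in the base unit and added after scaling; the opposite order would be an equally valid design.

```python
from fractions import Fraction


def to_base_units(value, numerator=1, denominator=1, offset=0):
    """Scale a stored value, then add an offset given in the base unit.

    `offset` is hypothetical: the draft only mentions it as a possible addition.
    """
    return Fraction(value) * Fraction(numerator) / Fraction(denominator) + Fraction(offset)


# Celsius: K = C + 273.15
celsius = dict(offset=Fraction(27315, 100))
# Fahrenheit: K = (F + 459.67) * 5/9, so the offset in kelvin is 459.67 * 5/9
fahrenheit = dict(numerator=5, denominator=9, offset=Fraction(45967, 180))

print(float(to_base_units(100, **celsius)))    # 373.15
print(float(to_base_units(32, **fahrenheit)))  # 273.15
```

Both conversions stay exact in this scheme, unlike degrees to radians, where the factor pi/180 has no exact fractional form.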

Rejected options

  • Defining symbols for customary units. As an example of the issues, different countries disagree substantially on the size of a ‘pint’. Many widely used customary units, such as inches, pints or ounces, are now formally defined in terms of SI units, so they can be precisely represented with the fractional scaling mechanism.
  • Allowing a slash to invert units, e.g. 'm/s' instead of 'm s-1'. This would make parsing more complicated.
  • Allowing metric prefixes, e.g. km. These can be precisely represented by fractional scaling. Rejecting them excludes oddities like kkg, or special handling for kilograms. And if this scheme is ever used for data sizes, it avoids the confusion over whether a kilobyte is 1000 or 1024 bytes.
  • I considered an extra attribute for a human-readable unit description, but there’s significant potential for confusion if the human-readable units and the machine-readable units don’t match.

#11

Thanks for this post. I like it. Seems like a potentially useful thing to specify for HDF5 datasets.

There are aspects of this that remind me of a similar effort, but that effort had the two-fold goal of standardizing how units are declared (like you propose here) as well as providing a unit conversion library. It also decoupled notions of “quantities” (time, distance, electric charge, mass, etc.) from measures of those quantities. It also maintains separate powers of the fundamental quantities in both numerator and denominator for a given instance of a unit (I honestly cannot tell from your description whether yours is similar or different in this regard).

Two other comments/questions…why are rad/sr called out? This ref just treats them as dimensionless ratios which are yet distinguishable due to the powers of distance involved. In that same ref, the answer to your question about units that count things, like bytes, would be an “amount of substance” (bytes in this case), though instead of using a unit of measure of Avogadro’s number of them (e.g. a mole), something like Mb or Kb would be more appropriate. That is a good example of why decoupling the notion of the quantity itself (amount of something) from the unit of measure (mole or Kb) is potentially useful.


#12

Using “kilogram” is a mistake. The base unit is the “gram” (g) and then the standard SI prefixes work correctly. “kg” is a Kilogram.


#13

Thanks!

Aha, I see your name on this work, so I’m guessing you’ve thought much more about this whole area than I have. I wasn’t planning to incorporate separate numerator and denominator powers for the same basic unit, but that may well just represent my ignorance - are there things that can only be represented this way? Does this relate to representing radians as m/m?

This bit certainly felt like I was missing something. Naively, angles feel like something that need units - the same angle can be expressed as 90 degrees or 1/2 pi radians, so the number needs an associated unit to have meaning. But those units seemingly can’t be expressed in terms of the base units, because the dimensions cancel out. So I just added them as extra symbols. I’m still not sure what’s the right way to tackle these.

Thanks, this is interesting. Is there a standard set of definitions of qualities for measurements?

From my perspective, bytes and moles seem like they would be different qualities as well. They’re both based on counting, but bytes seem like “amount of data” rather than “amount of substance”. A megabyte of carbon doesn’t make sense (even if 10^6 atoms weren’t an infinitesimal amount).

I’ve seen that various libraries for working with units let the user define units; it looks like in SAF you have to define your own base units, if I’m reading the documentation correctly. This is a bit different for what I’m trying to do: for meaningful interoperation and for simplicity, I want to define a limited set of concrete units and not let people define custom units.

SI defines the kilogram as the base unit. People much smarter than I chose this, and I’m definitely not going to try to improve on the SI system. :wink: It’s a historical quirk that the base name applies to 1/1000 of the base unit.


#14

I’m the primary author of the unyt package, which uses HDF5 to write data with unit metadata to disk, so this is very interesting to me. I haven’t read through this in detail but just wanted to say that this one point below is very strange to me.

It’s really not that hard to write a parser that deals with that. Here’s the relevant bit of the unyt source code that does that:


#15

Thanks Nathan, I’ll have a look at what unyt is doing.

I can imagine it’s easy enough to parse metric prefixes. But it’s even easier to leave them out, and it avoids various possible ambiguities, if other units are added later (e.g. T for tera- or for tesla). I’m trying to keep the allowed grammar as simple as possible - it’s a machine-readable format first, not a user interface.

The fractional scaling system can do everything the metric prefixes can do, as well as describing units like inches, without putting any extra complexity into the string.


#16

I wasn’t planning to incorporate separate numerator and denominator powers for the same basic unit, but that may well just represent my ignorance - are there things that can only be represented this way? Does this relate to representing radians as m/m?

Well, as I understood things (BTW, I didn’t write the Quantities and Units part of that interface I ref’d…I just worked with the guy who did ;), there are the 7 primal quantities [time, distance, mass, charge, thermodynamic temperature, luminous intensity and “amount of stuff”]. All other quantities of interest are derived from these 7, such that any quantity of interest can be represented as two 7-tuples (one for the numerator and one for the denominator) of powers of the primals. So, for example, the quantity of “acceleration” would be represented by [0,1,0,0,0,0,0]/[2,0,0,0,0,0,0], indicating distance of power 1 in the numerator and time of power 2 in the denominator. Whereas, corresponding units of measure for time might be “fortnight” (e.g. a period of time of two weeks) and for distance might be “footlong” (e.g. 12 inches), allowing one to define, for example, the gravitational acceleration of Earth as 47x10^12 “footlongs fortnight-2” using your schema. But both the [0,1,0,0,0,0,0]/[2,0,0,0,0,0,0] and the “footlongs fortnight-2” are used to encode the complete specification of the units. If we agree NOT to cancel common terms in the numerator and denominator, we can then distinguish between a planar angle ([0,1,0,0,0,0,0]/[0,1,0,0,0,0,0]) and a solid angle ([0,2,0,0,0,0,0]/[0,2,0,0,0,0,0]). And you can then have those quantities measured in terms of radians or degrees (planar angle) or steradians or degrees (solid angle).
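
A small sketch of the two-tuple idea, for anyone who wants to play with it. The class and method names are hypothetical and are not taken from the interface mentioned above; the primal order follows the list in this post, and the last lines simply check the gravity figure using standard gravity, 9.80665 m/s^2.

```python
from typing import NamedTuple, Tuple

# Order of the seven primal quantities, as listed in the post above.
PRIMALS = ("time", "distance", "mass", "charge", "temperature",
           "luminous intensity", "amount")

Powers = Tuple[int, int, int, int, int, int, int]


class Quantity(NamedTuple):
    numerator: Powers
    denominator: Powers

    def net_powers(self):
        """Cancel common terms, giving the conventional dimension vector."""
        return tuple(n - d for n, d in zip(self.numerator, self.denominator))


acceleration = Quantity((0, 1, 0, 0, 0, 0, 0), (2, 0, 0, 0, 0, 0, 0))
plane_angle = Quantity((0, 1, 0, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0))
solid_angle = Quantity((0, 2, 0, 0, 0, 0, 0), (0, 2, 0, 0, 0, 0, 0))

# Uncancelled, the two angles are distinct quantities...
print(plane_angle != solid_angle)                            # True
# ...but after cancelling they are both dimensionless.
print(plane_angle.net_powers() == solid_angle.net_powers())  # True

# Checking the gravity figure: 1 footlong = 0.3048 m, 1 fortnight = 14 days.
fortnight_s = 14 * 24 * 60 * 60
g_footlongs = 9.80665 / 0.3048 * fortnight_s ** 2
print(f"{g_footlongs:.2e}")  # roughly 4.7e13, i.e. about 47x10^12
```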

From my perspective, bytes and moles seem like they would be different qualities as well. They’re both based on counting, but bytes seem like “amount of data” rather than “amount of substance”. A megabyte of carbon doesn’t make sense (even if 10^6 atoms weren’t an infinitesimal amount).

Agreed. I think what does make sense though is separating the thing being quantified (e.g. number of things) from the units used to measure that quantity. You can have a couple or a few things. Or, you can have a dozen things or a baker’s dozen or a mole of things…etc. These are all expressions of the “amount” quantity primal. I think of a mole as just like a dozen. It’s just that a mole is a carefully constructed number of things that is convenient when talking about molecules or atoms.

Thanks, this is interesting. Is there a standard set of definitions of qualities for measurements?

Did you mean “quantities” here instead of “qualities”? If so, yes, I think the NIST ref (above) does a decent job of describing that.

I want to define a limited set of concrete units and not let people define custom units.

Ah, ok, I hadn’t understood that. Seems reasonable if you wanna restrict it to that case. The more general approach, of having to define a quantity in terms of the 7 primals first and then specifying the units for that quantity, provides a fully extensible approach which needn’t worry each time a new unit in the standard needs to get defined. In other words, the standard defines a basis for constructing any unit of measure. But, I agree, it is more complex too.


#17

Are we sure they are smarter than us? Why would they do this? Everything else in SI works with the SI prefixes, so why would they do this? There is a difference between what most people use as the base unit and what the base unit actually is. But, like you said, somebody else defined it the other way. Crazy.


#18

Allowing metric prefixes, e.g. km. These can be precisely represented by fractional scaling. Rejecting them excludes oddities like kkg, or special handling for kilograms. And if this scheme is ever used for data sizes, it avoids the confusion over whether a kilobyte is 1000 or 1024 bytes.

I haven’t gone over this in detail but I wanted to say that this is a very strange requirement. Writing a parser that handles this is not so bad.

I also want to mention that I have a unit library in python that supports HDF5 as an output format via h5py.

For what it’s worth, here’s the code that deals with parsing SI prefixes:


#19

I want to define a limited set of concrete units and not let people define custom units.

Ah, ok, I hadn’t understood that. Seems reasonable if you wanna restrict it to that case. The more general approach, of having to define a quantity in terms of the 7 primals first and then specifying the units for that quantity, provides a fully extensible approach which needn’t worry each time a new unit in the standard needs to get defined. In other words, the standard defines a basis for constructing any unit of measure. But, I agree, it is more complex too.

Having a standard way to store units in HDF5 is vital and it’s great to see this initiative! Even if not for the first version, may I suggest that thought be given to extensibility so that researchers are not deterred from storing units with their data simply because their favourite units are not in a limited set of concrete units.

An ontological way of doing this is the Unit-of-Measurement-Expressions ontology, something I made use of a few years ago when defining a meta-model for biosignal data (including using UOME descriptions with pint). Others have described different ways to define custom units. What is important is that custom units are specified in some way that allows the underlying definitions to be evaluated, so as to be able to answer the questions: “Does a valid conversion exist between these two unit definitions?” and “What is the scaling factor and offset for converting data values between these two definitions?”. Actual on-the-fly conversion of data streams, using vector scalar multiply/add, should be fast with modern CPUs.
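
For units that reduce to base-unit powers plus an exact scale, as in the draft earlier in this thread, both questions can be answered mechanically. The sketch below is only an illustration under that assumption: conversion_factor is a hypothetical name, each unit is passed as a {symbol: power} dict with a scale, and offsets are not handled.

```python
from fractions import Fraction


def conversion_factor(powers_a, scale_a, powers_b, scale_b):
    """Factor that converts values in unit A to unit B, or None.

    Each unit is a {symbol: power} dict of base units plus an exact scale
    relative to those base units. Offsets are not handled in this sketch.
    """
    def strip(powers):
        # Ignore zero powers so {"m": 1, "s": 0} equals {"m": 1}.
        return {symbol: p for symbol, p in powers.items() if p != 0}

    if strip(powers_a) != strip(powers_b):
        return None  # different dimensions: no valid conversion exists
    return Fraction(scale_a) / Fraction(scale_b)


inch = ({"m": 1}, Fraction(254, 10000))
millimetre = ({"m": 1}, Fraction(1, 1000))
second = ({"s": 1}, Fraction(1))

print(conversion_factor(*inch, *millimetre))  # 127/5, i.e. 25.4 mm per inch
print(conversion_factor(*inch, *second))      # None
```

Applying the resulting factor to a whole array is then a single multiply, which is the fast path described above.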


#20

What is important is that custom units are specified in some way that allows the underlying definitions to be evaluated, so as to be able to answer the questions: “Does a valid conversion exist between these two unit definitions?” and “What is the scaling factor and offset for converting data values between these two definitions?”. Actual on-the-fly conversion of data streams, using vector scalar multiply/add, should be fast with modern CPUs.

:+1: