On Thu, Oct 17, 2013 at 03:16:40PM +0200, Konrad Hinsen wrote:
Pierre de Buyl writes:
> So, to get back to Peter's message:
>
> I propose that we follow udunits grammar by restricting it similarly
> to Mosaic.
> For reference, Mosaic's definition is
> """
> The value of the units field is a text string in ASCII encoding. It
> contains a sequence of unit factors separated by a space. A unit
> factor is a unit symbol optionally followed by a non-zero integer
> which indicates the power to which this factor is taken.
> """
>
> I would remove the constants defined ("c" and "Nav"), however.
The current unit list is a first draft, to be revised before version
1.0 of Mosaic. You are completely right about "Nav", which is the same
as "mol" and thus redundant. However, "c" frequently occurs in derived
units, e.g. "cm-1 c" for frequency, which is heavily used in
spectroscopy.
In any case, this kind of constant would go in a module and not in the
base spec.
> We may want to add "a unit string must be parseable by udunits"?
The problem with that statement is that we don't control udunits. In
general, it's not a good idea to define a data format by the
capacities of a piece of software. It's fine to have such a comment as
a statement of intention, of course.
Ok.
Felix Höfling writes:
> I find udunits' grouping into SI-base units, SI-derived units etc.
> very reasonable. Let's keep it for H5MD rather than introducing a
> different subset.
That was my original idea for Mosaic, but I changed my mind for the
following reasons:
1) The point of having a restricted set of units is to permit error
checking. Allowing a unit that is more likely to be a typo than
a choice is ultimately of no benefit. A general-purpose library
such as udunits can't limit the allowed units, but a domain-specific
format such as Mosaic can.
2) The distinction between SI-base and SI-derived is logical for a
metrologist, but irrelevant for practical use. I don't expect
SI-base to be sufficient for much of molecular data, if only
because of the lack of energy units.
3) Fewer units means a reduced risk of errors if automatic conversion
is attempted (see below).
> Actually, whether a reader can "understand" a small or large set of
> units is mainly a matter of the database defining the units. Do I
> overlook something here? Why not copy the full list from udunits?
See 1) above.
Also, to get an idea of what's possible with udunits I had to play a bit.
Providing an explicit list seems simpler.
> BTW, a more advanced functionality that discriminates between
> "simple" and "advanced" readers is automatic conversion between
> units ...
Indeed, but conversion is a very tricky business. SI has two traps for
unit converters:
- Dimensionless units: rad, sr, and mol
Is pi dimensionless or measured in rad? Both make sense, and automatic
conversion needs to know which convention was used.
I am actually considering removing "rad" from the allowed units in
Mosaic, and making "deg" a dimensionless constant equal to pi/180.
That's much closer to the reality of unit use in computational
chemistry than the SI system.
- Dimensionally equal but incompatible units: 1/s, Hz, Bq
It's OK to convert Hz and Bq to 1/s, but not among each other.
Converting 1/s to Hz or Bq is in general not allowed. The problem
disappears if Hz and Bq are not allowed.
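
The second trap can be made concrete with a small sketch. It is only an
illustration of the rule stated above, not code from Mosaic or udunits: the
table ALLOWED_TARGETS and the function can_convert are hypothetical names,
and "s-1" is used as the spelling of 1/s.

```python
# Hypothetical illustration of "dimensionally equal but incompatible" units.
# Each source unit maps to the set of units it may be converted to.
# Hz and Bq may both be turned into s-1, but never into each other,
# and s-1 may not be turned back into Hz or Bq.
ALLOWED_TARGETS = {
    "Hz": {"Hz", "s-1"},
    "Bq": {"Bq", "s-1"},
    "s-1": {"s-1"},
}


def can_convert(source, target):
    """Return True if converting from source to target is permitted."""
    return target in ALLOWED_TARGETS.get(source, set())


if __name__ == "__main__":
    for pair in [("Hz", "s-1"), ("Hz", "Bq"), ("s-1", "Hz")]:
        print(pair, can_convert(*pair))
```

A pure dimensional analysis cannot express this asymmetry, since all three
units have the same dimension; a converter needs an explicit table of this
kind, or a unit list that does not contain Hz and Bq in the first place.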
Ok, so we need to settle on what can go into a unit.
(Most of this is copied from Mosaic, which means I should not forget to
add a license statement somewhere. BTW, Konrad, do you know if we can
include your CC-BY in our GPL "code"?)
"""
"unit" is a scalar attribute of type variable-length string. "unit"
consists of a sequence of unit factors separated by a space. A unit
factor is either a number (an integer or a decimal fraction) or a unit
symbol optionally followed by a non-zero integer which indicates the
power to which this factor is taken. A unit symbol may include an
SI-prefix factor.
Examples:
- "nm3" stands for cubic nanometers
- "nm ps-1" stands for nanometers per picosecond
- "60 s" stands for a minute
Each unit symbol may occur only once in the "unit" string. There may also
be at most one numeric factor, which must be the first one.
"unit" may be encoded as ASCII or UTF-8.
The list of available symbols, in the case where no "units" module is
present, is XXX.
"""
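
To check that the wording above is unambiguous, here is a minimal sketch of
a validator for such strings. It is not part of H5MD or Mosaic. The symbol
and prefix sets are placeholders chosen for the example, since the real list
is still open, and the names parse_unit and check_symbol are hypothetical.
The sketch also assumes that "occurs only once" applies to the symbol
including its prefix, which the text does not settle.

```python
import re

# Placeholder lists for illustration only; the actual list is still "XXX".
BASE_SYMBOLS = {"m", "s", "g", "mol", "K", "J"}
PREFIXES = {"k", "c", "m", "u", "n", "p", "f"}

# A number is an integer or a decimal fraction.
_NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")
# A unit factor is a symbol optionally followed by a signed integer power.
_FACTOR = re.compile(r"([A-Za-z]+)([+-]?[0-9]+)?")


def check_symbol(symbol):
    """Raise ValueError unless symbol is a known unit, with or without prefix."""
    if symbol in BASE_SYMBOLS:
        return
    if symbol[:1] in PREFIXES and symbol[1:] in BASE_SYMBOLS:
        return
    raise ValueError("unknown unit symbol: %r" % symbol)


def parse_unit(text):
    """Return (numeric factor, {symbol: power}) or raise ValueError."""
    factors = text.split(" ")
    if "" in factors:
        raise ValueError("empty factor (empty string or repeated space)")
    scale = 1.0
    powers = {}
    for index, factor in enumerate(factors):
        if _NUMBER.fullmatch(factor):
            # At most one numeric factor, and it must be the first one.
            if index != 0:
                raise ValueError("numeric factor must come first")
            scale = float(factor)
            continue
        match = _FACTOR.fullmatch(factor)
        if match is None:
            raise ValueError("malformed unit factor: %r" % factor)
        symbol = match.group(1)
        power = int(match.group(2) or 1)
        check_symbol(symbol)
        if power == 0:
            raise ValueError("power must be non-zero: %r" % factor)
        if symbol in powers:
            raise ValueError("symbol occurs more than once: %r" % symbol)
        powers[symbol] = power
    return scale, powers


if __name__ == "__main__":
    for example in ["nm3", "nm ps-1", "60 s"]:
        print(example, "->", parse_unit(example))
```

With these placeholder lists, the three examples from the text are accepted,
while strings such as "s 60", "nm nm" or "nm0" are rejected.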
P