
[Monotone-devel] second go at i18n spec


From: graydon hoare
Subject: [Monotone-devel] second go at i18n spec
Date: Mon, 08 Dec 2003 18:01:35 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031115 Thunderbird/0.3

here's a second go at specifying i18n behavior for monotone. unless you have serious qualms with it, I'm thinking I will implement something like this in the next couple weeks.

-graydon

--snip--

monotone internationalization specification
===========================================

0. overview:

    monotone initially dealt with only ASCII characters, in file path
    names, certificate names, key names, URLs, and packets. some
    conservative extensions are provided to permit internationalized
    use. these extensions are broadly confined to user interface
    concerns and, with the exception of manifests (which are gzipped
    and base64 encoded for storage and transport anyways), do not
    alter the stored or transmitted representation of data in
    monotone. monotone's storage and transmission interface is still
    7-bit ASCII.


1. general terms:

  character set conversion:

    the process of mapping a string of bytes representing wide
    characters from one encoding to another.  conversion is specified
    by a lua hook "char_conv" which takes a filename and returns a
    table of two strings: the first represents the "internal"
    (database) charset, the second represents the "external"
    (filesystem) charset.

  line ending conversion:

    the process of converting platform-dependent end-of-line codes
    (0x0D, 0x0A, or the pair 0x0D 0x0A) from one convention to
    another. conversion is specified by a lua hook "line_conv" which
    takes a filename and returns a table of two strings: the first
    represents the "internal" (database) line ending convention, the
    second represents the "external" (filesystem) line ending
    convention. each string should be one of the three strings "CR",
    "LF", or "CRLF". line ending conversion is always performed
    after character set conversion, when both are enabled.
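
    the two conversions and their ordering can be sketched as follows.
    this is only an illustration in python, not monotone code: the
    (internal, external) tuples stand in for the tables returned by the
    "char_conv" and "line_conv" hooks, and all function names are
    hypothetical.

```python
# Illustrative sketch only; function names are hypothetical.
EOL = {"CR": "\r", "LF": "\n", "CRLF": "\r\n"}

def convert_charset(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from charset `src` to charset `dst`."""
    return data.decode(src).encode(dst)

def convert_line_endings(data: bytes, charset: str, src: str, dst: str) -> bytes:
    """Replace the `src` line ending convention with `dst` ("CR", "LF", "CRLF")."""
    text = data.decode(charset)
    return text.replace(EOL[src], EOL[dst]).encode(charset)

def external_to_internal(data: bytes, char_conv, line_conv) -> bytes:
    """Filesystem form -> database form: charset first, then line endings."""
    internal_cs, external_cs = char_conv      # stand-in for the hook's table
    internal_eol, external_eol = line_conv    # stand-in for the hook's table
    converted = convert_charset(data, external_cs, internal_cs)
    return convert_line_endings(converted, internal_cs, external_eol, internal_eol)

def internal_to_external(data: bytes, char_conv, line_conv) -> bytes:
    """Database form -> filesystem form, in the same order of operations."""
    internal_cs, external_cs = char_conv
    internal_eol, external_eol = line_conv
    converted = convert_charset(data, internal_cs, external_cs)
    return convert_line_endings(converted, external_cs, internal_eol, external_eol)
```

    for example, a latin-1 file with CRLF line endings would be stored
    as UTF-8 with LF line endings when the hooks return
    ("utf-8", "iso-8859-1") and ("LF", "CRLF").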

  normal form conversion:

    character set and line ending conversions are done between a
    "system form" and a "normal form". system forms are specified by a
    pair of hooks "system_charset" and "system_line", and if either
    hook is undefined monotone attempts to infer a default using the
    LC_CTYPE locale category, and other system-dependent features, if
    present. the normal form is the UTF-8 character set and the 0x0A
    (LF) line ending form.
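
    a minimal python sketch of normal form conversion, under some
    assumptions: the optional arguments stand in for the
    "system_charset" and "system_line" hooks, the locale's preferred
    encoding stands in for the LC_CTYPE inference, and os.linesep stands
    in for the "other system-dependent features". none of these names
    come from monotone.

```python
import locale
import os

EOL = {"CR": "\r", "LF": "\n", "CRLF": "\r\n"}
EOL_NAMES = {value: name for name, value in EOL.items()}

def system_form(system_charset=None, system_line=None):
    """Return (charset, line convention), inferring defaults when a hook is undefined."""
    charset = system_charset or locale.getpreferredencoding(False)
    line = system_line or EOL_NAMES.get(os.linesep, "LF")
    return charset, line

def to_normal_form(data: bytes, system_charset=None, system_line=None) -> bytes:
    """System form -> normal form (UTF-8, LF)."""
    charset, line = system_form(system_charset, system_line)
    return data.decode(charset).replace(EOL[line], "\n").encode("utf-8")

def from_normal_form(data: bytes, system_charset=None, system_line=None) -> bytes:
    """Normal form (UTF-8, LF) -> system form."""
    charset, line = system_form(system_charset, system_line)
    return data.decode("utf-8").replace("\n", EOL[line]).encode(charset)
```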

  LDH:

    letters, digits, and hyphen: the set of ASCII bytes 0x2D,
    0x30..0x39, 0x41..0x5A, and 0x61..0x7A.

  stringprep:

    RFC 3454, a general framework for mapping, normalizing,
    prohibiting and bidirectionality checking for international names
    prior to use in public network protocols.

  nameprep:

    RFC 3491, a specific profile of stringprep, used for preparing
    international domain names (IDNs).

  punycode:

    RFC 3492, a "bootstring" encoding of unicode into ASCII.

  IDNA:

    RFC 3490, international domain names for applications, a
    combination of the above technologies (nameprep, punycoding,
    limiting to LDH characters) to form a specific "ASCII compatible
    encoding" (ACE) of unicode, signified by the presence of an
    "unlikely" ACE prefix string "xn--". IDNA is intended to make it
    possible to use unicode relatively "safely" over legacy
    ASCII-based applications. the general picture of an IDNA string is
    this:

      {ACE-prefix}{LDH-sanitized(punycode(nameprep(UTF-8-string)))}

    it is important to understand that IDNA encoding does *not*
    preserve the input string: it both prohibits a wide variety of
    possible strings and normalizes non-equal strings to supposedly
    "equivalent" forms.

    by default, monotone does *not* decode IDNA when printing to the
    console (IDNA names are ASCII, which is a subset of UTF-8, so this
    normal form conversion can still apply, albeit oddly). this
    behavior is to protect users against security problems associated
    with malicious use of "similar-looking" characters. if the hook
    "display_decoded_idna" returns true, IDNA names are decoded for
    display.
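
    to illustrate both points (normalization of non-equal inputs, and
    leaving names undecoded for display), here is a small sketch using
    python's built-in "idna" codec, which implements RFC 3490. the
    function names are hypothetical, and the boolean argument stands in
    for the "display_decoded_idna" hook.

```python
def idna_encode_label(label: str) -> str:
    """nameprep + punycode + ACE prefix for a single label."""
    return label.encode("idna").decode("ascii")

def idna_display(name: str, display_decoded_idna: bool = False) -> str:
    """Return the string to print; decode only when the hook says so."""
    if display_decoded_idna:
        return name.encode("ascii").decode("idna")
    return name

# Two non-equal inputs normalize to the same ACE string.
lower = idna_encode_label("b\u00fccher")   # "bücher"
upper = idna_encode_label("B\u00dcCHER")   # "BÜCHER"
```

    here both `lower` and `upper` come out as "xn--bcher-kva", so the
    original capitalization cannot be recovered from the encoded form.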


2. filenames:

  - filenames are subject to normal form conversion.

  - filenames are subject to an additional normal form stage which
    adjusts for platform name semantics, for example changing the
    Windows 0x5C '\' path separator to 0x2F '/' and changing the root
    designator from "C:\" to "/". this extra processing is performed
    by boost::filesystem.

  - FIXME: what do we do about case sensitivity on Windows?

  - the internal form of filenames has additional structural
    restrictions:

    - a filename is a nonempty sequence of path components, separated
      by byte 0x2F (ASCII / ), and without a leading or trailing 0x2F.

    - a path component is a nonempty sequence of any UTF-8 character
      codes except the path separator byte 0x2F and any ASCII "control
      codes" (0x00..0x1F and 0x7F).

    - the path components "." and ".." are prohibited.

  - manifests are constructed from the normal form (UTF-8 / LF). the
    LC_COLLATE locale category is *not* used to sort manifest entries.
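
    the structural restrictions on the internal form could be checked
    along these lines. this is a python sketch with hypothetical names;
    the spec only says LC_COLLATE is not used, so sorting by raw UTF-8
    bytes in the last helper is an assumption about what replaces it.

```python
def check_internal_filename(name: str) -> None:
    """Raise ValueError if `name` violates the internal filename rules."""
    if not name:
        raise ValueError("filename is empty")
    for component in name.split("/"):
        # An empty component means a leading, trailing, or doubled '/'.
        if component == "":
            raise ValueError("empty path component")
        if component in (".", ".."):
            raise ValueError("'.' and '..' are prohibited")
        if any(ord(c) < 0x20 or ord(c) == 0x7F for c in component):
            raise ValueError("ASCII control code in path component")

def is_internal_filename(name: str) -> bool:
    try:
        check_internal_filename(name)
    except ValueError:
        return False
    return True

def sort_manifest_names(names):
    """Locale-independent ordering (assumed: raw UTF-8 byte order)."""
    return sorted(names, key=lambda n: n.encode("utf-8"))
```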


3. file contents:

  - files are subject to character set conversion and line ending
    conversion.

  - file SHA1 values are calculated from the internal form that
    results from these conversions.
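
    one consequence is that the same text checked in from differently
    configured systems gets the same SHA1. a python sketch, assuming a
    UTF-8 / LF internal form and a hypothetical helper name:

```python
import hashlib

def internal_sha1(external_data: bytes, external_charset: str,
                  external_eol: str) -> str:
    """Hash the internal form (assumed here to be UTF-8 with LF line endings)."""
    text = external_data.decode(external_charset)
    internal = text.replace(external_eol, "\n").encode("utf-8")
    return hashlib.sha1(internal).hexdigest()

# The same text as stored on two differently configured systems.
text = "caf\u00e9\nbar\n"
on_windows = text.replace("\n", "\r\n").encode("cp1252")
on_unix = text.encode("utf-8")

same = (internal_sha1(on_windows, "cp1252", "\r\n")
        == internal_sha1(on_unix, "utf-8", "\n"))
```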


4. UI messages:

  - UI messages are displayed via calls to gettext().


5. URLs:

  - URLs are read on the command-line and subject to normal form
    conversion. each component of a URL is then subject to a different
    further normalization or conversion:

    - host names: split at '.', each component subject to IDNA
      encoding, rejoined.

    - path components in HTTP URL: all ASCII control codes (0x00..0x1F
      and 0x7F) and code 0x20 (ASCII SP) are prohibited; the 10
      reserved characters ";/?:@&=+$," and all other UTF-8 codes
      outside the unreserved class of 72 chars (62 alphanum, 9 mark
      "-_.!~*'()" and one escape "%") are escaped into (%xx) hex
      form. this roughly follows RFC 2396.

    - mailto addresses and NNTP group names: split at '.' and '@',
      each component subject to IDNA encoding, rejoined. this roughly
      follows the procedure suggested in
      draft-faerber-i18n-email-netnews-names-00.txt.

    after these processing stages, URLs are stored internally as ASCII.
    the invariant is that a URL inside monotone contains only the 72
    unreserved bytes and the 3 reserved bytes "/@." described in RFC
    2396.
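
    the host name and path component rules might look like this in
    python. this is a sketch with hypothetical function names, using
    the standard "idna" codec and urllib.parse.quote; it assumes the
    input has already been through normal form conversion.

```python
from urllib.parse import quote

# The 9 "mark" characters of RFC 2396, plus the escape character '%'.
UNRESERVED_MARKS = "-_.!~*'()"

def encode_host(host: str) -> str:
    """Split at '.', IDNA-encode each component, rejoin."""
    labels = host.split(".")
    return ".".join(label.encode("idna").decode("ascii") for label in labels)

def encode_path_component(component: str) -> str:
    """Reject control codes and SP, then %xx-escape everything not unreserved."""
    if any(ord(c) <= 0x20 or ord(c) == 0x7F for c in component):
        raise ValueError("control code or space in URL path component")
    # quote() always leaves alphanumerics alone; `safe` adds marks and '%'.
    return quote(component, safe=UNRESERVED_MARKS + "%")
```

    with these helpers, the host "bücher.example" becomes
    "xn--bcher-kva.example" and the path component "résumé.txt" becomes
    "r%C3%A9sum%C3%A9.txt".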


6. cert names:

  - read on the command line and subject to normal form conversion and
    IDNA encoding as a single component. the invariant is that a cert
    name inside monotone is a single LDH ASCII string.


7. cert values:

  - cert values may be either text or binary, depending on the return
    value of the hook "cert_is_binary". if binary, the cert value is
    never printed to the screen (the literal string "<binary>" is
    displayed instead), and is never altered or character
    converted. if text, the cert value is subject to normal form
    conversion, as well as having all UTF-8 codes corresponding to
    ASCII control codes (0x00..0x1F and 0x7F) prohibited in the normal
    form, except 0x0A (LF).
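
    a python sketch of the text/binary split. the `is_binary` flag
    stands in for the return value of the "cert_is_binary" hook, and the
    function names and explicit charset / line ending arguments are
    hypothetical.

```python
def store_cert_value(raw: bytes, is_binary: bool,
                     system_charset: str = "utf-8",
                     system_eol: str = "\n") -> bytes:
    """Return the bytes to store for a cert value."""
    if is_binary:
        return raw  # never altered or character converted
    # Text: normal form conversion (UTF-8, LF) ...
    text = raw.decode(system_charset).replace(system_eol, "\n")
    # ... then prohibit control codes other than LF.
    for ch in text:
        if (ord(ch) < 0x20 and ch != "\n") or ord(ch) == 0x7F:
            raise ValueError("control code %#04x in text cert value" % ord(ch))
    return text.encode("utf-8")

def display_cert_value(stored: bytes, is_binary: bool) -> str:
    """Return what is shown to the user for a stored cert value."""
    return "<binary>" if is_binary else stored.decode("utf-8")
```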


8. key names:

  - read on the command line and subject to normal form conversion and
    IDNA encoding as an email address (split and joined at '.' and '@'
    characters). the invariant is that a key name inside monotone
    contains only LDH, '.' and '@' characters.
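
    the cert name and key name invariants can be sketched together,
    since a cert name is the single-component case of the same
    encoding. this is python using the standard "idna" codec; the
    function names are hypothetical.

```python
import re

LDH = re.compile(r"[A-Za-z0-9-]+")

def idna_component(component: str) -> str:
    """IDNA-encode one component and enforce the LDH invariant."""
    encoded = component.encode("idna").decode("ascii")
    if not LDH.fullmatch(encoded):
        raise ValueError("not a single LDH component: %r" % component)
    return encoded

def encode_cert_name(name: str) -> str:
    """A cert name is encoded as a single component."""
    return idna_component(name)

def encode_key_name(name: str) -> str:
    """A key name is split at '.' and '@', encoded per component, rejoined."""
    parts = re.split(r"([.@])", name)  # the capture group keeps separators
    return "".join(p if p in (".", "@") else idna_component(p) for p in parts)
```

    for example, the key name "graydon@bücher.example" would be stored
    as "graydon@xn--bcher-kva.example", while a cert name containing a
    '.' would be rejected.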

9. packets:

  - packets are still 7-bit ASCII. the characters permitted in packets
    are the union of these character sets:

     - the 64 characters of base64 encoding
     - the 16 characters of hex encoding
     - LDH, '@' and '.' characters, as required for key and cert names
     - '[' and ']'
     - ASCII codes 0x0D (CR), 0x0A (LF), 0x09 (HT), and 0x20 (SP).
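
    the union above is small enough to write out directly. a python
    sketch with hypothetical names; it assumes lowercase hex digits and
    takes the list literally, so the base64 padding character '=' is
    not included.

```python
import string

ALNUM = string.ascii_letters + string.digits

BASE64_CHARS = set(ALNUM + "+/")          # the 64 base64 characters
HEX_CHARS = set("0123456789abcdef")       # assumed lowercase
NAME_CHARS = set(ALNUM + "-" + "@.")      # LDH plus '@' and '.'
BRACKETS = set("[]")
WHITESPACE = set("\r\n\t ")               # CR, LF, HT, SP

PACKET_CHARS = BASE64_CHARS | HEX_CHARS | NAME_CHARS | BRACKETS | WHITESPACE

def is_packet_text(text: str) -> bool:
    """True if every character is in the permitted packet character set."""
    return all(ch in PACKET_CHARS for ch in text)
```

    since the hex digits and most of LDH overlap with base64, the union
    comes to 73 distinct characters under these assumptions.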

10. explicit rename certs:

  - incompatible change: whitespace delimiter changed to UTF-8 code
    0x0A, to permit UTF-8 code 0x20 in filenames.
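
    to show why the delimiter change matters, here is a python sketch
    that assumes a rename cert value holds exactly two filenames, old
    then new. that layout and the function names are assumptions made
    for illustration, not the actual cert format.

```python
def join_rename(old: str, new: str) -> str:
    """Join two filenames with the 0x0A delimiter."""
    for name in (old, new):
        if "\n" in name:
            raise ValueError("filename contains the delimiter 0x0A")
    return old + "\n" + new

def split_rename(value: str):
    """Split a rename value back into (old, new)."""
    parts = value.split("\n")
    if len(parts) != 2:
        raise ValueError("expected exactly two filenames")
    return parts[0], parts[1]

# Spaces survive, which a whitespace delimiter could not guarantee.
value = join_rename("my notes.txt", "docs/my notes.txt")
```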




