lilypond-user

Re: problems with german umlauts


From: Werner LEMBERG
Subject: Re: problems with german umlauts
Date: Fri, 26 Jan 2007 09:03:58 +0100 (CET)

> well it was an approximation (due to the previously mentionned lack
> of vocabulary)

Do you mean that your English isn't sufficient to describe things
correctly or that the issue itself is difficult to describe?

> ISO 2022 (as well as SHIFT-JIS and other japaneses encoding of the
> same type) use indeed "artificial" 8bit characters.

`Artificial'?  Not at all!  Almost all of the registered 8bit
character sets have been in actual use at some time, somewhere.  The
same holds for the 16bit encodings.  Note that even Unicode (encoded
as UTF-8) has been registered, and there exist proper escape sequences
to switch from ISO 2022 to Unicode and back to ISO 2022.[1]

> The 0-127 range is always almost compatible with ASCII

Uh, oh, you are entering muddy waters.  In the old days, people didn't
actually use ASCII but officially approved national variants of
ISO-646.  IIRC, about 10 characters in the range 0x20-0x7F are
variable.
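
To make the variable positions concrete, here is a minimal Python
sketch using the German national variant (DIN 66003).  The mapping
table is written out by hand and the helper name `decode_din66003` is
hypothetical; the sketch assumes the input is 7-bit data and only
illustrates the idea that a handful of ASCII positions are reassigned.

```python
# Positions where the German ISO-646 variant (DIN 66003) differs from ASCII.
# The table is written out by hand for illustration; other national variants
# reassign a similar handful of positions to their own characters.
DIN_66003 = {
    0x40: "§",  # ASCII at sign
    0x5B: "Ä",  # ASCII left square bracket
    0x5C: "Ö",  # ASCII backslash
    0x5D: "Ü",  # ASCII right square bracket
    0x7B: "ä",  # ASCII left curly brace
    0x7C: "ö",  # ASCII vertical bar
    0x7D: "ü",  # ASCII right curly brace
    0x7E: "ß",  # ASCII tilde
}


def decode_din66003(data: bytes) -> str:
    """Decode 7-bit bytes as DIN 66003 (hypothetical helper)."""
    return data.decode("ascii").translate(DIN_66003)


raw = b"gr}n {ndern"
print(raw.decode("ascii"))    # the bytes read as ASCII: gr}n {ndern
print(decode_din66003(raw))   # the same bytes read as DIN 66003: grün ändern
```

The same byte string yields different text depending on which variant
the reader assumes, which is why the 0x20-0x7F range is only `almost'
compatible with ASCII.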

> and there is 2 escaping character which work like double quotes.
> Inside quotes, character are multibyte (indeed it's impossible to
> store so many kanjis into only 128 slots)

Hmm, `double quotes' is perhaps a bad analogy.  One escape character
(followed by a character set ID) activates a different encoding for
the next character only; the other does the same permanently, that
is, until the encoding is switched again.

> But this option raises more issues than it brings solutions... even
> if it is still widely used in japan (ISO 2022 is still their default
> encoding for e-mailings)

The real problem is that you can encode a single character like `á' in
many ways; for example, you could switch to latin1, or to latin2,
or...  Additionally, ISO 2022 is stateful, that is, if you encounter a
bad or missing byte, the rest of the document might be corrupted.  For
this reason it has become standard to switch back to the default
encoding at the end of a line and to restart the other encoding at the
beginning of the next line.
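
The statefulness and the per-line reset can be observed with Python's
built-in `iso2022_jp` codec.  This is only a sketch: the sample text
is arbitrary, and the behaviour shown is that of this one codec, not
of every ISO 2022 implementation.

```python
text = "日本\n語"
encoded = text.encode("iso2022_jp")

# The encoder returns to ASCII before each newline, so every line carries
# its own escape sequences and can be decoded without the other lines.
lines = encoded.split(b"\n")
independent = [line.decode("iso2022_jp") for line in lines]
print(independent == text.split("\n"))

# Statefulness: drop the first escape sequence (ESC $ B, three bytes) and
# the two-byte kanji codes of the first line are read as plain ASCII.
damaged = encoded[3:].decode("iso2022_jp")
print(damaged.split("\n"))
```

Because the second line starts with its own escape sequence, the
damage stays confined to the first line; without the per-line reset
it could affect everything up to the next switch.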


    Werner


[1]: Just to make clear how ISO 2022 works (slightly simplified): The
     byte range 0-255 is split into four areas: The `control code'
     areas C0 (0x00-0x1F) and C1 (0x80-0x9F), and the left and right
     `graphic code' areas (GL and GR, code ranges 0x20-0x7F and
     0xA0-0xFF).  Three character codes are always at the same
     position: ESC (0x1B), SPACE (0x20), and DELETE (0x7F).  In the
     following, I ignore C0 and C1.

     In a first step, registered character sets are assigned to GL and
     GR.  Normally, GL holds the standard version of ISO-646 (which is
     equal to ASCII if combined with C0), but national variants exist.
     For example, in Japan you'll often find that the backslash (at
     position 0x5C) is replaced with the Yen sign.  GR then gets the
     `extended' character sets with either 96 characters (latin1, for
     example) or 96x96 characters (JIS X 0208 for Japanese, GB 2312
     for Chinese, etc.) or even 96x96x96 (CCCII, a Chinese encoding,
     now defunct).

     It's even possible to not use GR at all: The above-mentioned
     Japanese email encoding uses only the bytes 0x00-0x7F (since
     in former times not all email clients handled 8bit data cleanly),
     switching back and forth between encodings which share the range
     0x20-0x7F.
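
As a small illustration of the four areas described in this footnote,
the following Python sketch classifies byte values and checks which
areas two encodings actually use.  The function names `area` and
`areas_used` are hypothetical, and the sample strings are arbitrary;
the `iso2022_jp` codec is the one shipped with Python.

```python
def area(byte: int) -> str:
    """Name the ISO 2022 code area a byte value (0-255) falls into."""
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"


def areas_used(data: bytes) -> set:
    """Return the set of code areas occupied by the bytes in data."""
    return {area(b) for b in data}


# latin1 places its accented letters in GR.
print(sorted(areas_used("ä".encode("latin-1"))))

# The Japanese email encoding stays within C0 (for ESC) and GL.
print(sorted(areas_used("日本語".encode("iso2022_jp"))))
```

The second result contains neither C1 nor GR, which is what allows
this encoding to pass through software that is not 8bit clean.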



