discuss-gnustep

Re: Language Setup Document


From: Alexander Malmberg
Subject: Re: Language Setup Document
Date: Tue, 08 Jul 2003 21:05:12 +0200

Kazunobu Kuriyama wrote:
> Pete French wrote:
[snip]
> > O.K. I didn't realise that there was a C standard for multi-byte. Have
> > gone and looked it up (and am trying to understand what the hell it is
> > trying to do).
> 
> You'll find it the hell literally...

Fortunately, the C standard multi-byte stuff isn't very relevant here.

> >> I'm afraid what you observed with Russian characters is irrelevant to
> >> the multibyte representation. Actually, as you noted, each of them is
> >> represented with 8-bit.
> >
> > ...or can be. They are represented as 16 bit in Unicode (and have to be
> > as their values are outside the range of a single byte).

Unicode characters don't have any inherent width, and strictly speaking
you can't represent something "as Unicode" in this sense. Unicode just
gives each character a code point. You only get a width after encoding
it, and this width depends on which encoding is used (and can even
depend on the context, if the encoding is stateful).

For example, when encoding using iso8859-1 (latin1), Unicode characters
0-255 are represented with 8 bits, and other characters can't be
represented. When using koi8-r, Unicode characters 0-127, 0x430, 0x431,
0x432, and a bunch of other Cyrillic characters are represented with 8
bits, and other characters can't be represented. When using ucs2, which
(simplified) is what NSString uses internally, characters with code
points <=0xffff are represented with 16 bits, and other characters can't
be represented.
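
The dependence of width on encoding can be checked with a small Python
sketch. The helper name encoded_width is made up for illustration, and
utf-16-be stands in for ucs2 here, on the assumption that the two
coincide for the code point used (0x430, which is below 0xffff and not a
surrogate).

```python
def encoded_width(ch, encoding):
    """Return how many bytes `ch` takes in `encoding`, or None if the
    encoding cannot represent it at all."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# U+0430 CYRILLIC SMALL LETTER A: one code point, several possible widths.
cyrillic_a = "\u0430"

widths = {
    enc: encoded_width(cyrillic_a, enc)
    for enc in ("latin-1", "koi8_r", "utf-16-be", "utf-8")
}

for enc, width in widths.items():
    print(enc, "cannot represent it" if width is None else f"{width} byte(s)")
```

The same code point is unrepresentable in latin1, takes one byte in
koi8-r and two bytes in the 16-bit encoding, so the width belongs to
the encoding, not to the character.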

> > I just tried to convert 0x400 to multibyte representation to see what
> > would happen though, and got a single zero byte ?! (but no error return)
> 
> Broadly speaking, Unicode defines a one-to-one mapping from an integer
> to a glyph
> or the shape of a character,

This is wrong. Unicode defines abstract characters. These do not map
one-to-one to glyphs.
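
One way to see that characters and glyphs are not one-to-one is
Unicode normalization: a single visible "e with acute accent" can be
stored as one code point or as two. The sketch below uses Python's
standard unicodedata module; the variable names are only illustrative.

```python
import unicodedata

# U+00E9 LATIN SMALL LETTER E WITH ACUTE as a single code point.
composed = "\u00e9"

# Canonical decomposition (NFD) splits it into a base letter plus a
# combining accent; a renderer would normally draw both as one glyph.
decomposed = unicodedata.normalize("NFD", composed)
code_points = [hex(ord(c)) for c in decomposed]

# Recomposing (NFC) gets back the single code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(len(composed), len(decomposed), code_points, same_after_nfc)
```

Two different code point sequences end up as the same shape on screen,
and in the other direction a ligature can draw several characters as a
single glyph.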

> and is independent of encoding.  To display
> a unicode
> datum on the screen, you need to convert it into an object in another format
> in such a way that the underlying API (X library etc.) can accept it.

Yes, glyphs.

> The actual character to be displayed is determined by the locale or encoding
> in use, which relates the resulting object to the position of a given
> font table.

Unfortunately, X's font handling is a big mess, which messes up the neat
definitions and terminology. However, from a high-level point of view
(e.g. as -gui sees it), you can think of the reencoding of the Unicode
characters as glyph generation.

[snip]
> The problem we have been talking about lies in how to convert a Unicode
> datum into another one which can be accepted by the underlying API to
> display
> the character. The ways of conversion include utf8, multibyte, wide
> char, ...
> The real hell on the earth.

When using X, yes. When using e.g. FreeType, there's no problem.

- Alexander Malmberg



