Re: [Gnu-arch-users] [OT] Unicode vs. legacy character sets
From: Tom Lord
Subject: Re: [Gnu-arch-users] [OT] Unicode vs. legacy character sets
Date: Tue, 3 Feb 2004 11:05:17 -0800 (PST)
> From: Aaron Bentley <address@hidden>
> On Tue, 2004-02-03 at 13:16, Tom Lord wrote:
>> So what I am (tentatively) willing to do is this: if there's enough
>> programmers who both (a) want to help with my software and (b) are
>> against unification -- I'm willing to have libhackerlab (hence Pika
>> and arch) use an _extended_ Unicode. Standardizing, within those
>> libraries and programs on assigning-by-convention some private-use
>> codepoints to un-unified characters.
> I'm an ignorant English speaker, but would it be possible to make the
> private characters be combining character? That is, you'd have a
> combining character to indicate that the next character is Chinese,
> Japanese, or Korean, then use the unified Han character.
Not cleanly, no. You have that backwards. The combining character
would _follow_ the unified character.
<unified><language-tag><unified><language-tag>.....
That is, indeed, perfectly workable logically -- but not, I think, the
best way to do it. It's a _plausible_ approach in that I don't think
there's enough private-use 16-bit code-space to squeeze un-unified
ideographs into the basic multilingual plane (codepoints that fit into
16 bits) --- so in UTF-8 and UTF-16 you'll be paying a comparable
space penalty anyway. But:
First: I happen to believe [long explanation of why elided] that
internally to applications, UTF-32 representations are actually
important. The combining character approach doubles the length of a
string in UTF-32.
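A minimal sketch of that cost, assuming a hypothetical private-use tag
codepoint (U+E001 here -- nothing about this choice is standardized):

```python
# Combining-tag scheme: every unified Han character is followed by a
# language-tag character drawn from the BMP private-use area.
# The tag codepoint below is a hypothetical assignment.
TAG_JAPANESE = "\ue001"

def tag_text(unified: str, tag: str) -> str:
    """Follow each character with a language-tag character."""
    return "".join(ch + tag for ch in unified)

han = "\u76f4\u9aa8"            # two unified CJK ideographs
tagged = tag_text(han, TAG_JAPANESE)

# In a fixed-width (UTF-32-style) view, the codepoint count doubles.
assert len(tagged) == 2 * len(han)
```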
Second: the unified characters are single codepoints (at least mostly
so -- I'm not certain) and thus admit some useful algorithms that
operate on (fixed-width) codepoints rather than (unbounded-width)
combining-character sequences. If you really want to refute
unification by demonstration, then the alternative demonstrated should
propose alternative codepoints, not combining-character sequences.
For example: if you wrote an Emacs based on libhackerlab, then
compiling a version that works on "non-unified Unicode" should be
little more than a matter of substituting extended databases for the
raw-data Unicode databases.
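The codepoint-substitution alternative could be sketched like this;
the plane-15 private-use mapping is my own illustrative choice, not
anything libhackerlab actually assigns:

```python
# Hypothetical table mapping a unified codepoint to a variant-specific
# codepoint in the plane-15 private-use area (U+F0000..U+FFFFD).
# The offset scheme is illustrative only.
JAPANESE_VARIANT = {
    0x76F4: 0xF0000 + 0x76F4,   # unified U+76F4 -> private-use U+F76F4
}

def to_extended(unified: str, table: dict) -> str:
    """Re-code unified characters as fixed-width extended codepoints."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in unified)

s = "\u76f4"
ext = to_extended(s, JAPANESE_VARIANT)
assert len(ext) == len(s)       # still one codepoint per character
```

String length is unchanged, so fixed-width codepoint algorithms keep
working; only the lookup tables differ.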
That said: it's "not my problem" -- it's not my place to pick one of
these two approaches over the other, and neither of my arguments is
absolute.
>> That wouldn't provide interoperability with everything in the world --
>> far from it. For example, it would be (at best) a long time before
>> browsers would recognize the non-standard characters.
> That way, the raw output would be legible (though ugly) for non-savvy
> programs, and conversion to standard Unicode would be a matter of
> deleting the combining characters.
(Lossy) conversion to standard Unicode is trivial with either
approach.
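For illustration, assuming tag characters live in the BMP private-use
area and un-unified codepoints in a plane-15 private-use range (both
hypothetical assignments), the lossy conversion is a few lines either
way:

```python
# Lossy conversion back to standard Unicode under either scheme.
# Both private-use ranges here are hypothetical assignments.
PUA_TAGS = {chr(cp) for cp in range(0xE000, 0xE010)}  # tag characters

def untag(s: str) -> str:
    """Combining-tag scheme: just delete the tag characters."""
    return "".join(ch for ch in s if ch not in PUA_TAGS)

def fold(s: str) -> str:
    """Alternative-codepoint scheme: fold extended codepoints back
    onto the unified codepoints they were derived from."""
    out = []
    for ch in s:
        cp = ord(ch)
        if 0xF0000 <= cp <= 0xFFFFD:   # hypothetical extended range
            cp -= 0xF0000              # recover the unified codepoint
        out.append(chr(cp))
    return "".join(out)

assert untag("\u76f4\ue001\u9aa8\ue001") == "\u76f4\u9aa8"
assert fold(chr(0xF76F4)) == "\u76f4"
```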
Display -- hmm.... y[]o[]u[] d[]o[] h[]a[]v[]e[] a[] g[]o[]o[]d[]
p[]o[]i[]n[]t[] t[]h[]e[]r[]e[].
-t