[Pika-dev] other char/string things
From: Tom Lord
Subject: [Pika-dev] other char/string things
Date: Sat, 24 Jan 2004 17:13:27 -0800 (PST)

It's been called to my attention that I forgot to specify a "control"
buckybit. Oops.
Just for the record, what we eventually need to arrive at is for
    C-
as a character name prefix to mean "set the control buckybit".
Thus,
    (char->integer #\C-a)
is some large integer, not U+0001.
Additionally, character _names_ (not bucky-like prefixes) are needed
for ASCII control characters:
ctl-a == U+0001
ctl-[ == U+001B
ctl-@ == U+0000
ctl-space == U+0000
etc.
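
To make the distinction concrete, here is a minimal sketch in Python (not Pika code). It assumes, purely for illustration, that buckybits sit directly above a 21-bit codepoint; the constant CONTROL_BUCKYBIT and both function names are hypothetical. The ctl- names use the usual ASCII convention of keeping the low five bits of the uppercased key.

```python
# Hypothetical layout: 21 codepoint bits with buckybits stored above them.
# The actual bit position used by Pika is not specified here.
CODEPOINT_BITS = 21
CONTROL_BUCKYBIT = 1 << CODEPOINT_BITS


def ctl_name_to_codepoint(name):
    """Map a character name such as 'ctl-a' to an ASCII control codepoint."""
    if not name.startswith("ctl-"):
        raise ValueError("not a ctl- character name: %r" % name)
    key = name[len("ctl-"):]
    if key == "space":
        return 0x00
    if len(key) != 1:
        raise ValueError("unknown ctl- character name: %r" % name)
    # Usual ASCII convention: keep the low five bits of the uppercased key.
    return ord(key.upper()) & 0x1F


def control_prefixed(ch):
    """Value of a C- prefixed character: same codepoint, control bit set."""
    return ord(ch) | CONTROL_BUCKYBIT
```

With these assumptions, ctl_name_to_codepoint("ctl-[") is 0x1B, while control_prefixed("a") is a value above 2**21 and so can never collide with U+0001.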
We currently have four buckybits, and adding control will make five.
I think the total _possible_ number of buckybits is 8, because
32 == 21 + 8 + 3, 3 == log_2(8), and 2*sizeof (t_scm_word) is 8 on a
32 bit machine. In other words,
    tag-bits + codepoint + buckybits == 32
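
The bit budget can be checked with a short Python sketch. The field order (tag in the low bits, then codepoint, then buckybits) is an assumption made only for this example, and pack_char and unpack_char are hypothetical names, not Pika functions.

```python
TAG_BITS = 3         # log_2(8)
CODEPOINT_BITS = 21  # enough for U+0000..U+10FFFF
BUCKY_BITS = 8

# The three fields exactly fill a 32-bit word.
assert TAG_BITS + CODEPOINT_BITS + BUCKY_BITS == 32


def pack_char(tag, codepoint, bucky):
    """Pack the three fields into one 32-bit integer (assumed field order)."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag out of range")
    if not 0 <= codepoint < (1 << CODEPOINT_BITS):
        raise ValueError("codepoint out of range")
    if not 0 <= bucky < (1 << BUCKY_BITS):
        raise ValueError("buckybits out of range")
    return (bucky << (TAG_BITS + CODEPOINT_BITS)) | (codepoint << TAG_BITS) | tag


def unpack_char(word):
    """Inverse of pack_char: return (tag, codepoint, bucky)."""
    tag = word & ((1 << TAG_BITS) - 1)
    codepoint = (word >> TAG_BITS) & ((1 << CODEPOINT_BITS) - 1)
    bucky = word >> (TAG_BITS + CODEPOINT_BITS)
    return tag, codepoint, bucky
```

Packing the maximum value of every field gives 0xFFFFFFFF, so there is no room left for a ninth buckybit in a 32-bit word.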
Of those 8 possible buckybits, I'd like to reserve 2 for purely
internal use and to make use of these in uni_utf32 strings. The
purpose of these extra bits in UTF32 strings is to represent
ill-formed and unrepresentable sequences of Unicode characters. (See
enclosed. There are variations on that idea using just 1 bit, and
variations using the four values of 2 bits differently. We can work
that out as it comes up, which won't be for a while.)
So that will leave one unallocated bit in characters (8 possible,
less 5 user-visible buckybits, less 2 internal ones).
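
As a rough illustration of the round-trip idea, here is a Python sketch of the 1-bit variation. It is not Pika's representation: the ILL_FORMED flag position and both function names are hypothetical, and it leans on Python's "surrogateescape" error handler to locate the bytes that are not valid UTF-8.

```python
# Hypothetical internal flag, placed just above the 21 codepoint bits.
ILL_FORMED = 1 << 21


def decode_lossless(data):
    """Decode UTF-8 bytes into integers, flagging each undecodable byte."""
    units = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # surrogateescape maps a bad byte 0x80..0xFF to U+DC80..U+DCFF;
            # store the raw byte value together with the flag instead.
            units.append(ILL_FORMED | (cp - 0xDC00))
        else:
            units.append(cp)
    return units


def encode_lossless(units):
    """Inverse of decode_lossless: flagged units are emitted as raw bytes."""
    out = bytearray()
    for u in units:
        if u & ILL_FORMED:
            out.append(u & 0xFF)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)
```

Under these assumptions, encode_lossless(decode_lossless(data)) == data for any byte string, including binary data that is not UTF-8 at all.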
-t
> From: Tom Lord <address@hidden>
> [To: gnu-arch-users]
[....]
> > Just decide how many ISO 10646 planes you want to support, and use the
> > appropriate number of bits (21 is fine). Use an additional bit to
> > squeeze in 256 code positions you might want to use to represent invalid
> > UTF-8 input data (so you have round-trip capability even for binary
> > files accidentally interpreted as UTF-8).
> I'm not giving UTF-8 that kind of privileged role in Pika.
> However, it's a fascinating idea and I thank you for it. It solves a
> nasty little problem I was facing.
> Let's suppose that I use up two buckybits purely internally to
> represent "ill-formed-characters". That is to say: users would have
> 6 buckybits, not 8, and there are two bits per character for internal
> use.
> I don't actually need 2 bits --- I just need a bit more than 1.5 and
> current hw isn't too good at fractional (let alone irrationally
> fractional) bits yet.
> Now I can have a string like:
> <00 codepoint><00 codepoint><01 bogus><10 bogus><10 bogus><00 codepoint>
> ^
> |
> X
> in which <01 bogus> and <10 bogus><10 bogus> are ill-formed combining
> character sequences that should be treated as distinct graphemes by
> procedures like GRAPHEME-LENGTH and GRAPHEME-REF.
> Now if I insert a string of the form:
> <01 bogus>
> at point X in that string, then the result is:
> <00 cp><00 cp><01 bogus><10 bogus><01 bogus><01 bogus><00 cp>
> \ /\ /\ /
> \ / \ / \ /
> modified insertion modified by
> by insertion
> insertion
> In other words, such an insertion has to change adjacent characters to
> preserve the "bogus grapheme" boundaries.
> The upshot of this is that I can pun a single string as both a
> sequence of codepoints and a sequence of (possibly ill-formed)
> combining sequences -- and that is, btw, sufficient to provide the
> round-tripping ability you were after not only for UTF-8 but for
> UTF-16 and UTF-32 as well. Total win -- thanks again.
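
One way to read the tagging scheme in the quoted message is sketched below in Python. This is an interpretation, not Pika code. It assumes that a run of adjacent elements carrying the same non-zero tag forms one bogus grapheme, and, as a simplification, that every 00-tagged codepoint is a grapheme by itself (which ignores real combining sequences). The name grapheme_spans is hypothetical.

```python
from itertools import groupby

VALID = 0b00  # tag of an ordinary codepoint; 0b01 and 0b10 mark bogus elements


def grapheme_spans(tags):
    """Return (start, end) index pairs, one per grapheme, for a tag list."""
    spans = []
    start = 0
    for tag, run in groupby(tags):
        length = len(list(run))
        if tag == VALID:
            # Simplification: each valid codepoint stands alone.
            spans.extend((i, i + 1) for i in range(start, start + length))
        else:
            # Assumption: equal adjacent bogus tags belong to one grapheme.
            spans.append((start, start + length))
        start += length
    return spans
```

For the first example string, with tags 00 00 01 10 10 00, this gives five graphemes: two valid codepoints, the single <01 bogus>, the pair <10 bogus><10 bogus>, and the final codepoint. Under this reading, something like GRAPHEME-LENGTH would be the number of spans.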