Re: [Chicken-users] ditching syntax-case modules for the utf8 egg


From: Shawn Rutledge
Subject: Re: [Chicken-users] ditching syntax-case modules for the utf8 egg
Date: Tue, 18 Mar 2008 19:19:40 -0700

On Tue, Mar 18, 2008 at 4:50 PM, John Cowan <address@hidden> wrote:
>  I'm not arguing that point.  I'm arguing that there should be two
>  different kinds of strings, one of which is UTF-8 and one of which
>  contains single-byte characters in an unspecified ASCII-compatible
>  encoding, *and that the Scheme core ought to know the difference*.

Maybe so.

But you would want the usual string operations to work with either
kind of string, right?  (Alex wrote about Gauche; it sounds like they
did a good job with that.)

>  IMHO, UTF-8 BOMs are a baaaad idea, but that's another debate.

Right, we talked a bit about that last September... I'm still not
sure I see why it's baaad, but maybe it could be considered
unnecessary.

It could follow from the general principle of separating metadata from
data: Put the encoding in the extended attributes of the file, or
resource fork if you've got one.  Maybe when Windows is dead, memory
buses have attribute bits for every word, thumb drives ship
preformatted with ReiserFS v5 (optimized for holographic storage), and
tar can archive extended attributes alongside the files, the BOM could
be retired completely.  :-)

>  That turns out not to be the case.  Start Notepad and paste in some
>  random non-ASCII stuff from the Web, do a Save, and see what you get by
>  default (or in earlier versions of Windows, whether you like it or not).
>  You get little-endian UTF-16 with an explicit BOM.

Ookie.

>  FWIU, the main reason is so that Strings can be safely passed between
>  threads.

Yeah, you're probably right.

>  Chicken characters are 24 bits, which is enough to handle the Unicode
>  range of U+0000 to U+10FFFF (fits in, but does not fill up, 21 bits).

Cool.
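
For what it's worth, the arithmetic checks out; a quick sanity check
at the REPL (plain R5RS, nothing Chicken-specific):

  (< #x10FFFF (expt 2 20))   ; => #f, 20 bits is not enough
  (< #x10FFFF (expt 2 21))   ; => #t, 21 bits is
  (< #x10FFFF (expt 2 24))   ; => #t, so 24-bit chars have headroom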

>  Not really, since in UTF-16 some characters are two code units long.
>  Java made that mistake, now partly rectified.

I thought it was still a reasonable assumption most of the time,
except for the few extra scripts that required extending Unicode
beyond 16 bits?  There could be a bit somewhere to indicate whether
the string has any of those characters... but then you'd have to find
out whether it does or not in order to set the bit.
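
To make that concrete, here's roughly how one character above U+FFFF
turns into two UTF-16 code units (just my own illustrative sketch,
not anything from Chicken or any egg):

  ;; Split a code point above #xFFFF into a UTF-16 surrogate pair.
  (define (utf16-surrogates cp)
    (let ((v (- cp #x10000)))
      (values (+ #xD800 (quotient v #x400))      ; high surrogate
              (+ #xDC00 (remainder v #x400)))))  ; low surrogate

  ;; e.g. (utf16-surrogates #x10400) => #xD801 #xDC00, i.e. one
  ;; character becomes two code units, so indexing by code unit
  ;; isn't indexing by character.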

Or have 4 types of strings: byte (restricted) strings, UTF-8, and
fixed-char-size 16- and 24-bit strings.  The latter two could live in
a unicode egg.  The fixed-16-bit type would be useful often enough,
and would save memory a lot of the time.  It could be converted to
the fixed-24-bit type automatically, only when necessary (when
setting a larger character into the string, when reading from a
UTF-8 string that contains such characters, etc.).  From a user's
perspective both of those fixed-char-size types are the same: a
string with O(1) access by index.  But converting between UTF-8 and
the fixed-char-size forms would have to be explicit, because UTF-8 is
the "native" Chicken type, and only in some cases do you need O(1)
string-ref and friends.  Nevertheless the usual string operations
could still do the right thing with all 4 types, right?  (Oh well, it
all sounds like too much work to implement, doesn't it?)
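
Very roughly, a toy sketch of what I mean (every name here is made
up, and the vector-of-code-points representation is only there to
illustrate the promotion idea; a real version would pack 16- or
24-bit cells into a byte buffer):

  (define-record fixed-string width codes)  ; width: 16 or 24 bits/char

  (define (fixed-string-ref s i)            ; O(1) access by index
    (vector-ref (fixed-string-codes s) i))

  (define (fixed-string-set! s i cp)
    (when (and (= (fixed-string-width s) 16) (> cp #xFFFF))
      (fixed-string-width-set! s 24))       ; promote 16-bit to 24-bit
    (vector-set! (fixed-string-codes s) i cp))

  ;; (define s (make-fixed-string 16 (make-vector 4 0)))
  ;; (fixed-string-set! s 0 #x10400)   ; silently widens s to 24-bit
  ;; (fixed-string-ref s 0)            ; => 66560, still O(1)

The widening would be invisible to the caller; the point is just that
indexed access stays O(1).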



