
Re: [Chicken-users] ditching syntax-case modules for the utf8 egg


From: John Cowan
Subject: Re: [Chicken-users] ditching syntax-case modules for the utf8 egg
Date: Tue, 18 Mar 2008 19:50:25 -0400
User-agent: Mutt/1.5.13 (2006-08-11)

Shawn Rutledge scripsit:

> That is a huge advantage.  I think unless there are some
> insurmountable gotchas, or it causes major efficiency problems, there
> are some good arguments for using UTF-8 for strings in Chicken.

I'm not arguing that point.  I'm arguing that there should be two
different kinds of strings, one of which is UTF-8 and one of which
contains single-byte characters in an unspecified ASCII-compatible
encoding, *and that the Scheme core ought to know the difference*.
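
To make the idea concrete, here is a toy sketch of the two-type scheme
(nothing like Chicken's actual representation; the record names are
invented for illustration, and define-record-type is SRFI-9):

    ;; Tag each string so the core can dispatch on its kind.
    (define-record-type tagged-string
      (make-tagged-string kind bytes)   ; kind is 'byte or 'utf-8
      tagged-string?
      (kind tagged-string-kind)
      (bytes tagged-string-bytes))

string-length and friends would then dispatch on the tag: O(1) array
arithmetic for byte strings, a scan for UTF-8 ones.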

> Less common you mean?  I think ASCII is the most common representation
> for everything.  The popularity of XML goes to show what pains people
> are taking to make data human-readable.  (I disagree with the need for
> that a lot of the time, but whatever.)  Source code written by
> non-English speakers is usually ASCII nevertheless.

For a long time, most languages gave you no choice.  I understand that one
of the reasons that Java took off early in Japan was that it was finally
possible to write programs with Japanese identifiers, not just comments.

> My favorite editor, SciTE, BTW supports UTF-8 nicely... it
> preserves the BOM if it is there, assumes ASCII if it is not there,
> and can be told to switch to UTF-8 mode if the file does not have a BOM
> but actually is UTF-8... then when you write the file it prepends the
> BOM.  All exactly as it should be.

IMHO, UTF-8 BOMs are a baaaad idea, but that's another debate.

> I am seeing fewer web pages in other 8-bit codepages (like KOI8-R,
> CP1251 etc.) than there used to be, and/or modern browsers are doing a
> better job detecting the codepage and making it transparent anyway.

The latter is the case.  I considered adding Mozilla's charset detector
to my TagSoup HTML parser, but the detector is about 10 times as big
as the parser, so I didn't.

> I disagree.  Text and HTML files you may find lying about on hard
> drives and web servers all over the world tend to be either ASCII or
> UTF-8, as far as I've seen.  Windows programs may use UTF-16 for
> string variables in memory, and maybe for serialization to "binary"
> files, but not for files that are meant to be human-readable.

That turns out not to be the case.  Start Notepad and paste in some
random non-ASCII stuff from the Web, do a Save, and see what you get by
default (or in earlier versions of Windows, whether you like it or not).
You get little-endian UTF-16 with an explicit BOM.
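
The leading bytes give it away; a quick check (a hypothetical helper,
not any library's API):

    ;; BOM sniffing on the first three bytes of the file.
    (define (sniff-bom b0 b1 b2)
      (cond ((and (= b0 #xFF) (= b1 #xFE)) 'utf-16le)  ; Notepad's default
            ((and (= b0 #xFE) (= b1 #xFF)) 'utf-16be)
            ((and (= b0 #xEF) (= b1 #xBB) (= b2 #xBF)) 'utf-8)
            (else 'unknown)))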

> Insertion has a linear cost though, because the string is a contiguous
> array, right?

Sure.  Many O(1) algorithms become O(N) with UTF-8, which is why some
people want fancier implementations.
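
Concretely, finding the Nth character means scanning for lead bytes; a
sketch (not the utf8 egg's actual code, assuming Chicken's byte-oriented
native strings and its bitwise-and):

    ;; Continuation bytes look like 10xxxxxx; every other byte
    ;; starts a character, so count lead bytes until we reach n.
    (define (utf8-char-offset str n)
      (let loop ((i 0) (seen -1))
        (if (>= i (string-length str))
            (error "character index out of range" n)
            (let* ((b (char->integer (string-ref str i)))
                   (seen (if (= (bitwise-and b #xC0) #x80)
                             seen            ; continuation byte
                             (+ seen 1))))   ; lead byte: new character
              (if (= seen n) i (loop (+ i 1) seen))))))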

> This is probably the reason Java sidestepped the issue by specifying
> that strings are immutable.

FWIU, the main reason is so that Strings can be safely passed between
threads.

> So char has to be 16 or 32 bits right?

Chicken characters are 24 bits, which is enough to handle the Unicode
range of U+0000 to U+10FFFF (fits in, but does not fill up, 21 bits).
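
The arithmetic: U+10FFFF is code point 1114111, so

    (expt 2 20) ; => 1048576, one bit short of 1114112 code points
    (expt 2 21) ; => 2097152, covers the full range with room to spare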

>   (depending on how much of
> Unicode we wish to support... 16 bits is almost always enough)  

It depends on the language of the text you are processing.
We should certainly not handle less than the full Unicode range.

> [If] you do string-ref on a UTF-8 string it will need to return the Unicode
> character at that character index, not the byte from the bytewise
> index, right?  Then unfortunately you have to iterate the string in
> order to count characters, can't just do an offset from the beginning.
>  (This is where UTF-16 as an in-memory representation has an
> advantage.)

Not really, since in UTF-16 some characters are two code units long.
Java made that mistake, now partly rectified.
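
For example, U+1D11E (MUSICAL SYMBOL G CLEF) doesn't fit in one 16-bit
unit; a sketch of how it splits into a surrogate pair (using Chicken's
arithmetic-shift):

    ;; Code points above U+FFFF occupy two UTF-16 code units.
    (define (utf16-surrogates cp)
      (let ((v (- cp #x10000)))
        (list (+ #xD800 (arithmetic-shift v -10))   ; high surrogate
              (+ #xDC00 (bitwise-and v #x3FF)))))   ; low surrogate

    (utf16-surrogates #x1D11E) ; => (55348 56606), i.e. #xD834 #xDD1E

So indexing by 16-bit units can land you inside a character, just as
byte indexing can land you inside a UTF-8 sequence.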

>  When doing in-place modifications with strings that actually have
> non-ASCII characters, actual Unicode is more efficient, so it would be
> nice to be able to switch to that representation when it has
> advantages.  (like Windows does for string variables)

Your understanding is out of date.  UTF-8, UTF-16, and UTF-32 are
equally legitimate encoding forms, and all of them count as "actual
Unicode".
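
They just make different size trade-offs.  For instance, U+20AC (the
euro sign) is three bytes in UTF-8 but a single code unit in UTF-16 or
UTF-32; a sketch of the UTF-8 side (again with Chicken's
arithmetic-shift and bitwise-and):

    ;; Encode one code point as a list of UTF-8 byte values.
    (define (utf8-encode cp)
      (cond ((< cp #x80) (list cp))                      ; 1 byte: ASCII
            ((< cp #x800)                                ; 2 bytes
             (list (+ #xC0 (arithmetic-shift cp -6))
                   (+ #x80 (bitwise-and cp #x3F))))
            ((< cp #x10000)                              ; 3 bytes
             (list (+ #xE0 (arithmetic-shift cp -12))
                   (+ #x80 (bitwise-and (arithmetic-shift cp -6) #x3F))
                   (+ #x80 (bitwise-and cp #x3F))))
            (else                                        ; 4 bytes
             (list (+ #xF0 (arithmetic-shift cp -18))
                   (+ #x80 (bitwise-and (arithmetic-shift cp -12) #x3F))
                   (+ #x80 (bitwise-and (arithmetic-shift cp -6) #x3F))
                   (+ #x80 (bitwise-and cp #x3F))))))

    (utf8-encode #x20AC) ; => (226 130 172), i.e. #xE2 #x82 #xAC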

-- 
Yes, chili in the eye is bad, but so is your    John Cowan
ear.  However, I would suggest you wash your    address@hidden
hands thoroughly before going to the toilet.    http://www.ccil.org/~cowan
        --gadicath



