Re: [Chicken-users] multilingual fowl


From: Felix Winkelmann
Subject: Re: [Chicken-users] multilingual fowl
Date: Wed, 29 Sep 2004 07:51:50 +0200
User-agent: Mozilla Thunderbird 0.5 (X11/20040208)

Alex Shinn wrote:
Hi,

I discovered to my surprise that chicken allows characters to be any
16-bit value:

#;7> (char->integer (integer->char #xFFFF))
65535
#;8> (char->integer (integer->char #x10000))
0

This means you can hack the most commonly used Unicode codepoints into
the characters (more on this below).  If 21 bits were allowed we could
hack in all of the Unicode codepoints; however, on a quick scan of the
source I couldn't find where this limit comes from.  Is it
intentional?  Other primitives seem to disagree on whether or not
character values can exceed 8 bits: for instance, char-alphabetic? only
checks the lowest 8 bits, but char-upcase always seems correct even
when the lower 8 bits would be a lowercase character:

#;1> (char-alphabetic? (integer->char 23383))
#f
#;2> (char-alphabetic? (integer->char 25991))
#t
#;3> (char->integer (char-upcase (integer->char 1633)))
1633

The limit is completely arbitrary. Only the lowest 8 bits are used
to tag the char, the rest can be used as required.


write-char and (write char) both fail for such characters, and while
the former is difficult to fix, the latter could be alleviated with an
escape code for characters like #\x{NNNN}.
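
For instance, a minimal escape-writing helper along these lines would
be enough for the write side (a sketch only; the name
write-char-escaped is hypothetical and not part of the attached code):

  ;; print codepoints above ASCII as #\x{NNNN}, everything else as usual
  (define (write-char-escaped c port)
    (let ((n (char->integer c)))
      (if (> n #x7f)
          (begin
            (display "#\\x{" port)
            (display (number->string n 16) port)
            (display "}" port))
          (write c port))))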

Ok, I'll address this.


So it seems like this may be unintentional, but just by patching up a
few loose ends like (write char) you pave the way for people to build
i18n support on top of the core chicken.

A very simple and efficient strategy that has been used successfully in
other languages is to represent strings internally as UTF-8.  Whenever
this gets mentioned a lot of people just stop, gasp "non-constant time
STRING-REF?!" and dismiss the idea altogether.  However, two points to
take into consideration are 1) this could be made a purely optional
unit, and 2) it's a lot faster than you think.  It's also easy -
attached is a rough initial implementation which still needs work but
is functional and efficient, and the hard parts are already done.
Mostly it needs documentation, but I'll give a brief idea.
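
The core trick is just that every character starts with a byte outside
the #x80-#xBF continuation range, so a length scan only has to count
the lead bytes.  Roughly (an illustration in terms of the native
byte-oriented string-ref, not the attached code itself):

  ;; O(n) character count over a UTF-8 byte string
  (define (utf8-length s)
    (let loop ((i 0) (n 0))
      (if (= i (string-length s))
          n
          (let ((b (char->integer (string-ref s i))))
            (loop (+ i 1)
                  (if (and (>= b #x80) (< b #xC0))   ; continuation byte
                      n
                      (+ n 1)))))))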

Whoa! I'm impressed.


Users who want to take advantage of this would simply (require 'utf8)
at the start of their file.  Strings would still be strings, but
certain string primitives would be redefined to assume they were in a
utf8 encoding.  Raw access to the bytes making up the string would
still be available via procedures like string-byte-ref, but for
general coding you would just use the string procedures you're
familiar with.  An alternative would be to reverse it, so that the
traditional string-ref is unmodified but there is a new
string-utf8-ref procedure.  However, it's more convenient to just
program with normal string procedures all the time and decide when
you need to (require 'utf8) and when your app doesn't need i18n.
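
Roughly, usage would look like this (illustration only; the results
shown are what the proposed unit is meant to return, and string-byte-ref
is the raw accessor described above):

  (require 'utf8)

  (define s "字文")        ; two characters, six bytes in UTF-8
  (string-length s)        ; => 2, the character count
  (string-ref s 1)         ; => the second character, not the second byte
  (string-byte-ref s 0)    ; => the first raw byte of the encoding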

Yes, that would be possible. But one has to keep in mind that the
compiler will replace calls to string primitives with non-unicode-aware
inline C calls when compiling with -O2 or higher, or with
-usual-integrations. This would require something like

(declare (not standard-bindings string-ref ...))

If the set of specially handled primitives is small enough, we
could of course fix the inline routines accordingly.
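
In other words, a file meant to be compiled with -O2 would presumably
start with something like this (a sketch only; whether further
primitives need to be listed depends on what the unit redefines):

  (declare (not standard-bindings string-ref string-set! string-length))
  (require 'utf8)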


First the bad news.  With this implementation the following procedures
take a performance hit from O(1) to O(N):

  string-ref
  string-set!
  string-length

string-length could be restored to O(1) if, instead of using the native
chicken strings directly, we boxed them in records and kept track of
the length.  This, however, would make C access more complicated
because we couldn't just pass the records to C functions expecting
strings.  The current approach works automatically when the C code
doesn't care about the string encoding (or assumes it's utf-8 as in
many modern libraries like Gtk).
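
The boxed variant could be as simple as this (sketch only, assuming
CHICKEN's define-record; none of these names exist in the attached
code):

  ;; raw UTF-8 bytes plus a cached character count
  (define-record boxed-string bytes char-count)

  (define (boxed-string-length bs)      ; O(1) again
    (boxed-string-char-count bs))

  (define (boxed-string->native bs)     ; unwrap before handing to C
    (boxed-string-bytes bs))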

Another catch is that string-set! might try to overwrite a character
with one whose encoding takes a different number of bytes.  In that
case we allocate and return a new string, so the ! in string-set!
becomes just a hint (as in reverse! and delete!), and if you want to
be sure you see the updated string you should use the
(set! s (string-set! s ...)) idiom.  This could also be fixed with the
boxed string approach above.
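
Concretely (again just an illustration of the proposed behaviour):

  ;; the new character may need more bytes than the old one, so
  ;; string-set! may reallocate; always keep whatever it returns
  (set! s (string-set! s 0 (integer->char 23383)))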


How about a distinct UTF-8 string type? Conversion to and from C
calls could be handled via `define-foreign-type'.
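
Something along these lines, presumably (sketch only: this assumes
define-foreign-type's name/type/argument-converter/result-converter
form, and both converter names are hypothetical):

  (define-foreign-type utf8 c-string
    utf8-string->bytes      ; unwrap/convert before passing to C
    bytes->utf8-string)     ; rewrap results coming back from C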


I still need to finish SRFI-13 and the extra chicken string utils like
string-chop, and haven't started on SRFI-14.


Well, this already looks extremely promising. My idea about unicode
was actually to keep this separate (like bigloo's ucs2-... routines),
even though a "native" handling of unicode is cleaner from the user's
point of view than a separate, distinct datatype.

Anyway, I'm open to suggestions. I'm a complete newbie when it comes
to unicode things, so I can't say much.

I'd be delighted to help with the low-level issues, though
(or with anything else that can be done, of course).

I just hope that one day I don't have to debug Klingon source code... ;-)


cheers,
felix




