[Chicken-users] multilingual fowl

From:	Alex Shinn
Subject:	[Chicken-users] multilingual fowl
Date:	Tue, 28 Sep 2004 04:54:12 -0500
User-agent:	Wanderlust/2.10.1 (Watching The Wheels) SEMI/1.14.6 (Maruoka) FLIM/1.14.6 (Marutamachi) APEL/10.6 Emacs/21.3 (i386-pc-linux-gnu) MULE/5.0 (SAKAKI)

Hi,

I discovered to my surprise that chicken allows characters to be any
16-bit value:

#;7> (char->integer (integer->char #xFFFF))
65535
#;8> (char->integer (integer->char #x10000))
0

This means you can hack the most commonly used Unicode codepoints into
the characters (more on this below).  If 21 bits were allowed we could
hack in all of the Unicode codepoints, but on a quick scan of the
source I couldn't find where this limit comes from.  Is it
intentional?  Other primitives seem to disagree on whether or not
character values can exceed 8 bits.  For instance, char-alphabetic?
only checks the lowest 8 bits, but char-upcase always seems correct
even when the lowest 8 bits would be a lowercase character:

#;1> (char-alphabetic? (integer->char 23383))
#f
#;2> (char-alphabetic? (integer->char 25991))
#t
#;3> (char->integer (char-upcase (integer->char 1633)))
1633
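
To make the arithmetic behind these observations explicit, here is a
small sketch in portable Scheme.  It assumes that out-of-range values
are simply reduced modulo a power of two; that is inferred from the
transcripts, not taken from the chicken source, and low-bits is a
made-up helper name.

```scheme
;; Assumption: values are reduced modulo 2^16 (characters) or 2^8
;; (the low byte).  This is a guess based on the transcripts above.
(define (low-bits n width)          ; keep the lowest WIDTH bits of N
  (modulo n (expt 2 width)))

(display (low-bits #x10000 16))     ; prints 0, like integer->char above
(newline)
(display (low-bits 1633 8))         ; prints 97, the code of #\a
(newline)
(display (low-bits #x10FFFF 21))    ; prints 1114111: the highest
(newline)                           ; codepoint fits in 21 bits
```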

write-char and (write char) both fail for such characters, and while
the former is difficult to fix, the latter could be alleviated with an
escape code for characters like #\x{NNNN}.
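
As an illustration of what such an escape could look like on output,
the sketch below formats a character code in the #\x{NNNN} style.  The
name char-code->escape is hypothetical, it takes the integer code
rather than a character so that it does not depend on the range of
integer->char, and the case of the hex digits is whatever
number->string produces.

```scheme
;; Hypothetical helper: render a character code in the proposed
;; #\x{NNNN} external representation.
(define (char-code->escape code)
  (string-append "#\\x{" (number->string code 16) "}"))

(display (char-code->escape 1633))   ; prints #\x{661}
(newline)
```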

So it seems like this may be unintentional, but just by patching up a
few loose ends like (write char) you pave the way for people to build
i18n support on top of the core chicken.

A very simple and efficient strategy that has been used successfully
in other languages is to represent strings internally as UTF-8.
Whenever this gets mentioned a lot of people just stop, gasp
"non-constant time STRING-REF?!" and dismiss the idea altogether.
However, two points to take into consideration are 1) this could be
made a purely optional unit, and 2) it's a lot faster than you think.
It's also easy - attached is a rough initial implementation which
still needs work but is functional and efficient, and the hard parts
are already done.  Mostly it needs documentation, but I'll give a
brief idea.

Users who want to take advantage of this would simply (require 'utf8)
at the start of their file.  Strings would still be strings, but
certain string primitives would be redefined to assume they were in a
utf8 encoding.  Raw access to the bytes making up the string would
still be available via procedures like string-byte-ref, but for
general coding you would just use the string procedures you're
familiar with.  An alternative would be to reverse it, so that the
traditional string-ref is unmodified but there is a new
string-utf8-ref procedure.  However, it's more convenient to just
program with the normal string procedures all the time and decide when
you need to (require 'utf8) and when your app doesn't need i18n.

First the bad news.  With this implementation the following procedures
take a performance hit from O(1) to O(N):

  string-ref
  string-set!
  string-length
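
To show where the O(N) comes from, here is a rough sketch (not the
code in the attached utf8.scm).  It works on a list of byte values
instead of a real string so that it runs in any R5RS Scheme; the
procedure names are made up for the example and valid UTF-8 is
assumed.

```scheme
;; A continuation byte has the bit pattern 10xxxxxx, i.e. 128..191.
(define (continuation-byte? b)
  (and (>= b 128) (< b 192)))

;; Character count: every byte that is not a continuation byte starts
;; a character, so the whole sequence has to be scanned.
(define (utf8-length bytes)
  (let loop ((bs bytes) (n 0))
    (cond ((null? bs) n)
          ((continuation-byte? (car bs)) (loop (cdr bs) n))
          (else (loop (cdr bs) (+ n 1))))))

;; Byte offset of character number K: walk past K lead bytes.
;; If K is past the end, the total byte count is returned.
(define (utf8-char-offset bytes k)
  (let loop ((bs bytes) (off 0) (seen 0))
    (cond ((null? bs) off)
          ((continuation-byte? (car bs)) (loop (cdr bs) (+ off 1) seen))
          ((= seen k) off)
          (else (loop (cdr bs) (+ off 1) (+ seen 1))))))

;; 97 = "a", 195 169 = e-acute, 98 = "b"
(display (utf8-length '(97 195 169 98)))        ; prints 3
(newline)
(display (utf8-char-offset '(97 195 169 98) 2)) ; prints 3
(newline)
```

A string-ref would find the offset this way and then decode the bytes
found there, which is why it can no longer be constant time.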

string-length could be returned to O(1) if instead of using the native
chicken strings directly we boxed them in records and kept track of
the length.  This, however, would make C access more complicated
because we couldn't just pass the records to C functions expecting
strings.  The current approach works automatically when the C code
doesn't care about the string encoding (or assumes it's utf-8 as in
many modern libraries like Gtk).

Another catch is that string-set! might try to overwrite a character
with a character whose encoding has a different byte length.  In this
case we allocate and return a new string, so the ! in string-set!
becomes just a hint (as in reverse! and delete!), and if you really
want to update the string in place you should use the
(set! s (string-set! s ...)) idiom.  This could also be fixed with the
above boxed string approach.
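
The length mismatch is easy to see in a small sketch.  Again this is
an illustration over lists of byte values, not the attached code: the
names are invented, valid UTF-8 and an in-range index are assumed, and
for simplicity it always builds a new sequence, whereas the real
procedure only has to do so when the encoded lengths differ.

```scheme
;; Encode one codepoint as a list of UTF-8 byte values.
(define (utf8-encode cp)
  (define (cont n) (+ 128 (modulo n 64)))       ; 10xxxxxx
  (cond ((< cp 128) (list cp))
        ((< cp 2048)
         (list (+ 192 (quotient cp 64)) (cont cp)))
        ((< cp 65536)
         (list (+ 224 (quotient cp 4096))
               (cont (quotient cp 64))
               (cont cp)))
        (else
         (list (+ 240 (quotient cp 262144))
               (cont (quotient cp 4096))
               (cont (quotient cp 64))
               (cont cp)))))

;; Number of bytes in the character introduced by lead byte B.
(define (utf8-lead-length b)
  (cond ((< b 128) 1) ((< b 224) 2) ((< b 240) 3) (else 4)))

(define (take-bytes bs n)
  (if (= n 0) '() (cons (car bs) (take-bytes (cdr bs) (- n 1)))))

(define (drop-bytes bs n)
  (if (= n 0) bs (drop-bytes (cdr bs) (- n 1))))

;; Return a NEW byte list in which character number K is replaced by
;; codepoint CP.  The result may be longer or shorter than the input.
(define (utf8-set bytes k cp)
  (let ((len (utf8-lead-length (car bytes))))
    (if (= k 0)
        (append (utf8-encode cp) (drop-bytes bytes len))
        (append (take-bytes bytes len)
                (utf8-set (drop-bytes bytes len) (- k 1) cp)))))

;; Replacing "b" (1 byte) in "abc" by codepoint 233 (2 bytes):
(display (utf8-set '(97 98 99) 1 233))   ; prints (97 195 169 99)
(newline)
```

Three bytes went in and four came out, so the original storage cannot
hold the result.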

As for the performance loss of these procedures, the truth is it
doesn't really matter.  Gauche has the same performance
characteristics (except that strings are internally boxed and
string-length is fast), and many
successful string-based apps such as WiLiKi have been implemented in
it.  I have an Emacs-style text-buffer complete with markers that
makes no use of string-set! and only string-ref at offset 0.  What you
need to do is stop thinking of strings as C-style character arrays and
use higher level approaches like string-fold, string-map and string
ports.  With these procedures the asymptotic running time is the same
with only a tiny amount of overhead.
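
Here is a sketch of why a fold keeps the same asymptotic cost: it
decodes each character as it passes over it, so the whole traversal
touches every byte a constant number of times.  It uses a list of byte
values so that it runs in any R5RS Scheme, the names are invented for
the example (this utf8-fold is not SRFI-13's string-fold), and valid
UTF-8 is assumed.

```scheme
;; Number of bytes in the character introduced by lead byte B.
(define (utf8-lead-length b)
  (cond ((< b 128) 1) ((< b 224) 2) ((< b 240) 3) (else 4)))

(define (drop-bytes bs n)
  (if (= n 0) bs (drop-bytes (cdr bs) (- n 1))))

;; Decode the character at the head of BS, which is LEN bytes long.
(define (utf8-decode-head bs len)
  (let loop ((rest (cdr bs))
             (i (- len 1))
             (cp (case len
                   ((1) (car bs))
                   ((2) (- (car bs) 192))
                   ((3) (- (car bs) 224))
                   (else (- (car bs) 240)))))
    (if (= i 0)
        cp
        (loop (cdr rest)
              (- i 1)
              (+ (* cp 64) (- (car rest) 128))))))

;; (KONS codepoint accumulator) is called once per character, from
;; left to right.
(define (utf8-fold kons knil bytes)
  (let loop ((bs bytes) (acc knil))
    (if (null? bs)
        acc
        (let ((len (utf8-lead-length (car bs))))
          (loop (drop-bytes bs len)
                (kons (utf8-decode-head bs len) acc))))))

;; "a", codepoint 233 and codepoint 23383, collected in reverse order:
(display (utf8-fold cons '() '(97 195 169 229 173 151)))
(newline)                            ; prints (23383 233 97)
```

By contrast, a loop that calls a scanning string-ref for every index
from 0 to N-1 rescans from the start each time and ends up quadratic.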

Further, the operations that you need to be fastest aren't mutations
but basic concatenation, displaying, searching (including regexps),
and passing to C.  The following procedures remain unchanged for utf-8
strings:

  string-append
  display
  write
  read
  read-line

and R5RS and SRFI-13 searching and comparison procedures
string-compare, string-< etc., string-{contains,prefix,suffix}{,-ci}
are all unchanged except for some adjustment if the start/end
parameters are provided.
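
The reason searching can stay byte-based is that UTF-8 is
self-synchronizing: a validly encoded needle can only match at a
character boundary of a validly encoded haystack.  The sketch below
shows a plain byte search plus the kind of index adjustment I mean,
here converting a byte offset back into a character index.  It is an
illustration over lists of byte values with invented names, not the
attached code.

```scheme
;; Is NEEDLE a prefix of HAY?  Both are lists of byte values.
(define (byte-prefix? needle hay)
  (cond ((null? needle) #t)
        ((null? hay) #f)
        ((= (car needle) (car hay))
         (byte-prefix? (cdr needle) (cdr hay)))
        (else #f)))

;; Byte offset of the first occurrence of NEEDLE in HAY, or #f.
;; Nothing here knows about UTF-8.
(define (byte-search needle hay)
  (let loop ((h hay) (off 0))
    (cond ((byte-prefix? needle h) off)
          ((null? h) #f)
          (else (loop (cdr h) (+ off 1))))))

;; Convert a byte offset into a character index by counting the bytes
;; before it that are not continuation bytes (128..191).
(define (byte-offset->char-index bytes off)
  (let loop ((bs bytes) (i 0) (n 0))
    (cond ((or (= i off) (null? bs)) n)
          ((and (>= (car bs) 128) (< (car bs) 192))
           (loop (cdr bs) (+ i 1) n))
          (else (loop (cdr bs) (+ i 1) (+ n 1))))))

;; Haystack: codepoint 233, "a", codepoint 23383.  Needle: 23383.
(define hay '(195 169 97 229 173 151))
(define off (byte-search '(229 173 151) hay))
(display off)                                  ; prints 3
(newline)
(display (byte-offset->char-index hay off))    ; prints 2
(newline)
```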

Regexps are also overloaded to first translate patterns from a utf-8
encoding to a raw byte encoding before passing to the compiler.  This
is a clever translation such that for 99% of the cases the performance
will be *identical* to the non-utf-8 version, even if you're using
patterns containing utf-8 strings.  There are only two things that can
slow this down:

  1) . not followed by a *

     This is a lot more rare than you may think.  Usually you want
     ranges of characters in between some kind of delimiter.  Even in
     this case, the performance hit is almost negligible, as it
     expands to a simple byte pattern for a utf-8 character.

  2) utf-8 character classes

     These could be optimized further, but for now it translates into
     a non-capturing group like (?:char1|char2|...).  It works
     correctly with ASCII ranges intermixed in the char class, but
     Unicode ranges are not yet supported.
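
To give a feel for the two slow cases, here is a sketch of the kind of
output such a translation might produce.  Both definitions are
assumptions made for illustration and may differ from what the
attached code generates: utf8-any-char is one plausible byte pattern
for a single UTF-8 character, and chars->alternation is an invented
helper that expects a non-empty list of already-encoded characters
and does not escape regexp metacharacters.

```scheme
;; A loose byte-level pattern for one UTF-8 encoded character.  It
;; checks lead and continuation byte ranges but does not reject
;; overlong forms, and unlike . it also matches a newline.
(define utf8-any-char
  (string-append "(?:[\\x00-\\x7F]"
                 "|[\\xC0-\\xDF][\\x80-\\xBF]"
                 "|[\\xE0-\\xEF][\\x80-\\xBF]{2}"
                 "|[\\xF0-\\xF7][\\x80-\\xBF]{3})"))

;; Join the members of a character class into (?:c1|c2|...).
(define (chars->alternation chars)
  (let loop ((cs (cdr chars))
             (acc (string-append "(?:" (car chars))))
    (if (null? cs)
        (string-append acc ")")
        (loop (cdr cs) (string-append acc "|" (car cs))))))

(display (chars->alternation '("char1" "char2" "char3")))
(newline)                       ; prints (?:char1|char2|char3)
```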

I still need to finish SRFI-13 and the extra chicken string utils like
string-chop, and haven't started on SRFI-14.

-- 
Alex

utf8.scm
Description: Binary data
