guile-user

Re: I18N/M17N?


From: Jim Blandy
Subject: Re: I18N/M17N?
Date: 26 Apr 2001 10:23:20 -0500

Masao Uebayashi <address@hidden> writes:
> Just a curiosity, but is there any plan for Guile to be I18N'ed, or
> better M17N'ed? Or importing MULE features from Emacs???

I was supposed to work on this, but I don't spend time on Guile any
more (I need to resign as a maintainer), so I encourage anyone who's
interested to pick up the ball here.  Read guile-core/doc/mbapi.texi.
There's even some code, on a branch called jimb_mb_branch_1.


In the long run, Guile is supposed to replace the Emacs Lisp
interpreter in Emacs.  For that integration to work in a useful way,
Guile strings and Emacs Lisp strings need to be the same objects, and
in general, Guile and Emacs should follow the same M17N approach.

Or at least, that used to be the consensus.  There are a bunch of (I
suspect inevitably) controversial design decisions we made a while
back which the current contributors to Guile don't like.  I don't know
how that's going to be sorted out.

For example:

To avoid wasting space, you'd like to be clever about your
representation of strings, so that users see the memory usage they've
come to expect when dealing with ASCII text, but can work with
non-ASCII text through the same old string functions they already
know.  None of this `wide char variant' garbage.

There are many approaches, of which we've really considered two:

a) Use UTF-8, or some other variable-width encoding.  This is what
   Emacs Lisp, Tcl, and Perl do.

b) Provide several different string representations, one allocating one
   byte per character, another allocating two bytes per character, etc.
   Storing a character whose encoding requires two bytes in a
   one-byte-per-character string would automatically promote the
   string to a two-byte-per-character string.

Using UTF-8 or some other variable-width encoding makes indexing into
a string more expensive.  However, a lot of C code that works with
strings needs no changes to support UTF-8.  ("A lot" is certainly a
fuzzy term; this is from my limited experience with M17N.)
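
To make the trade-off in approach a) concrete, here is a minimal sketch
in C.  The helper names (utf8_offset, utf8_length, find_substring) are
made up for illustration and are not Guile or Emacs functions, and the
code assumes well-formed, NUL-terminated UTF-8 input.  Finding the n-th
character means scanning from the start of the string, while substring
search is just strstr on the raw bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the n-th character (0-based) in a
   well-formed, NUL-terminated UTF-8 string, or -1 if the string has
   fewer than n+1 characters.  Cost is linear in the offset found.  */
static long utf8_offset(const char *s, size_t n)
{
    assert(s != NULL);
    size_t chars = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
           starts a new character.  */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (chars == n)
                return (long)i;
            chars++;
        }
    }
    return -1;
}

/* Hypothetical helper: number of characters (not bytes) in the string.  */
static size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t chars = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Substring search needs no UTF-8 awareness: a well-formed needle
   starts with a lead byte or an ASCII byte, so it cannot match in the
   middle of a character.  Returns a byte offset, or -1 if not found.  */
static long find_substring(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1;
}
```

Note that find_substring returns a byte offset, not a character index;
turning one into the other is again a linear scan.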

Having several different string representations allows indexing to
remain constant time.  However, it requires C code that works with
strings to either 1) always go through accessors to reach the string
contents, or 2) check the width and write different code to access
different widths.  In other words, all C code that works with strings
would need to change.
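
Approach b) can be sketched the same way.  Everything here is
hypothetical: the wstr type, its function names, and the choice of
1, 2, and 4 bytes as the available widths are assumptions made for the
example, not the layout Guile used.  Indexing is a constant-time array
access, but all access goes through wstr_ref and wstr_set, and a store
may reallocate the whole string to promote it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical multi-width string: every character takes `width` bytes.  */
typedef struct {
    size_t   length;  /* number of characters */
    unsigned width;   /* bytes per character: 1, 2, or 4 (assumed set) */
    void    *data;    /* length * width bytes */
} wstr;

/* Create a zero-filled string of `length` one-byte characters.  */
static wstr *wstr_make(size_t length)
{
    wstr *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->length = length;
    s->width = 1;
    s->data = calloc(length ? length : 1, 1);
    if (!s->data) { free(s); return NULL; }
    return s;
}

/* Constant-time read, whatever the current width is.  */
static uint32_t wstr_ref(const wstr *s, size_t i)
{
    assert(s != NULL && i < s->length);
    switch (s->width) {
    case 1:  return ((const uint8_t *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

/* Smallest supported width that can hold character code c.  */
static unsigned width_for(uint32_t c)
{
    return c <= 0xFF ? 1 : c <= 0xFFFF ? 2 : 4;
}

/* Copy the contents into a wider buffer (new_width is 2 or 4).  */
static int wstr_promote(wstr *s, unsigned new_width)
{
    void *wider = calloc(s->length ? s->length : 1, new_width);
    if (!wider) return -1;
    for (size_t i = 0; i < s->length; i++) {
        uint32_t c = wstr_ref(s, i);
        if (new_width == 2) ((uint16_t *)wider)[i] = (uint16_t)c;
        else                ((uint32_t *)wider)[i] = c;
    }
    free(s->data);
    s->data = wider;
    s->width = new_width;
    return 0;
}

/* Store c at index i, promoting the string first if c does not fit.
   Returns 0 on success, -1 if the promotion could not allocate.  */
static int wstr_set(wstr *s, size_t i, uint32_t c)
{
    assert(s != NULL && i < s->length);
    unsigned need = width_for(c);
    if (need > s->width && wstr_promote(s, need) != 0)
        return -1;
    switch (s->width) {
    case 1:  ((uint8_t *)s->data)[i] = (uint8_t)c;   break;
    case 2:  ((uint16_t *)s->data)[i] = (uint16_t)c; break;
    default: ((uint32_t *)s->data)[i] = c;           break;
    }
    return 0;
}

static void wstr_free(wstr *s)
{
    if (s) { free(s->data); free(s); }
}
```

C code that grabs s->data and treats it as a char * is only correct
while width is 1, which is exactly the check-the-width-and-bail pattern
that option 2) invites.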

I'm trying to be balanced here, but let me present the argument for my
preferred implementation:

    A long time ago, Guile used a uniform-width representation like b)
    for multilingual strings.  While this approach has technical
    advantages, in our experience it has serious, um,
    ``intercultural'' problems.  What tended to happen was that
    programmers would check the width of the string, signal an error
    on wider strings, and then write C code that used char * the way
    they were accustomed to.  In the midst of working on some other
    real problem, they didn't want to have to think about M17N (even
    though it's usually trivial).  And if they were using English or a
    Western European language, they never noticed a problem.

    The observation behind a) is that a lot of multilingual code
    written this way looks very much like ordinary C code.  In fact,
    for copying, concatenation, searching for ASCII characters,
    searching for substrings of either ASCII or non-ASCII characters,
    and some other operations, your code is completely unchanged.

    And in many of the cases where your C code needs to be aware of
    the encoding, your code needs to be different anyway just because
    you're using Unicode.  You shouldn't use the old <ctype.h>
    functions, for example; you need to use a multilingual table.

    But fundamentally, I can't argue (and don't believe) that
    "everything will be easy if we do this, and hard if we do that,"
    so the issue is controversial.  Each approach has its advantages
    and disadvantages, and which one you prefer depends on how you
    imagine the process of writing code to work with each one.

So, that's the state of the union.



