
Re: Emacs rewrite in a maintainable language


From: David Kastrup
Subject: Re: Emacs rewrite in a maintainable language
Date: Sun, 18 Oct 2015 18:56:57 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux)

"John Wiegley" <address@hidden> writes:

>>>>>> Eli Zaretskii <address@hidden> writes:
>
>> One of the major lessons Emacs development learned since Emacs 20.1
>> is that raw bytes happen as part of text (a.k.a. "strings"), and
>> therefore there's a need to support a mixture of these two in the
>> same buffer/string. I think that's something Guile should support as
>> well, as that will make it a more powerful and flexible extension
>> language, able to deal with a wider range of real-life situations.
>
> I'd like to second Eli's recommendation. In real life, encoding and
> decoding of bytes to and from characters (codepoints) is never a
> simple problem. We do need good flexibility here.

Personally I have no problem with an implementation insisting on certain
properties for its internal encoding.  But that implies that "internal
encoding" and "external UTF-8" may diverge when "external UTF-8" does
not exclusively contain valid UTF-8.

Maintaining that distinction for GUILE should not be hard: its internal
encoding is currently either Latin-1 or UCS-4 (UTF-32), so it does not
actually _have_ an internal UTF-8 for strings, even though it has a
number of functions taking UTF-8 input.

However, if "internal encoding" is not the same as "valid UTF-8"
throughout, any code that is handed internally encoded strings has to be
able to deal with the representations for invalid UTF-8.

Currently Emacs uses code points above the Unicode range for
representing non-Unicode characters from different encodings, and it
uses the 2-byte overlong byte sequences for 0-127 to represent raw bytes
128-255.  That's not cast in stone, but it is pretty efficient (I think
Python uses 3-byte surrogate sequences for raw bytes, which is somewhat
worse) and straightforward, as it keeps the basic UTF-8 coding scheme
invariants intact.
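
To make the byte counts concrete, here is a minimal Python sketch of the
mapping described above.  The function names are made up for
illustration and this is not Emacs code; it only assumes that raw byte
0x80+n is stored as the 2-byte overlong form of code point n.  The
Python comparison uses the "surrogateescape" error handler, and the
3-byte figure is what the resulting lone surrogate would occupy when
written in UTF-8 layout (via "surrogatepass"), which is not necessarily
how Python stores strings internally.

```python
def raw_byte_to_overlong(b: int) -> bytes:
    """Map a raw byte 128..255 to the 2-byte overlong form of (b - 128)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("expected a raw byte in 128..255")
    cp = b - 0x80                   # 0..127, normally a 1-byte sequence
    lead = 0xC0 | (cp >> 6)         # 0xC0 or 0xC1: never valid in strict UTF-8
    trail = 0x80 | (cp & 0x3F)      # ordinary continuation byte
    return bytes([lead, trail])


def overlong_to_raw_byte(seq: bytes) -> int:
    """Inverse mapping: recover the raw byte from its 2-byte overlong form."""
    if len(seq) != 2 or seq[0] not in (0xC0, 0xC1) or not 0x80 <= seq[1] <= 0xBF:
        raise ValueError("not a 2-byte overlong sequence for 0..127")
    cp = ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)
    return cp + 0x80


def surrogate_escape_utf8_length(b: int) -> int:
    """Length of a surrogate-escaped raw byte when laid out as UTF-8."""
    text = bytes([b]).decode("utf-8", "surrogateescape")  # lone U+DC80..U+DCFF
    return len(text.encode("utf-8", "surrogatepass"))


if __name__ == "__main__":
    for b in (0x80, 0xA9, 0xFF):
        seq = raw_byte_to_overlong(b)
        print(hex(b), seq.hex(), surrogate_escape_utf8_length(b))
```

Because the lead byte is 0xC0 or 0xC1 and the second byte is a normal
continuation byte, code that only needs to find character boundaries
can treat these sequences like any other 2-byte character.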

Of course, all of this can be done more simply using a UCS-4 (UTF-32)
representation, but the basic tradeoffs leading to Emacs using a
variable-size multibyte representation are still valid in my opinion.
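
The tradeoffs are not spelled out above, so the following Python sketch
is only an illustration of the usual ones, under my own assumptions: a
variable-size encoding takes less space for mostly-ASCII text, while a
fixed-width 32-bit encoding makes finding the Nth character trivial.
Both helper names are hypothetical.

```python
def storage_sizes(text: str) -> tuple[int, int]:
    """Return (bytes as UTF-8, bytes as fixed-width 32-bit) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


def byte_offset_of_char(data: bytes, index: int) -> int:
    """Find the byte offset of character number `index` in UTF-8 data.

    This needs a linear scan, whereas with a fixed-width 32-bit
    representation the offset would simply be 4 * index.
    """
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:     # not a continuation byte: a character starts here
            if count == index:
                return pos
            count += 1
    raise IndexError("character index out of range")


if __name__ == "__main__":
    print(storage_sizes("plain ASCII text"))
    print(byte_offset_of_char("aéb".encode("utf-8"), 2))
```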

-- 
David Kastrup


