emacs-devel

Re: Emacs rewrite in a maintainable language


From: Stephen J. Turnbull
Subject: Re: Emacs rewrite in a maintainable language
Date: Mon, 19 Oct 2015 02:46:26 +0900

David Kastrup writes:

 > Personally I have no problem with an implementation insisting on
 > certain properties for its internal encoding.  But that implies
 > that "internal encoding" and "external UTF-8" may diverge when
 > "external UTF-8" does not exclusively contain valid UTF-8.

Then the external data shouldn't be called "UTF-8" in discussions like
this one.  The problem of data that is not valid for the presumed
encoding is not limited to UTF-8, Unicode, or even to text.  It just
happens that we have good solutions (not limited to ritual suicide)
for the text stream case.

Also, we should remember that Unicode is a wire protocol.  It's very
useful to adapt the formats defined by Unicode for constructing and
parsing internal and external data -- that can be very efficient.  But
we also need to have a strict-conformance option for I/O that is
declared to be Unicode, and that should probably be the default.
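
As an illustration of what a strict-conformance mode looks like in
practice, here is a minimal Python 3 sketch.  The helper name
is_strict_utf8 is hypothetical; the only real API used is
bytes.decode with the "strict" error handler, which is Python's
default and rejects stray bytes, overlong forms, and encoded
surrogates.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # "strict" is the default error handler; spelled out here for clarity.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "plain ASCII": b"abc",
    "valid two-byte sequence": "\u00e9".encode("utf-8"),
    "stray byte 0xFF": b"abc\xff",
    "overlong encoding of NUL": b"\xc0\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
}

for label, data in samples.items():
    print(f"{label}: {is_strict_utf8(data)}")
```

Only the first two samples are accepted; the rest are not UTF-8 in
the strict sense, which is the kind of data the remainder of this
message is about.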

 > However, if "internal encoding" is not the same as "valid UTF-8"
 > throughout, it means that code called with it has to be able to
 > deal with the representations for invalid UTF-8.

Emacs certainly can deal, since it has a 'binary' encoding and can
represent that internally.  But that's awfully inconvenient.
Something like Emacs's current implementation, Markus Kuhn's UTF-8b,
or Python's PEP 383 is really required for Emacs implementations.
(Does anybody remember that awful mail format of Win2k beta's version
of Outlook Express, where the HTML tags were encoded in ASCII and the
element content in little-endian UTF-16?)
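
For readers who have not used PEP 383, a small Python 3 sketch of
the idea follows.  It uses the standard "surrogateescape" error
handler; the sample bytes are made up for illustration.  Each
undecodable byte 0xNN is mapped to the lone surrogate U+DCNN, so the
original byte stream can be reproduced exactly on output.

```python
# Valid UTF-8 for "café " followed by two bytes that are not valid UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe"

# Decode leniently: invalid bytes become lone low surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in text[-2:]])

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)

# A strict encoder refuses the escaped bytes, so they cannot leak out
# as something that claims to be UTF-8.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The internal string is therefore not valid Unicode text, but it
round-trips, and the strict path still rejects it.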

 > [Emacs's internal text representation is] not cast into stone but
 > pretty efficient (I think Python uses 3-byte surrogate sequences
 > for raw bytes, somewhat worse)

No.  Python uses a wide-char representation.  In Python 2, it's 2
bytes on most non-glibc platforms, and 4 bytes on glibc.  In Python 3
with PEP 393 support, valid ISO-8859-1 text (even if decoded from
another external encoding) is represented in one byte, valid BMP text
(optionally with support for invalid "rawbytes", internally encoded as
lone trailing surrogates) in two bytes, and text containing characters
from the astral planes in four bytes (again with optional support for
invalid rawbytes).
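
The PEP 393 widths can be observed from Python itself.  The sketch
below assumes CPython 3.3 or later (other implementations may store
strings differently), and bytes_per_char is a hypothetical helper
that estimates per-character storage by comparing the sizes of two
strings built from the same character.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate storage per character, cancelling out the object header."""
    short = ch * 1
    long = ch * 1001
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000


samples = {
    "ASCII 'a'": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "rawbyte as lone surrogate U+DC80": "\udc80",
    "astral U+1F600": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {bytes_per_char(ch)} byte(s) per character")
```

Note that the width is chosen per string from its widest character,
so an escaped rawbyte forces at least the two-byte form.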

 > and straightforward as it keeps the basic UTF-8 coding scheme
 > invariants intact.
 > 
 > Of course, all of this can be done simpler using an UCS-32
 > representation, but the basic tradeoffs leading to Emacs using a
 > variable-size multibyte representation are still valid in my
 > opinion.

Seems reasonable to me.  So far Python with PEP 393 has been pretty
successful, but since emoticons live in the astral planes, I suspect
it may not be the best representation for the web and phones -- one
smiley in ASCII text will quadruple the needed string storage.  I
don't see a good reason to change Emacs's representation at this
point.
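
To make the smiley cost concrete, here is a rough Python 3 sketch,
again assuming CPython with PEP 393.  The UTF-8 byte count is used
only as a stand-in for a variable-width multibyte representation; it
is not a measurement of Emacs's own buffer storage.

```python
import sys

ascii_text = "a" * 1000
with_smiley = ascii_text + "\U0001F600"  # one astral-plane character

# Fixed-width-per-string storage: the whole string widens to 4 bytes/char.
growth = sys.getsizeof(with_smiley) / sys.getsizeof(ascii_text)
print(f"in-memory growth factor: {growth:.2f}")

# Variable-width storage: only the smiley itself costs 4 bytes.
utf8_plain = len(ascii_text.encode("utf-8"))
utf8_smiley = len(with_smiley.encode("utf-8"))
print(f"UTF-8 bytes: {utf8_plain} -> {utf8_smiley}")
```

The growth factor comes out slightly under 4 because the fixed
object header is included in both sizes.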



