bug-guile

bug#18520: string ports should not have an encoding


From: David Kastrup
Subject: bug#18520: string ports should not have an encoding
Date: Wed, 24 Sep 2014 14:00:38 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Mark H Weaver <address@hidden> writes:

> David Kastrup <address@hidden> writes:
>
>> In Guile 2.0, at the time a string port is opened, the value of the
>> fluid %default-port-encoding is used for deciding how to encode the
>> string into a byte stream, [...]
>
> I agree that this was a mistake.  The issue is fixed on the master
> branch.

The mistake is having a string port use a different
sequence-of-character encoding than a string.

>> Ports fundamentally deliver characters, and so reading and writing
>> from a string source/sink should not involve _any_ coding system.
>
> David, you know as well as I that internally, there is always a coding
> system.  Strings have a coding system too, even if it's UCS-4.  Emacs
> uses something based on UTF-8, and I'd like Guile to do something
> similar in the future.
>
> I guess you don't like the fact that it is possible to expose the
> internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
> I don't see this as a problem, and arguably it's a benefit.

Shrug.  That arguable benefit went down in flames in Emacs 20.  It
triggered the last great migration from Emacs users to XEmacs.  It took
until Emacs 20.4 before the horrible mistake of exposing byte offsets to
the user in either strings or buffers was corrected.

You write above "Emacs uses something based on UTF-8", and it's worth
pointing out that it does so starting with Emacs 23.  Previously Emacs
used its own peculiar multibyte encoding that existed long before UTF-8.
The important thing to note is that it was _completely_ hidden from
sight from Elisp users when the Emacs 20 tribulations were over.  Emacs
was able to swap out this multibyte encoding for the Emacs 23 coding
rather transparently, and the main reason to do that was to make UTF-8 a
favored encoding regarding performance of encoding/decoding and
processing of Elisp source files.

Emacs' internal encoding is not proper UTF-8.  You can take a random
byte string, tell Emacs that it is encoded in UTF-8, and decode it into
Emacs' internal representation.  All passages that happen to be proper
uniquely represented UTF-8 will pass the transcoding unchanged, but
everything else will be transcoded into a UTF-8-like representation of
"unencodable byte".  I think Emacs uses the UTF-8 forbidden code points
from 0xd800 to 0xd880 for encoding stray bytes, or something like that.

So if you reencode the unchanged "UTF-8" Emacs uses internally, the
result will again faithfully reproduce the random byte stream.

Garbage in, _same_ garbage out.  A very important property that many of
Emacs' supported file encodings share.  Notable exceptions are various
Japanese encodings based on escape characters.
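
To make the "garbage in, _same_ garbage out" property concrete, here is
a minimal sketch in Python.  It is an analogy only: it uses Python's
"surrogateescape" error handler, which maps each undecodable byte to a
lone surrogate code point, and makes no claim about how Emacs actually
stores stray bytes.  The helper names are made up for illustration.

```python
# Sketch only: Python's "surrogateescape" handler stands in for the idea
# of a lossless "UTF-8-like" internal representation.  Not Emacs code.

def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8; each undecodable byte becomes a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: escaped bytes come back unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 ("caf\u00e9") mixed with bytes that are not valid UTF-8.
garbage = b"caf\xc3\xa9 \xff\xfe stray \x80 bytes"

text = decode_lossless(garbage)      # indexed per character
roundtrip = encode_lossless(text)    # identical to the input bytes
```

The well-formed passage decodes to ordinary characters, the stray bytes
survive as placeholders, and re-encoding reproduces the input exactly.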

At any rate, unless you are using explicit conversions like
string-as-unibyte or _encoding_ to Emacs' internal representation (it is
available as a named coding system), the representation is not exposed.
Strings are indexed per character, and buffers (which are at their heart
random-access string ports) are indexed per character.

Emacs has both unibyte and multibyte strings and unibyte and multibyte
buffers, and unibyte strings and buffers are the source for decoding and
the target for encoding into multibyte strings and buffers.  XEmacs does
not have unibyte strings/buffers, so a lot of string internals do not
need to make the distinction.  GUILE could probably get away without
unibyte strings as well because it has bytevectors.  This would imply
that if you wanted to do stuff akin to string operations on unibyte
strings, you'd have to first convert bytevectors to multibyte strings,
do your operations, convert back.  XEmacs chose _not_ to have unibyte
strings (and the corresponding complications to support both in the
primitives), Emacs chose to have them.  I think both approaches are
defensible.

Since GUILE presents itself as an extension language and since strings
will need to get passed in and out of extension languages all the time,
the implementation cost of offering a low-cost unibyte string is
probably even more defensible than with Elisp where Elisp is the main
processing language.

> First I'll address the non-standard 'set-port-encoding!'.  As you say,
> it doesn't even make sense on string ports, and arguably should be an
> error.  So why do you care if some internal details leak out when you
> do this nonsensical thing?  Admittedly, we're missing an opportunity
> to report a possible bug to the user, but that's the only problem I
> see here.
>
> Regarding 'ftell' and 'seek', it's not entirely clear to me what's the
> best representation of those positions.  In some situations, I guess
> it would be convenient for them to count unicode code points or string
> indices.  In other situations, I could imagine it being more
> convenient for them to count grapheme clusters or UTF-8 bytes.
>
> R6RS, the only Scheme standard that supports getting or setting file
> positions, gives us complete freedom to choose our representation of
> positions on textual ports.  The R6RS is explicit that they don't even
> have to be integers, and if they are, they don't have to correspond to
> bytes or characters.

R6RS gives you the freedom to match your semantics to your
implementation.  String ports are strings-in-progress (and Emacs buffers
are strings-in-progress on steroids), so it makes sense to match the
fseek/ftell semantics of string ports to those of strings and the
implementation to those of strings.  You don't have anything to gain
from converting characters to bytes and back just because you can.
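
The difference can be sketched in Python, again purely as an analogy
and not as a description of Guile's implementation.  io.StringIO is a
character sink with no encoding at all, while io.TextIOWrapper over
io.BytesIO pushes every character through a codec.  The function names
and the choice of "latin-1" are assumptions made for the example.

```python
import io


def write_via_string_port(text: str) -> str:
    """Character sink: no encoding is involved at any point."""
    port = io.StringIO()
    port.write(text)
    return port.getvalue()


def write_via_encoded_port(text: str, encoding: str) -> str:
    """Byte-backed sink: text is encoded on write, decoded on read-back."""
    raw = io.BytesIO()
    port = io.TextIOWrapper(raw, encoding=encoding, errors="replace",
                            newline="")
    port.write(text)
    port.flush()                      # push buffered text into `raw`
    return raw.getvalue().decode(encoding)


sample = "\u03bb-calculus"            # GREEK SMALL LETTER LAMDA + ASCII

plain = write_via_string_port(sample)
mangled = write_via_encoded_port(sample, "latin-1")
```

The string sink returns exactly what was written.  The byte-backed sink
returns "?-calculus", because the lambda cannot be represented in the
encoding that happened to be in effect.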

> For better or for worse, Guile's ports are fundamentally based on
> bytes,

Seriously?  The whole point of this issue was that fundamentally basing
GUILE's string ports on bytes is for worse.

> and allow mixed binary and textual operations on all ports.

I'll go out on a limb here and state "they don't".  They work with bytes
(either located on file or in some internally generated or consumed byte
vector) and they input/output characters on their Scheme side, and you
can change the en/decoding system with which characters are put into
the stream or consumed.  Their external side is identical to their
internal side, and the Scheme/character/string side is fundamentally
different.  By changing the port encoding, you can change the conversion
between Scheme on the one side and internal/external on the other.  All
operations are binary on the internal side, and textual on the Scheme
side.  That there are encodings which are less costly does not
fundamentally change this.
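
A tiny Python sketch of that split, offered as an illustration under
the assumption that a "port encoding" behaves like a codec name: the
bytes on the internal/external side stay fixed, and only the conversion
to characters changes.

```python
# The byte side: U+00E9 (e with acute accent) encoded as UTF-8.
raw = "\u00e9".encode("utf-8")

# The character side, under two different "port encodings".
as_utf8 = raw.decode("utf-8")        # one character
as_latin1 = raw.decode("latin-1")    # two characters, same bytes
```

Nothing binary happens on the character side and nothing textual on the
byte side; the encoding is only the bridge between them.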

> Sometimes this is very helpful, for example when implementing HTTP.  I
> can think of one other case where it's very helpful:
>
> I don't know how deeply you've looked at UTF-8,

It is a somewhat safe bet that a person who is the head maintainer of an
application conversing in UTF-8 while using GUILE-1.8 in its internals
has had some basic amount of exposure to UTF-8.  In general, the working
assumption "David just has little clue about computing" is rarely
helpful for dismissing matters since David tends to have picked up
tidbits occasionally since he started computing on systems where
lowercase letters already needed a multi-sextet representation in their
60-bit words.

So it is a reasonably safe bet that when David has some problems with
matters, chances are that a non-negligible percentage of other users
will not fare significantly better, so it is a somewhat relevant
indicator of what to avoid.

> but it has some unusual properties that allow many (most?) string
> algorithms to be most naturally (and efficiently) implemented by
> operating on bytes rather than code points.  Much of the time, you
> don't even have to be aware of the code point boundaries, which is a
> great savings.  Efficient lookup tables based on bytes are also much
> cheaper than ones based on code points, etc.

That's all very nice but totally irrelevant for this issue.  If you like
UTF-8, by all means base the internal string representation of GUILE on
it.  It comes at a cost since strings in Scheme are writable (and there
are more operations for doing so than in Elisp) and indexed by
character.  Emacs has paid this cost: I think the basic speed of Emacs
dropped by a factor of 2 when indexing was moved from bytes to
characters around Emacs 20.2 or similar.
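
Where that cost comes from can be sketched in Python.  This is a naive
illustration, not what Emacs or Guile do internally (real
implementations cache positions): with a UTF-8 byte representation,
finding the Nth character means scanning for character starts instead
of doing pointer arithmetic.  The function name is hypothetical.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Byte offset of character `index` in well-formed UTF-8 `data`.

    Continuation bytes look like 10xxxxxx; every other byte starts a
    character, so we count those until we reach `index`.
    """
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:     # ASCII or lead byte: char starts
            if seen == index:
                return offset
            seen += 1
    if seen == index:
        return len(data)              # position just past the last char
    raise IndexError(index)


# "na\u00efve caf\u00e9": 10 characters, 12 bytes in UTF-8.
encoded = "na\u00efve caf\u00e9".encode("utf-8")

offset_of_v = byte_offset_of_char(encoded, 3)       # 'v' follows 2-byte i
offset_of_last = byte_offset_of_char(encoded, 9)    # the final e-acute
```

Each lookup is linear in the string length, which is the price of
character indexing on top of a variable-width encoding.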

But this issue is about not using different internal coding and exposed
interfaces for strings and string ports.  Whatever internal string
representation you choose, it does not make sense to pick a different
representation and indexing for string ports.

> In fact, I intend to propose that in a future version of Guile,
> strings will not only be based on UTF-8 internally, but that this fact
> should be exposed in the API, allowing users to implement UTF-8 string
> operations that operate on bytes not code points.

This experiment has been tried and crashed and burnt with the initial
MULE versions in Emacs 20.  Current versions _do_ offer conversion-less
reinterpretations string-as-unibyte and string-as-multibyte and offer
working with either string type.  As explained, that comes at the cost
of having to make all primitives able to work with either.  They are
actually rarely used by application level programmers, so most
applications do not have this as a porting problem between Emacs and
XEmacs (XEmacs has only multibyte strings).

Personally, I'd consider that worth the cost in the case of GUILE.
While XEmacs gets along without this addition, it seems important for
efficient passing of data in and out of GUILE.  It would also make sense
to distinguish between multibyte (internal form of UTF-8, anything may
happen if it is not properly formed) and external UTF-8 (reading/writing
it uses a conversion process turning all illegal UTF-8 bytes into some
reproducible representation).

> I'd also like lightweight, fast string ports that allow access to
> these bytes when desired.

Any string port that does not involve encoding/decoding will be
lightweight and fast, lighter and faster than any implementation having
to code/decode gratuitously.  Which is one of the points of this issue,
even though I am more concerned with the conceptual cost than the
runtime cost.  But both have an impact.

> This leads me to believe that it's a feature, not a bug, that string
> ports use UTF-8 internally, and that it's possible (via non-standard
> extensions) to get access to the underlying bytes.

Getting confused about bytes and characters and introducing unnecessary
conversions is not a feature.  Even if you at one time use a UTF-8
based string representation, working with external UTF-8 will involve
encoding/decoding processes.  Forcing a string port to encode/decode
during operation will remain expensive.  Exposing string internals
beyond quite special-purpose functions will be hard to deal with.

All those lessons have already been learnt with Emacs.  If you want to
relearn them from scratch, the available developer power will not make
basing Emacs on GUILE realistic in the next 10 years: Emacs
fundamentally operates with texts.  Too many reliability or efficiency
problems doing that (or having to implement them as foreign datatypes
altogether) will not make Guilemacs acceptable.

So even in cases where multiple strategies are feasible, it may make
sense to lean towards Emacs' choices.  One choice that has served Emacs
well is to hide its internal encoding system well from the external
ones.  That way its switch to an internal coding system based on UTF-8
affected almost no existing Elisp packages, and the programming model
was conceptually clean.

-- 
David Kastrup




