bug-guile

bug#18520: string ports should not have an encoding


From: Mark H Weaver
Subject: bug#18520: string ports should not have an encoding
Date: Wed, 24 Sep 2014 01:30:59 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

David Kastrup <address@hidden> writes:

> In Guile 2.0, at the time a string port is opened, the value of the
> fluid %default-port-encoding is used for deciding how to encode the
> string into a byte stream, [...]

I agree that this was a mistake.  The issue is fixed on the master
branch.

> Ports fundamentally deliver characters, and so reading and writing from
> a string source/sink should not involve _any_ coding system.

David, you know as well as I that internally, there is always a coding
system.  Strings have a coding system too, even if it's UCS-4.  Emacs
uses something based on UTF-8, and I'd like Guile to do something
similar in the future.

I guess you don't like the fact that it is possible to expose the
internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
I don't see this as a problem, and arguably it's a benefit.

First I'll address the non-standard 'set-port-encoding!'.  As you say,
it doesn't even make sense on string ports, and arguably should be an
error.  So why do you care if some internal details leak out when you do
this nonsensical thing?  Admittedly, we're missing an opportunity to
report a possible bug to the user, but that's the only problem I see
here.

Regarding 'ftell' and 'seek', it's not entirely clear to me what's the
best representation of those positions.  In some situations, I guess it
would be convenient for them to count Unicode code points or string
indices.  In other situations, I could imagine it being more convenient
for them to count grapheme clusters or UTF-8 bytes.
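
To make the choice of units concrete, here is a small sketch in Python
(not Guile) that expresses one and the same position in a string in
several of the units mentioned above.  The helper name
`position_units` and the sample string are made up for illustration.
Grapheme clusters are left out because counting them requires Unicode
segmentation data that is not assumed here.

```python
def position_units(text: str, index: int) -> dict:
    """Describe the position just before text[index] in several units."""
    prefix = text[:index]
    return {
        # Number of Unicode code points before the position.
        "code_points": len(prefix),
        # Number of bytes before the position if the text is UTF-8.
        "utf8_bytes": len(prefix.encode("utf-8")),
        # Number of bytes before the position if the text is UTF-32.
        "utf32_bytes": len(prefix.encode("utf-32-le")),
    }


# "λ" takes 2 bytes in UTF-8 and "→" takes 3, so the units diverge.
sample = "λx→y"
units = position_units(sample, 3)
print(units)
```

All three numbers name the same place in the text, which is why a
standard can leave the representation of positions open.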

R6RS, the only Scheme standard that supports getting or setting file
positions, gives us complete freedom to choose our representation of
positions on textual ports.  The R6RS is explicit that they don't even
have to be integers, and if they are, they don't have to correspond to
bytes or characters.

For better or for worse, Guile's ports are fundamentally based on bytes,
and allow mixed binary and textual operations on all ports.  Sometimes
this is very helpful, for example when implementing HTTP.  I can think
of one other case where it's very helpful:

I don't know how deeply you've looked at UTF-8, but it has some unusual
properties that allow many (most?) string algorithms to be most
naturally (and efficiently) implemented by operating on bytes rather
than code points.  Much of the time, you don't even have to be aware of
the code point boundaries, which is a great savings.  Efficient lookup
tables based on bytes are also much cheaper than ones based on code
points, etc.
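
As an illustration of those properties, the following Python sketch
(again not Guile; the function names are hypothetical) searches for a
substring purely at the byte level.  Because UTF-8 continuation bytes
always have the form 10xxxxxx and lead bytes never do, a match for a
well-formed UTF-8 needle in well-formed UTF-8 text can only start on a
code point boundary, so no decoding is needed during the search.

```python
def is_code_point_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not fall inside a multi-byte UTF-8 sequence."""
    # Continuation bytes match the bit pattern 10xxxxxx.
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


def count_code_points(data: bytes) -> int:
    """Count code points by counting every non-continuation byte."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Plain byte search; returns a byte offset, or -1 if not found."""
    return haystack.find(needle)


text = "señor paga 10 €".encode("utf-8")
offset = find_utf8(text, "€".encode("utf-8"))

# Convert the byte offset to a code point index only when needed.
char_index = count_code_points(text[:offset])
print(offset, char_index, is_code_point_boundary(text, offset))
```

The search itself never looks at code point boundaries.  The byte
offset is translated to a character index only at the end, and only
because this example wants to display one.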

In fact, I intend to propose that in a future version of Guile, strings
will not only be based on UTF-8 internally, but that this fact should be
exposed in the API, allowing users to implement UTF-8 string operations
that operate on bytes not code points.  I'd also like lightweight, fast
string ports that allow access to these bytes when desired.

This leads me to believe that it's a feature, not a bug, that string
ports use UTF-8 internally, and that it's possible (via non-standard
extensions) to get access to the underlying bytes.

      Mark




