emacs-devel

Re: bug#23750: 25.0.95; bug in url-retrieve or json.el


From: Eli Zaretskii
Subject: Re: bug#23750: 25.0.95; bug in url-retrieve or json.el
Date: Fri, 02 Dec 2016 17:45:02 +0200

> From: Yuri Khan <address@hidden>
> Date: Fri, 2 Dec 2016 21:53:16 +0700
> Cc: Dmitry Gutov <address@hidden>, Philipp Stephani <address@hidden>, 
> It is really unfortunate that we talk about ASCII strings, unibyte
> strings, multibyte strings, as if that was a meaningful
> classification.

It is meaningful when you work on Emacs code.

> The real dichotomy is between text (aka strings) and MIME-type-tagged
> byte arrays.

That might be so in the context of HTTP, but in general, byte arrays
("raw bytes" in Emacs parlance) are not limited to MIME types.
Moreover, there are very frequent use cases where Emacs code needs to
work with a byte array whose type is unknown, or even cannot be known
at all, because it doesn't come with any meta-data of any kind.

> In order to send a string over HTTP, one must encode it
> to a byte array and tag it as "text/plain; charset=utf-8" or
> "text/html; charset=utf-8" or application/json (no charset parameter
> because json must always be encoded in one of utf-* for transmission).
> Conversely, a byte array received over HTTP can, MIME type allowing,
> be decoded into a string.
> 
> The fact that there exist strings for which encoding and decoding are
> identity transforms should be regarded only as an implementation
> detail.

You are talking generalities here, whereas this discussion is about
Emacs-specific internal issues.  In Emacs, a plain-ASCII string is
indistinguishable from a "byte array" whose bytes are all below 128.
They have the same representation.  To muddy the water even more, a
plain-ASCII string can be "marked" as multibyte (again, internally),
but it should be clear that such a "mark" has no meaning at all for
ASCII text.
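
A minimal Emacs Lisp sketch of this point, assuming it is evaluated in
a recent Emacs (for instance via M-: or in *scratch*).  The function
name my-ascii-flags-demo is hypothetical and exists only for
illustration; the primitives it calls are standard.

```elisp
;; Hypothetical demo function, for illustration only.
(defun my-ascii-flags-demo ()
  "Compare a unibyte ASCII string with its multibyte-marked twin."
  (let* ((uni (string-to-unibyte "GET / HTTP/1.1"))
         ;; Same characters, but marked as multibyte internally.
         (multi (string-to-multibyte uni)))
    (list (multibyte-string-p uni)      ; is UNI marked multibyte?
          (multibyte-string-p multi)    ; is MULTI marked multibyte?
          (string= uni multi)           ; same text?
          ;; Same number of bytes in the internal representation?
          (= (string-bytes uni) (string-bytes multi)))))

(my-ascii-flags-demo)   ; => (nil t t t)
```

Only the flag differs; the text and its bytes are the same.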

From the Lisp application POV, whether a plain-ASCII string it
receives or processes is marked as unibyte or multibyte is entirely
random.  So if some ASCII text is accepted by an Emacs API involved in
sending HTTP requests, while an identical ASCII string is rejected,
it could be a source of surprises and bug reports.
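
One way a caller could sidestep the surprise is to normalize the
string before handing it to the networking code.  The sketch below is
only an illustration under stated assumptions: my-http-body-bytes is a
hypothetical helper, not part of url.el, and it assumes UTF-8 is the
encoding wanted on the wire.

```elisp
;; Hypothetical helper, not part of url.el; assumes UTF-8 on the wire.
(defun my-http-body-bytes (body)
  "Return BODY as a unibyte string suitable for sending as raw bytes.
A multibyte BODY is encoded as UTF-8; a unibyte BODY is assumed to be
already encoded and is returned unchanged."
  (if (multibyte-string-p body)
      (encode-coding-string body 'utf-8)
    body))

;; An ASCII string marked multibyte comes back unibyte, same text.
(multibyte-string-p (my-http-body-bytes (string-to-multibyte "abc")))
;; => nil
```

With such a helper, the two "identical" ASCII strings end up in the
same form, and non-ASCII text is encoded explicitly rather than by
accident.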

That is the core of the issues discussed here.

> Attempts by libraries and frameworks to silently DTRT for this
> subset lead to applications neglecting to properly encode or tag
> strings, leading, in turn, to breakage in presence of multilingual
> text.

Based on Emacs experience of dealing with multibyte text and its
encoding/decoding, the conclusion was that it is better to silently
DTRT where we can be sure we know how.  Educating users by harsh
measures, such as signaling errors where Emacs could easily proceed,
is generally not welcome.  We will see if this case is any different.


