bug-guile
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#30076: [PATCH] web: Recognize JSON content type as text.


From: Mark H Weaver
Subject: bug#30076: [PATCH] web: Recognize JSON content type as text.
Date: Tue, 30 Jan 2018 22:31:04 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)

Hi Arun,

Arun Isaac <address@hidden> writes:
> * module/web/response.scm (text-content-type?): Recognize JSON content
>   type as text.

While this would seem reasonable at first glance, it seems to me that
this will result in JSON texts with non-ASCII characters being
mishandled in many cases.

Within Guile, 'text-content-type?' is currently used in two places:

* 'decode-response-body' in (web client), and
* 'response-body-port' in (web response).

In both places, if 'text-content-type?' returns true, the encoding of
the response is assumed to be "ISO-8859-1" if not otherwise specified by
an explicit 'charset' parameter.  This is what RFC 2616 specifies for
text/plain, although RFC 6657 would change the default to US-ASCII, as
it was in RFC 2046, and maybe we should look into that.

However, things are quite different for the application/json MIME type,
as specified in RFCs 4627 and 7159.  Those RFCs specify that JSON text
"SHALL" (i.e. MUST) be encoded in Unicode (UTF-8, UTF-16 or UTF-32),
that the default encoding is UTF-8, and furthermore that no charset
parameter is defined for application/json.

So, we can expect at least some conforming implementations to omit the
'charset' parameter, and yet in that case we must assume that the
encoding is Unicode, and most definitely not ISO-8859-1.

RFC 4627 makes the additional interesting observation (in section 3,
"encoding") that since the first two characters of JSON text will always
be ASCII, and since UTF-8/UTF-16/UTF-32 are the only valid encodings for
JSON text, we can reliably determine the encoding by looking at the
pattern of nul bytes in the first four octets:

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Given that any of these encodings above are possible, and that there is
no 'charset' parameter defined for "application/json", it seems to me
that we have no choice but to be prepared to auto-detect the encoding,
as described in RFC 4627 section 3 if the 'charset' parameter is
missing.

What do you think?

      Mark





reply via email to

[Prev in Thread] Current Thread [Next in Thread]