Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.


From: Peter Bex
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Tue, 15 Jan 2013 11:48:56 +0100
User-agent: Mutt/1.4.2.3i

On Tue, Jan 15, 2013 at 07:30:07PM +0900, Alex Shinn wrote:
> Right, I'm familiar with the evil standards :)  I'm also hoping that we can
> have some basic compatibility between Chicken's uri module and Chibi's
> (and whatever R7RS WG2 comes up with).

That would be nice indeed.

> It seems to me the sane thing to do is represent URIs unencoded
> internally, which can be generated directly with make-uri or decoded
> on parsing.

That cannot be done in general.  If you decode something like %2F, that
will wreak havoc with path-structured URIs.  The same will happen with
other types of "special" characters; you need to be able to distinguish
between the "special" character as-is and encoded.
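
To illustrate the path case, here is a minimal Python sketch (standard
library only; this is not uri-generic's code, and the URL and helper
names are made up for the example).  It compares splitting the path on
"/" before decoding with decoding before splitting:

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URL: the second segment contains an encoded slash.
url = "http://example.com/files/a%2Fb/c"

def split_then_decode(u):
    # Split on the literal "/" delimiters first, then decode each segment.
    return [unquote(seg) for seg in urlsplit(u).path.split("/")[1:]]

def decode_then_split(u):
    # Decode the whole path first; the encoded slash becomes a delimiter.
    return unquote(urlsplit(u).path).split("/")[1:]

print(split_then_decode(url))
print(decode_then_split(url))
```

Splitting first keeps "a/b" as one segment; decoding first turns it
into two segments, and the original structure cannot be recovered.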

These special characters are called "reserved" in the BNF.  As you can
see, the question mark, equals sign and ampersand are in there.
For urlencoded query strings, these *cannot* be decoded, because
then you can't distinguish between

http://calc.example.com?bool-expr=x%26y%3D1
and 
http://calc.example.com?bool-expr=x&y=1

The former should be decoded in uri-common to the alist
((bool-expr . "x&y=1")) and the latter to ((bool-expr . "x") (y . "1")).
By fully decoding all reserved characters in uri-generic, you drop
important information.
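
The same point as a small Python sketch, using urllib.parse from the
standard library as a stand-in (it is not uri-common, and the helper
names are invented for the example).  The first URL is the encoded
form whose value decodes to "x&y=1":

```python
from urllib.parse import urlsplit, parse_qsl, unquote

encoded = "http://calc.example.com?bool-expr=x%26y%3D1"
literal = "http://calc.example.com?bool-expr=x&y=1"

def query_alist(u):
    # Split on "&" and "=" first, then decode each key and value.
    return parse_qsl(urlsplit(u).query)

def query_alist_decode_first(u):
    # Decode everything up front, then split: the distinction is lost.
    return parse_qsl(unquote(urlsplit(u).query))

print(query_alist(encoded))
print(query_alist(literal))
print(query_alist_decode_first(encoded))
```

With the reserved characters left encoded until after splitting, the
two URLs give different alists.  Decoding first makes the encoded URL
parse exactly like the literal one.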

All unreserved characters are already fully decoded by uri-generic.
What remains is decoding reserved characters, like the ampersand
above, inside the individual query string components.  uri-common does
that after it has split the form-encoded string.
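
A rough Python sketch of "decode only the unreserved characters",
assuming the RFC 3986 unreserved set (letters, digits, "-", ".", "_",
"~").  The function name is hypothetical and this is not how
uri-generic is actually written; it only shows the idea:

```python
import re
import string

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(s):
    """Decode %XX escapes only when they stand for an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Leave reserved (and any other) escapes exactly as they were.
        return ch if ch in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, s)

print(decode_unreserved("x%26y%3D%41"))
```

Here %41 becomes "A" because it is unreserved, while %26 and %3D stay
encoded so a later form-decoding step can still tell them apart from
literal "&" and "=".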

> The decoding might be schema-specific, although
> really the only difference is the space-to-+ and query args encoding.

No, the conversion to a friendly alist is specific to uri-common.

> I was confused because the uri-generic change Ivan suggests
> seems to be putting encoded characters directly in the representation,
> whereas uri-common is encoding only on output.

I don't understand this either.  I'm at work, so maybe it's just due to
a lack of complete attention.

> [It also looks like the uri-common encoding is broken - why were bytes
> getting lost?]

Probably because it doesn't correctly deal with UTF-8 in the decoding of
URLencoded form data.  I'll need a proper test case and some time to
look into it.
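
As a general illustration of how UTF-8 can go wrong here (a Python
sketch with a made-up input; it is not a diagnosis of the uri-common
bug, which still needs a test case): percent-escapes encode octets, so
a multi-byte character must be decoded from the whole octet sequence,
not one escape at a time.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical input: the three UTF-8 octets of U+3042, percent-encoded.
escaped = "%E3%81%82"

# Collect all octets first, then decode them together as UTF-8.
as_utf8 = unquote_to_bytes(escaped).decode("utf-8")

# Treating each octet as its own character yields three wrong characters.
per_octet = unquote(escaped, encoding="latin-1")

print(len(as_utf8), len(per_octet))
```

The first form gives a single character; the second gives three
unrelated ones.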

> Finally, regarding parsing I still don't understand why %AB is decoded
> into the corresponding octet but %AB%CD is not?

Unsure.

Cheers,
Peter
-- 
http://sjamaan.ath.cx


