Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

From: Ivan Raikov
Date: Wed, 23 Jan 2013 15:29:01 +0900

Hi Peter,

I don't think uri-generic silently mangles UTF-8 input; it just returns #f. That said, raising an exception instead seems like a good idea.

I have not yet had the chance to thoroughly test the UTF-8 mapping constructor, but will try to do this during the weekend.

Ivan

On Thu, Jan 17, 2013 at 5:45 PM, Peter Bex <address@hidden> wrote:

On Thu, Jan 17, 2013 at 09:35:36AM +0900, Ivan Raikov wrote:
> Hi Peter,
>
> I think that allowing raw UTF-8 sequences in uri-generic breaks
> compatibility with RFC 3986. In other words, if you construct a URI with a
> UTF-8 sequence that happens to include reserved ASCII characters, those
> ASCII characters will not get escaped, and you could potentially be sending
> an invalid URI to a legacy system that does not understand UTF-8.

Hi Ivan,

I agree with your assessment, but the way it currently silently mangles
input isn't ideal. I think it would be good if all constructors raised
an exception when receiving octets with the high bit set (this is
non-ASCII, which means it falls outside the scope of RFC 3986 so it's
acceptable to raise an exception). What are your thoughts on this?
If we do this, of course the error message should include a pointer to
the new UTF conversion API so people know what to do.
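
To make the proposed check concrete, here is a minimal sketch in Python (not Scheme, and not uri-generic's actual code) of a constructor guard that rejects any octet with the high bit set. The name `require_ascii` and the wording of the error message are assumptions made up for illustration.

```python
def require_ascii(uri_text: str) -> str:
    """Return uri_text unchanged, or raise if it contains non-ASCII octets.

    Hypothetical helper: it only illustrates the "raise instead of
    returning a failure value" behaviour discussed above.
    """
    octets = uri_text.encode("utf-8")
    for index, octet in enumerate(octets):
        if octet & 0x80:  # high bit set, so outside the ASCII range
            raise ValueError(
                f"non-ASCII octet 0x{octet:02X} at byte offset {index}; "
                "convert the input with an IRI-to-URI mapping first"
            )
    return uri_text
```

The error message is the place to point callers at whatever conversion procedure the library ends up providing.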

> My proposed solution is to include a UTF-8 aware constructor to
> uri-generic and prevent percent decoding of UTF-8 sequences. I believe that
> this solution is compatible with the IRI to URI mapping scheme described in
> Section 3.1 of RFC 3987, but indeed I need to extend the uri-generic test
> suite with more UTF-8 examples to ensure that nothing is broken. I think
> that any solution will have to give the user choice whether to use ASCII or
> UTF-8, and not just default to UTF-8.
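
As a rough illustration of the mapping in Section 3.1 of RFC 3987, the Python sketch below percent-encodes each UTF-8 octet of every non-ASCII character and leaves ASCII characters, including existing percent-escapes, untouched. The function name `iri_to_uri` is an assumption, and the sketch deliberately skips the optional IDNA handling of host names described in the RFC, so it is not a complete implementation.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by percent-encoding non-ASCII characters.

    Simplified sketch: every character is treated the same way,
    whichever URI component it appears in.
    """
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII is passed through as-is; escaping reserved ASCII
            # characters remains the job of the ordinary URI rules.
            pieces.append(ch)
        else:
            # Encode the character as UTF-8, then emit %HH per octet.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)
```

For example, `iri_to_uri("http://example.com/caf\u00e9")` yields `http://example.com/caf%C3%A9`, while a pure-ASCII input comes back unchanged.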

This seems like a good compromise. Unfortunately it means the API will
grow quite a bit and make it less easy to use. I'll need to consider
what to do with http-client's "implicit" URI conversion though
(it accepts either strings or URI objects). I guess for now I'll keep
it the way it is. If people need UTF-8 they can use the new conversion
procedures. Maybe later I can change it; this should not cause any
breakage (unless talking to legacy systems, but those don't accept UTF-8
anyway, so if you have UTF-8 input there's a problem regardless).

Cheers,
Peter
--
http://sjamaan.ath.cx
