From: Ivan Raikov
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Tue, 15 Jan 2013 15:14:30 +0900

Hi Alex,

I understand your point about make-uri, but I want to provide a uri constructor that takes a UTF-8 input string and maps it in accordance with RFC 3986 / 3987.
So we still have to perform path and percent-encoding normalization steps for the ASCII portions of the string. make-uri makes no such attempts at normalization and so does not strictly follow RFC 3986.
I interpreted Section 3.1 of RFC 3987 to mean that UTF-8 characters are encoded by taking each octet and applying percent-encoding to it.
So for the string "пиле" the octets are D0 BF D0 B8 D0 BB D0 B5 and (utf8-string->uri "http://example.com/пиле") produces
#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "%D0%BF%D0%B8%D0%BB%D0%B5") query=#f fragment=#f)
For the string "삼계탕" the octets are EC 82 BC EA B3 84 ED 83 95 and (utf8-string->uri "http://example.com/삼계탕") produces
#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)
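To make the octet-by-octet reading concrete, here is a minimal sketch in Python (not CHICKEN, and not how uri-common is implemented) that percent-encodes each UTF-8 octet of a path segment. The helper name `pct_encode_utf8` is hypothetical, and passing ASCII through untouched is a simplifying assumption.

```python
def pct_encode_utf8(segment: str) -> str:
    """Percent-encode every non-ASCII character of `segment`, octet by octet."""
    out = []
    for ch in segment:
        if ord(ch) < 0x80:
            # Assumption: ASCII is passed through as-is in this sketch; a real
            # implementation would also escape reserved ASCII characters.
            out.append(ch)
        else:
            # Each UTF-8 octet of the character becomes one %XX triplet.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


print(pct_encode_utf8("пиле"))    # %D0%BF%D0%B8%D0%BB%D0%B5
print(pct_encode_utf8("삼계탕"))  # %EC%82%BC%EA%B3%84%ED%83%95
```

No bitmask is involved: the octets are taken exactly as UTF-8 produces them.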
Can you elaborate on what is broken about this? Perhaps I do not understand UTF-8 and need to apply a bitmask or something to the octets?

Percent-encoded sequences of more than one octet will not get touched by pct-decode in the current implementation, so you will not get double escaping. Percent-encoded sequences of one octet will get decoded if they fall in the "unreserved" char-set, as per RFC 3986.
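As a rough illustration of that decoding rule, the following Python sketch decodes a %XX triplet only when it denotes an unreserved ASCII character (RFC 3986, section 2.3) and leaves everything else escaped. It is not the actual pct-decode from the egg; `normalize_pct` and `UNRESERVED` are hypothetical names, and uppercasing the hex digits of the remaining escapes is an added assumption.

```python
import re

# RFC 3986, section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def normalize_pct(s: str) -> str:
    """Decode %XX only when it denotes an unreserved ASCII character."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        if ch in UNRESERVED:
            return ch
        # Leave every other octet escaped, with uppercase hex digits.
        return "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, s)


print(normalize_pct("%7Euser/%41%2f"))  # ~user/A%2F
print(normalize_pct("%EC%82%BC"))       # %EC%82%BC
```

Because octets of 0x80 and above are never in the unreserved set, the escaped UTF-8 sequences pass through unchanged, and applying the function twice gives the same result as applying it once.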
Ivan

This result looks broken. As I noted in my previous mail, the URI representation
already handles non-ASCII characters and escapes on output:

$ csi -R uri-common
#;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
#<URI-common: scheme="http" port=#f host="127.0.0.1" path=(/ "삼계탕") query=#f fragment=#f>
#;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕")))

If you put percent escapes _inside_ the internal path representation,
you'll get double escaping.

Parsing is a separate matter, and utf8-string->uri should return
the URI object without error, but with the unescaped values in
the path and query as resulting from the make-uri above.

Unrelated, the actual escaped output looks buggy - it looks like
some characters like the leading "%EC%" are getting dropped.

--
Alex

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)
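The double-escaping concern can be shown with a short Python sketch using the standard library's `urllib.parse.quote`, which stands in here for any serializer that escapes on output. This only illustrates the general hazard and makes no claim about how uri-common behaves.

```python
from urllib.parse import quote

raw_segment = "삼계탕"

# Escaping once, at output time, gives the expected percent-encoded form.
escaped_once = quote(raw_segment)
print(escaped_once)   # %EC%82%BC%EA%B3%84%ED%83%95

# If the already-escaped text is stored and escaped again on output,
# every "%" is itself encoded as "%25".
escaped_twice = quote(escaped_once)
print(escaped_twice)  # %25EC%2582%25BC%25EA%25B3%2584%25ED%2583%2595
```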