
Re: [Bug-wget] Save 3 byte utf8 url


From: L Walsh
Subject: Re: [Bug-wget] Save 3 byte utf8 url
Date: Sat, 16 Feb 2013 17:02:14 -0800
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24 Mnenhy/0.7.6.666



Ángel González wrote:

    Or can it not do UTF-8 at all?

latin1 is going the way of the dodo... most sites still use it, but
HTML5 is supposed to be UTF-8.
http://www.whatwg.org/specs/web-apps/current-work/#urls refers to http://url.spec.whatwg.org/ and it does set the encoding by default to utf-8. But I think that refers to /encoding/ a character, not to figuring out which encoding was used in a URL.
---
        Aren't URLs usually referenced by getting them from
within a webpage?  So _if_ the source of the webpage was UTF-8
encoded, wouldn't the URLs also be encoded that way?

        I notice in FF, I can choose the messed-up version or the
real version in 'about:config' with two settings:

network.standard-url.encode.query-utf8  (default is false, but I set it
to TRUE; I have yet to encounter a website that DOESN'T understand
UTF-8)

and the other -- the one that gives you real characters vs. %%-escapes:

network.standard-url.escape-utf8  (default=true, meaning do %%-escapes);
changing that to false will send UTF-8 'over the wire' (and change what
you 'Copy' if you copy the URL from the address bar).


Example:  With the latter setting at its default, if I type in
http://www.last.fm/music/梶浦由記

I'll get taken to a page where the addr-bar LOOKS that way
(assuming the 1st setting, above, is TRUE), but if I try to
cut/paste, I get
"http://www.last.fm/music/%E6%A2%B6%E6%B5%A6%E7%94%B1%E8%A8%98".

However, if I have the 2nd setting at its non-default 'FALSE' (meaning
don't encode UTF-8 as %%), then going to that page and doing a
cut/paste gives me: http://www.last.fm/music/梶浦由記.
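
For what it's worth, that %% form is just each UTF-8 byte written
as %XX.  A minimal C sketch of the escaping step (my own
illustration -- not wget's actual code, and real URL escaping also
has to handle reserved ASCII characters):

    #include <stdio.h>

    /* Percent-encode every high-bit (non-ASCII) byte of a UTF-8
     * string; plain ASCII passes through untouched. */
    static void percent_encode(const char *in, char *out)
    {
        static const char hex[] = "0123456789ABCDEF";
        for (; *in; in++) {
            unsigned char c = (unsigned char)*in;
            if (c < 0x80) {
                *out++ = (char)c;
            } else {
                *out++ = '%';           /* high-bit byte -> %XX */
                *out++ = hex[c >> 4];
                *out++ = hex[c & 0x0F];
            }
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[256];
        percent_encode("http://www.last.fm/music/"
                       "\xE6\xA2\xB6\xE6\xB5\xA6\xE7\x94\xB1\xE8\xA8\x98",
                       buf);
        puts(buf);  /* .../music/%E6%A2%B6%E6%B5%A6%E7%94%B1%E8%A8%98 */
        return 0;
    }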

If I save that page from my browser on Windows 7,
the file is saved correctly (as viewed from either Explorer
or a Cygwin X11 window, like a Terminal).  But if I view it
from an old DOS-compat-style window like the one that comes up with
'cmd.exe', there I get '????' as it can't display UTF-8.


Unfortunately, I know of no native Microsoft win32 command-line
program that will display the chars correctly, even though you CAN
set the terminal / MS console for UTF-8 with
'mode[.com] con: cp select=65001'.  But MS's driver for code page
65001 is (IMO) deliberately broken to prevent people from
using UTF-8 (which was the chosen standard for Unicode over MS's
preferred UCS-2 solution, which they often 'rebrand', usually
falsely, as UTF-16).  A large number of their legacy programs that
don't natively understand UTF-8 don't work beyond the Basic
Multilingual Plane -- i.e. they are only UCS-2 compatible -- only
16 bits for the character.

Most don't *really* handle UTF-16, which takes two 16-bit code units
(a surrogate pair) to represent characters beyond the BMP and cover
the full Unicode standard.
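
For completeness: a program can also switch its own console to code
page 65001 via the Win32 API instead of mode.com -- a minimal
sketch (illustration only; it doesn't cure the code-page-65001
driver problems above, and rendering still depends on the console
font):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Switch the attached console's output code page to
         * UTF-8 (65001), then emit raw UTF-8 bytes. */
        SetConsoleOutputCP(CP_UTF8);
        printf("\xE6\xA2\xB6\xE6\xB5\xA6\xE7\x94\xB1\xE8\xA8\x98\n");
        return 0;
    }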




We could assume it's the same charset as the document, but what do we do with documents that have no charset (due to misconfiguration, or because they are scripts, images...)?
---
        User choice or option?  -- I think you are supposed to
try a UTF-8 decode on the object first: if the document ISN'T
UTF-8, the decode will fail, but the reverse is not true -- if you
try to decode as latin1, all codes from 0x20-0xff are valid display
codes, so that decode can't fail.  With UTF-8, any byte over 0x7f
has to be part of a multi-byte sequence (2-4 bytes) in which every
byte has the high bit set: the lead byte starts with 0b11, and all
of the 'continuation' bytes start with 0b10 in conforming
(standard) UTF-8.
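
That heuristic is easy to make concrete; here is a minimal sketch
of a structural UTF-8 check (my own illustration, not wget code --
it checks lead/continuation structure but not overlongs or
surrogates):

    #include <stdbool.h>
    #include <stddef.h>

    /* Return true if buf[0..len) parses as well-formed UTF-8
     * structure.  Latin1 text with accented characters will almost
     * always fail this, which is why "try UTF-8 first" works. */
    static bool looks_like_utf8(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char c = buf[i++];
            size_t follow;
            if (c < 0x80)                follow = 0; /* ASCII    */
            else if ((c & 0xE0) == 0xC0) follow = 1; /* 110xxxxx */
            else if ((c & 0xF0) == 0xE0) follow = 2; /* 1110xxxx */
            else if ((c & 0xF8) == 0xF0) follow = 3; /* 11110xxx */
            else return false;  /* stray continuation or bad lead */
            while (follow--)
                if (i >= len || (buf[i++] & 0xC0) != 0x80)
                    return false;  /* continuation must be 0b10xxxxxx */
        }
        return true;
    }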


Seems easier to treat it as utf-8 if it contains utf-8 sequences. That still needs a transformation of filenames, though.
---
        On Linux, if their locale is UTF-8, then no.  Or even on
Windows under Cygwin -- if their locale is UTF-8, then no.  But if
they have an 8-bit locale, you'd have to use %-encoding to get
everything; there is no guarantee that the UTF-8 filenames they
download can be recoded into any 8-bit character set.  But on
Windows I'd decode to UTF-16 and use that -- since at least the
filename will look correct if they browse it in a desktop
application, or if they use an X11 terminal like the one from the
Cygwin collection....
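
On Windows that decode is a single API call; here is a minimal
sketch (the wrapper name is my own, and error handling and buffer
sizing are trimmed):

    #include <windows.h>

    /* Convert a UTF-8 filename to UTF-16 so the wide-char file APIs
     * (CreateFileW, _wfopen, ...) -- and hence Explorer -- see the
     * real characters.  Returns wide chars written, 0 on failure. */
    static int utf8_filename_to_utf16(const char *utf8,
                                      wchar_t *out, int out_len)
    {
        return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8, -1, out, out_len);
    }

Calling it first with out = NULL and out_len = 0 returns the
required buffer length, which is the usual two-pass idiom with
this API.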




If it found "González" in a file, would it be able to save it correctly?

wget is always able to download the URLs; the only difference is whether they "look nice" on your system.
---
Or whether they can be saved at all -- some google addresses are longer
than the filename length limit.


A URL like http://example.org/González in utf-8 would be encoded as http://example.org/Gonz%c3%a1lez, so wget would think those are the characters Ã (0xC3) and ¡ (0xA1), saving it "as is". So if my filenames are utf-8 (eg. Linux) I will see it as González; if they are latin1 (eg. Windows, using windows-1252) I will see it as GonzÃ¡lez.
----
        Oh joy!  (*sigh*)
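
The two readings are easy to demonstrate; a tiny sketch
(illustration only) that prints the raw decoded bytes and lets the
terminal's charset decide what they look like:

    #include <stdio.h>

    int main(void)
    {
        /* %-decoding "Gonz%c3%a1lez" yields these raw bytes.  The
         * pair 0xC3 0xA1 is one character in UTF-8 (á) but two in
         * latin1/windows-1252 (Ã followed by ¡). */
        const unsigned char name[] =
            { 'G','o','n','z', 0xC3, 0xA1, 'l','e','z', 0 };
        puts((const char *)name);  /* "González" on a UTF-8 terminal,
                                    * "GonzÃ¡lez" on a latin1 one */
        return 0;
    }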




