Re: [Lynx-dev] Unicode-marking, &c
From: David Woolley
Subject: Re: [Lynx-dev] Unicode-marking, &c
Date: Fri, 27 Feb 2009 08:15:46 +0000
User-agent: Thunderbird 2.0.0.19 (X11/20081209)
wrote:
> I saw queer little characters begin some webpages, and upon seeing such
> on local webpages rendered right here, I suspect that they are magic numbers
> that now mark in-Unicode-or-UTF8-encoded files, and Lynx misses this.
They are byte order marks and DO NOT indicate that the file is not ISO
8859/1. You need to know that the file is UTF-* before you start trying
to interpret these codes.
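
To illustrate that point, here is a minimal Python sketch of a client that
only treats the leading bytes as a byte order mark once it already knows the
content is UTF-*. The helper name strip_bom and the small table of charsets
are assumptions made for this example; this is not Lynx code.

```python
import codecs

# Hypothetical table for this sketch: the BOM byte sequences that are
# meaningful for each declared UTF-* charset.
_BOMS = {
    "utf-8": (codecs.BOM_UTF8,),                           # EF BB BF
    "utf-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),  # FF FE / FE FF
}


def strip_bom(data: bytes, declared_charset: str) -> bytes:
    """Drop a leading BOM only if the declared charset is a UTF-* one."""
    for bom in _BOMS.get(declared_charset.strip().lower(), ()):
        if data.startswith(bom):
            return data[len(bom):]
    # For ISO 8859/1 and anything else, the bytes are ordinary content.
    return data
```

With a declared charset of ISO-8859-1 the same three bytes are left alone,
because nothing has said they are a mark rather than text.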
> Maybe when such really are downloaded it is the server's duty to strip the
> page of the magic numbers and turn them into other forms of naming the
> file's alphabet, but when the file is local Lynx is stuck with it.
It is the server's duty to send character set indications (which are
mandatory in HTML4) that correctly represent the character encoding
used. How they identify the character set used is a local issue, not
one for standardisation.
What you are probably hitting is the tendency of big name browsers,
particularly IE, to interpret pages as what they think the author meant,
rather than what they have actually said. The most famous case of this
is probably that IE will interpret text/plain as HTML, if it looks like
HTML, even if the author's intent was that it be seen as the source
code. That is a direct violation of the standards, and one that MS refuse to change.
> Here under Windows there are constant references to the character that
> begins a 16-bit-wide-character file (FF FE) or UTF-8 file (EF BB BF).
These are all valid printable characters in ISO 8859/x. Although
somewhat unlikely combinations, they are not reserved sequences.
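
As a quick check, the following Python sketch decodes those byte sequences as
ISO 8859/1 and reports the Unicode names of the resulting characters. The
function name latin1_names is made up for this example, and only part 1 of
ISO 8859 is tested here.

```python
import unicodedata


def latin1_names(data: bytes) -> list:
    """Decode bytes as ISO 8859/1 and return each character's Unicode name."""
    text = data.decode("iso-8859-1")  # every byte value is valid in 8859/1
    return [unicodedata.name(ch) for ch in text]


# The UTF-8 mark (EF BB BF) read as ISO 8859/1 text.
utf8_mark = latin1_names(bytes([0xEF, 0xBB, 0xBF]))
# The little-endian 16-bit mark (FF FE) read as ISO 8859/1 text.
utf16_mark = latin1_names(bytes([0xFF, 0xFE]))

for name in utf8_mark + utf16_mark:
    print(name)
```

Each byte maps to an ordinary letter or punctuation mark, so a page that
really is ISO 8859/1 could legitimately start with them.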
> Has anything been done about this?
It's a problem of sloppy authoring, much like the sending of GB2312
without a charset, or even with a windows-1252 one. In particular, if
neither the HTTP nor the meta content-type specifies a charset, but the
HTML version is 4 or higher, the page is invalid, and if they specify
the wrong charset, that charset is the one that Lynx should use.
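
A rough Python sketch of that rule follows. The function name
declared_charset is hypothetical, and the ordering (HTTP header first, then
meta) is an assumption taken from the priority list in HTML 4.01 rather than
from anything Lynx-specific.

```python
from typing import Optional


def declared_charset(http_charset: Optional[str],
                     meta_charset: Optional[str]) -> Optional[str]:
    """Return the charset the page declares, or None if it declares none.

    The content itself is deliberately not inspected: a wrong declaration
    is still the declaration, and leading bytes are not consulted.
    """
    # Assumption: the HTTP header takes priority over the meta element.
    for candidate in (http_charset, meta_charset):
        if candidate and candidate.strip():
            return candidate.strip().lower()
    # Neither source names a charset: for HTML 4 or higher the page is invalid.
    return None
```

A GB2312 page sent as windows-1252 therefore comes back as "windows-1252",
and a page with no declaration at all comes back as None, leaving the caller
to decide how to treat the invalid page.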
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.