Re: [Lynx-dev] Unicode-marking, &c

From:	David Woolley
Subject:	Re: [Lynx-dev] Unicode-marking, &c
Date:	Fri, 27 Feb 2009 08:15:46 +0000
User-agent:	Thunderbird 2.0.0.19 (X11/20081209)

 wrote:

I saw queer little characters begin some webpages, and upon seeing such
on local webpages rendered right here, I suspect that they are magic numbers
that now mark in-Unicode-or-UTF8-encoded files, and Lynx misses this.

They are byte order marks and DO NOT indicate that the file is not ISO 8859/1. You need to know that the file is UTF-* before you start trying to interpret these codes.

Maybe when such really are downloaded it is the server's duty to strip the
page of the magic numbers and turn them into other forms of naming the file's
alphabet, but when the file is local Lynx is stuck with it.

It is the server's duty to send character set indications (which are mandatory in HTML4) that correctly represent the character encoding used. How they identify the character set used is a local issue, not one for standardisation.

What you are probably hitting is the tendency of big name browsers, particularly IE, to interpret pages as what they think the author meant, rather than what they have actually said. The most famous case of this is probably that IE will interpret text/plain as HTML, if it looks like HTML, even if the author's intent was that it be seen as the source code. That is a direct violation of the standards that MS refuse to change.


Here under Windows there are constant references to the character that
begins a 16-bit-wide-character file (FF FE) or UTF-8 file (EF BB BF).

These are all valid printable characters in ISO 8859/x. Although somewhat unlikely combinations, they are not reserved sequences.
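To make the point concrete, here is a small Python sketch (not anything taken from Lynx itself) showing that the same bytes are ordinary printable text when read as ISO 8859-1, and only act as a byte order mark once you already know the data is UTF-8. The sample page content and the helper name as_latin1 are made up for illustration.

```python
import codecs

# The two byte sequences mentioned above, as defined by Python's codecs module.
UTF8_BOM = codecs.BOM_UTF8          # b'\xef\xbb\xbf'
UTF16_LE_BOM = codecs.BOM_UTF16_LE  # b'\xff\xfe'


def as_latin1(data: bytes) -> str:
    """Decode bytes as ISO 8859-1, where every byte maps to one character."""
    return data.decode("iso-8859-1")


# Hypothetical page content, prefixed with the UTF-8 byte order mark.
page = UTF8_BOM + b"<p>hi</p>"

# Read as ISO 8859-1, the leading bytes are three ordinary printable characters.
latin1_view = as_latin1(page)

# Only when the data is already known to be UTF-8 can the mark be dropped;
# the "utf-8-sig" codec strips a leading BOM while decoding.
utf8_view = page.decode("utf-8-sig")

print(repr(latin1_view))
print(repr(utf8_view))
```

Nothing in the bytes themselves says which of the two readings is the intended one, which is why the declared charset has to decide.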


Has anything been done about this?

It's a problem of sloppy authoring, much like the sending of GB2312 without a charset, or even with a windows-1252 one. In particular, if neither the HTTP nor the meta content-type specify a charset, but the HTML version is 4 or higher, the page is invalid, and if they specify the wrong charset, that charset is the one that Lynx should use.
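The rule can be sketched roughly as follows. This is an illustration in Python, not Lynx's actual code: the function names charset_param and pick_charset are hypothetical, and the ordering (HTTP header before the meta content-type) follows the priority given in the HTML 4 specification. The declared charset is returned even if it is wrong for the bytes, and no attempt is made to guess from the content.

```python
from email.message import Message
from typing import Optional


def charset_param(content_type: Optional[str]) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def pick_charset(http_content_type: Optional[str],
                 meta_content_type: Optional[str]) -> Optional[str]:
    """Pick the declared charset: HTTP header first, then the meta content-type.

    Returns None when neither declares one, which for HTML 4 or higher
    means the page is invalid rather than something to be guessed at.
    """
    return charset_param(http_content_type) or charset_param(meta_content_type)


# A page mislabelled as windows-1252 is still treated as windows-1252.
print(pick_charset("text/html; charset=windows-1252", None))
# No HTTP charset, so the meta declaration is used.
print(pick_charset("text/html", "text/html; charset=GB2312"))
# Nothing declared at all.
print(pick_charset("text/html", None))
```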


--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
