emacs-devel

Re: 23.0.60; [nxml] BOM and utf-8


From: Stephen J. Turnbull
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Fri, 23 May 2008 02:34:46 +0900

address@hidden writes:

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
 > > > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
 > > > | begin with the Byte Order Mark [...]
 > > > |        [...]  XML processors MUST be able to use this character to
 > > > | differentiate between UTF-8 and UTF-16 encoded documents.
 > > > `----
 > 
 > ...and how are the XML processors supposed to achieve that? Is there a
 > second variant of BOM, indicating UTF-8?

Well, note that the BOM is three octets in UTF-8 but only two in
UTF-16.  Dead giveaway, there.

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
 > > > | If an XML entity is in a file, the Byte-Order Mark and encoding
 > > > | declaration are used (if present) to determine the character encoding.
 > > > `----
 > 
 > ...or is rather the absence of a BOM the indicator for UTF-8?

Absence of a BOM is *an* indicator for UTF-8, as is presence of the
BOM encoded in UTF-8 as the first 3 octets of a file.
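
To make the octet-count point concrete, here is a minimal sketch in Python of telling the encodings apart from the leading octets alone. The function name sniff_bom is made up for illustration, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE) and the XML encoding declaration, so treat it as an assumption-laden illustration, not as what any XML processor actually does.

```python
import codecs

def sniff_bom(data: bytes):
    """Guess an encoding from the leading octets only.

    Returns (encoding_guess, bom_length_in_octets).
    Hypothetical helper; UTF-32 and the XML declaration are not considered.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF, three octets
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF, two octets
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE, two octets
        return "utf-16-le", 2
    # No BOM: *an* indicator for UTF-8, not proof of it.
    return "utf-8", 0
```

The no-BOM branch returns a guess only; a higher-level protocol or an encoding declaration should override it.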

 > Am I completely whacko, or are they?

Neither.  You live in a relatively sane world, they live in a world
which contains the sovereign nations of Japan and Microsoft.

 > But that would be "in the middle of a file", not at the beginning, as
 > our case is.
 > 
 > I'd appreciate any insights.

If there is a higher level protocol which tells you what to do about
the BOM, obey it.  This is the case for XML files.

Otherwise, if U+FEFF occurs at the beginning of a file which otherwise
seems to be valid Unicode, ignore it and start processing with the
next character.  (In the two-octet and four-octet versions, "valid"
means the file contains no instances of 0xFFFF and only one endianness
of 0xFEFF; in UTF-8, it means the file doesn't violate the constraints
of UTF-8 and doesn't contain any sequences that decode to U+FFFF or
U+FFFE.)
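
The rule of thumb above could be sketched roughly as follows in Python. Both function names are hypothetical, only the UTF-8 and two-octet cases are covered, and surrogate pairing in UTF-16 is not checked; this is one reading of the heuristic under those assumptions, not a complete validator.

```python
import struct

def strip_leading_bom_utf8(data: bytes) -> str:
    """Decode UTF-8 strictly and drop one leading U+FEFF if present.

    Hypothetical helper. Raises UnicodeDecodeError on invalid UTF-8 and
    ValueError if the text contains U+FFFF or U+FFFE.
    """
    text = data.decode("utf-8")  # plain "utf-8" keeps the BOM as U+FEFF
    if "\uffff" in text or "\ufffe" in text:
        raise ValueError("contains U+FFFF or U+FFFE; not treating as text")
    # Only the first character is dropped; a U+FEFF in the middle of
    # the file is left alone.
    return text[1:] if text.startswith("\ufeff") else text

def utf16_units_look_sane(data: bytes, big_endian: bool) -> bool:
    """Check two-octet code units read in the given byte order.

    Sane here means: no 0xFFFF unit, and no 0xFFFE unit (which would be
    a 0xFEFF of the other endianness).
    """
    if len(data) % 2:
        return False
    fmt = (">" if big_endian else "<") + "%dH" % (len(data) // 2)
    units = struct.unpack(fmt, data)
    return 0xFFFF not in units and 0xFFFE not in units
```

A caller would presumably run the sanity check first and only then skip the leading U+FEFF, falling back to passing the octets through untouched when the check fails.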

The next question is, where is this "XML processing" done?  As David
Kastrup points out, Emacs must be able to pass the BOM through to the
buffer, and he may be correct that that is the best default behavior.
I don't see any way for the coding system to determine when
pass-through is appropriate and when it is not; the coding system is
normally invoked at a level where Emacs cannot know that the text will
be processed by nxml.

Therefore I think that for the purposes of XML conformance, nxml, not
Emacs, must be considered the XML processor, and nxml's failure to
recognize the BOM and ignore it (for the purpose of checking validity)
is a bug.

However, I'd be careful.  Maybe somebody should ask James Clark why he
did things this way.  He may have had an excellent reason for
insisting that Emacs strip BOMs before passing the file on to nxml.

Or maybe he just never saw UTF-8 signatures, or maybe he never edits
binary files using text encodings, and didn't consider use cases other
than editing text files important enough to provide for the BOM in
nxml.



