Re: byte-order marks
From: Mark H Weaver
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 14:09:16 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)
I wrote:
> Having slept on this, I think I agree that 'open-input-file' should
> auto-consume BOMs.
On the other hand, there's a nasty complication. Of course
(open-input-file FILENAME) is just (open-file FILENAME "r"), so the
auto-consuming logic should be in 'open-file'.
So what should (open-file FILENAME "r+") do? The problem is that we
don't know if the user will read or write first. If they write first,
then they may reasonably assume that what they write will be put at the
very beginning of the file, no?
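
To make the complication concrete, here is a minimal sketch in Python (not Guile, purely as an illustration). The helper `open_consuming_bom` is hypothetical: it only mimics an 'open-file' that skips a UTF-8 BOM at open time. It does not reflect how Guile's ports are implemented.

```python
import codecs
import os
import tempfile

def open_consuming_bom(path, mode="r+b"):
    """Hypothetical stand-in for an 'open-file' that auto-consumes a UTF-8 BOM."""
    f = open(path, mode)
    head = f.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        # BOM found: leave the position just past it.
        f.seek(len(codecs.BOM_UTF8))
    else:
        # No BOM: rewind so that no content is lost.
        f.seek(0)
    return f

def write_first(path, data):
    """Open for update, write before reading anything, return the file bytes."""
    with open_consuming_bom(path) as f:
        f.write(data)
    with open(path, "rb") as f:
        return f.read()

def demo():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(codecs.BOM_UTF8 + b"hello")
        return write_first(path, b"J")
    finally:
        os.remove(path)

result = demo()
```

Under these assumptions the write lands at byte offset 3, after the BOM, rather than at offset 0 where a caller who writes first would probably expect it.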
Also, Unicode 6.2 section 2.6 table 2-4 says that BOMs are only allowed
for the encoding schemes UTF-8, UTF-16, and UTF-32. They are *not*
allowed for UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
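
As an illustration of that distinction, Python's standard codecs happen to treat UTF-16 and UTF-16BE this way. This is only a sketch of the behaviour in another language, not a statement about what Guile does. Note that Python's plain "utf-8" codec keeps an initial U+FEFF, and a separate "utf-8-sig" codec strips it.

```python
# Big-endian bytes for U+FEFF followed by "A".
data = b"\xfe\xff\x00A"

# "UTF-16" encoding scheme: the initial U+FEFF is a BOM and is consumed.
as_utf16 = data.decode("utf-16")

# "UTF-16BE": the byte order is explicit, so U+FEFF stays in the text.
as_utf16be = data.decode("utf-16-be")

# UTF-8: the plain codec keeps the BOM, "utf-8-sig" removes it.
utf8_data = b"\xef\xbb\xbfA"
as_utf8 = utf8_data.decode("utf-8")
as_utf8_sig = utf8_data.decode("utf-8-sig")
```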
Unicode 6.2 section 16.8 goes into more detail:

    For compatibility with versions of the Unicode Standard prior to
    Version 3.2, the code point U+FEFF has the word-joining semantics of
    zero width no-break space when it is not used as a BOM. [...]

    Where the byte order is explicitly specified, such as in UTF-16BE or
    UTF-16LE, then all U+FEFF characters -- even at the very beginning of
    the text -- are to be interpreted as zero width no-break spaces.
    Similarly, where Unicode text has known byte order, initial U+FEFF
    characters are not required, but for backward compatibility are to be
    interpreted as zero width no-break spaces. [...]

    Systems that use the byte order mark must recognize when an initial
    U+FEFF signals the byte order. In those cases, it is not part of the
    textual content and should be removed before processing, because
    otherwise it may be mistaken for a legitimate zero width no-break
    space. To represent an initial U+FEFF zero width no-break space in a
    UTF-16 file, use U+FEFF twice in a row. The first one is a byte order
    mark; the second one is the initial zero width no-break space. [...]

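
The rule quoted above could be sketched roughly as follows, again in Python purely for illustration. The function `strip_initial_bom` and its set of encoding names are hypothetical, chosen only to mirror the passage: remove at most one initial U+FEFF, and only when the encoding scheme does not fix the byte order.

```python
ZWNBSP = "\ufeff"  # U+FEFF: BOM or zero width no-break space, depending on context

# Encoding schemes for which an initial U+FEFF is read as a BOM
# (hypothetical names, matched case-insensitively).
BOM_SCHEMES = {"utf-8", "utf-16", "utf-32"}

def strip_initial_bom(text, encoding):
    """Drop at most one leading U+FEFF, and only for BOM-permitting schemes."""
    if encoding.lower() in BOM_SCHEMES and text.startswith(ZWNBSP):
        return text[1:]
    return text

one = strip_initial_bom("\ufeffabc", "UTF-16")         # BOM removed
two = strip_initial_bom("\ufeff\ufeffabc", "UTF-16")   # second U+FEFF is kept as text
explicit = strip_initial_bom("\ufeffabc", "UTF-16BE")  # explicit byte order: kept

# Python's own "utf-16" decoder behaves the same way for a doubled U+FEFF.
doubled = (b"\xff\xfe\xff\xfe" + "A".encode("utf-16-le")).decode("utf-16")
```

Only the first U+FEFF is ever consumed, so a file that really does begin with a zero width no-break space survives a round trip as long as the writer emits U+FEFF twice.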
This will require some more research and thought.
Mark