Auto-detection of windows-1252 fails
From: Reiner Steib
Subject: Auto-detection of windows-1252 fails
Date: Sat, 05 Jan 2008 14:22:37 +0100
User-agent: Gnus/5.110007 (No Gnus v0.7) Emacs/23.0.50 (gnu/linux)
Hi,
in September/October 2006 we had a long thread on emacs-pretest-bugs
about auto-detection of windows-1252 text files:
Subject: local chars displayed as numbers
<http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/>
[ I include a summary of this thread below. ]
windows-1252 files were supposed to be detected automatically in the
"Latin-1" and "German" language environments. This doesn't work
(anymore?) in Emacs 22.1, the Emacs_22 branch and in the trunk.
* Recipe to reproduce the problem:
$ echo -e '\x91 O:\xD6 o:\xF6 \x92' > w1252-O-o.txt
I.e., the file contains the following non-ASCII characters:
- LEFT SINGLE QUOTATION MARK (U+2018)
- LATIN CAPITAL LETTER O WITH DIAERESIS (U+00D6)
- LATIN SMALL LETTER O WITH DIAERESIS (U+00F6)
- RIGHT SINGLE QUOTATION MARK (U+2019)
$ file w1252-O-o.txt
w1252-O-o.txt: Non-ISO extended-ASCII text
When decoded correctly, it looks like this:
,----[ w1252-O-o.txt ]
| ‘ O:Ö o:ö ’
`----
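For readers without Emacs at hand, the difference between the two decodings can be illustrated with a minimal Python sketch. It is only an illustration of the byte values in the recipe, not of how Emacs detects coding systems; `cp1252` and `latin-1` are Python's codec names for windows-1252 and iso-8859-1, and the variable names are made up for this example.

```python
# The bytes written by the echo command in the recipe above,
# including the trailing newline.
data = b"\x91 O:\xd6 o:\xf6 \x92\n"

# Decoded as windows-1252, 0x91 and 0x92 become the curly quotes
# U+2018 and U+2019.
as_cp1252 = data.decode("cp1252")

# Decoded as iso-8859-1, the same two bytes become the C1 control
# characters U+0091 and U+0092; the umlauts are identical in both.
as_latin1 = data.decode("latin-1")

print(as_cp1252, end="")
print([hex(ord(c)) for c in as_latin1 if ord(c) > 0x7F])
```

Only the bytes in the range 0x80..0x9F distinguish the two decodings; Ö and ö sit at the same code points in both character sets.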
* Expected result:
According to the discussion in 2006, this file should be recognized
as windows-1252 with the following command lines:
$ LC_ALL=de_DE emacs -Q w1252-O-o.txt
$ emacs -Q --eval '(set-language-environment "German")' w1252-O-o.txt
* Current result:
The file is opened in iso-8859-1, i.e. the left quotation mark is
displayed as \221 and the right quotation mark is detected as
eight-bit-control:
,----[ M-x describe-char RET ]
| character: \222 (146, #o222, #x92, U+0092)
| charset: eight-bit-control (8-bit control code (0x80..0x9F))
| code point: #x92
| syntax: which means: whitespace
| buffer code: #x92
| file code: not encodable by coding system iso-latin-1-unix
| display: by display table entry [?'] (see below)
|
| The display table entry is displayed by these fonts (glyph codes):
| ': -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO8859-1 (#x27)
`----
* Summary of the September/October 2006 discussion:
The following change was installed...
,----[ ChangeLog.12 ]
| 2006-09-21 Kenichi Handa <address@hidden>
|
| * language/european.el ("Latin-1"): Add windows-1252 to
| coding-priority.
| ("German"): Likewise.
`----
... and was supposed to result in the following behavior:
Kenichi Handa wrote in
<http://article.gmane.org/gmane.emacs.pretest.bugs/14384>:
| A file containing a windows-1252 char that doesn't appear in
| iso-8859-1 is detected as windows-1252. Bad effect is that some (or
| many) binary files are also detected as windows-1252.
Some people pointed out that, as a side effect, some (or many) binary
files would also be detected as windows-1252. Eli suggested
implementing null-byte detection, which should solve this problem.
In <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14384>
Kenichi Handa wrote:
| Reiner Steib <reinersteib+gmane <at> imap.cc> writes:
|
| > (6) Implement null-byte detection (to prevent binary files
| > mis-detected as windows-12xx), keep the current code (windows-1252)
| > and add windows-1254/1255 accordingly.
|
| I think that change results in the best behavior.
... and Richard agreed. But I don't think this has been implemented.
("the current code" refers to the 2006-09-21 change, see above.)
In
<http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14367> I
attached three simple test files and described the results:
,----
| I did some tests with (see attached auto-coding.tar.gz)...
|
| (a) a file containing only windows-1252 characters,
|
| (b) a file with some Latin-1 text plus "reserved characters"
| (i.e. chars not defined in windows-1252),
|
| (c) a file with some Latin-1 and windows-1252 text plus a null-byte.
|
| Emacs detected the files as:
|
| (a) windows-1252 (-> correct)
|
| (b) raw-text-unix (-> correct)
|
| (c) windows-1252 (-> slightly incorrect, at least for people who argue
| that binary is better here)
`----
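The behavior proposed in option (6) for these three kinds of files can be sketched in Python as follows. This is a hypothetical classifier written for illustration, not Emacs's detection code: the function name `classify`, the return labels and the sample byte strings are all assumptions, and it relies on Python's `cp1252` codec rejecting the five bytes that windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).

```python
def classify(data: bytes) -> str:
    """Hypothetical sketch of option (6); not how Emacs is implemented."""
    # Null-byte detection: treat any file containing 0x00 as binary.
    if b"\x00" in data:
        return "binary"
    # Pure 7-bit text needs no further guessing.
    if all(b < 0x80 for b in data):
        return "ascii"
    # No bytes in 0x80..0x9F: plain iso-8859-1 is sufficient.
    if not any(0x80 <= b <= 0x9F for b in data):
        return "iso-8859-1"
    # Bytes in 0x80..0x9F are present: accept windows-1252 only if
    # none of them is a code point that windows-1252 leaves undefined.
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return "raw-text"
    return "windows-1252"


# Made-up stand-ins for the three test files described above.
FILE_A = b"\x91 O:\xd6 o:\xf6 \x92\n"    # (a) valid windows-1252 text
FILE_B = b"M\xfcller \x81\n"             # (b) Latin-1 text plus reserved 0x81
FILE_C = b"\x91M\xfcller\x92\x00\n"      # (c) text plus a null byte

for name, sample in (("a", FILE_A), ("b", FILE_B), ("c", FILE_C)):
    print(name, classify(sample))
```

With this ordering, case (c) is reported as binary rather than windows-1252, which is the change the null-byte check was meant to bring about; cases (a) and (b) match the results quoted above.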
* Additionally, the addition of windows-1252 to "German" has been lost
in the emacs-unicode-2 branch:
--- european.el 26 Jul 2007 05:27:10 -0000 1.100
+++ european.el 25 Dec 2007 10:57:51 -0000 1.86.4.13
@@ -277,16 +414,15 @@
(set-language-info-alist
"German" '((tutorial . "TUTORIAL.de")
- (charset ascii latin-iso8859-1)
+ (charset iso-8859-1)
(coding-system iso-latin-1 iso-latin-9)
- (coding-priority iso-latin-1 windows-1252)
+ (coding-priority iso-latin-1)
+ (nonascii-translation . iso-8859-1)
(input-method . "german-postfix")
Bye, Reiner.
--
,,,
(o o)
---ooO-(_)-Ooo--- | PGP key available | http://rsteib.home.pages.de/