Re: address@hidden: Coding problem with Euro sign]


From: Kevin Rodgers
Subject: Re: address@hidden: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 15:58:22 -0700
User-agent: Mozilla Thunderbird 0.9 (X11/20041105)

Ralf Angeli wrote:
> Currently I am on GNU/Linux.  Anyway, with the development version of
> Emacs I did not have the problems with cp1252 you described when
> loading the file.  But when trying to write the file I got this
> warning:
>
> ,----
> | Warning (:warning): Invalid coding system `cp1252' is specified
> | for the current buffer/file by the variable `auto-coding-regexp-alist'.
> | It is highly recommended to fix it before writing to a file.
> `----
>
> I didn't do `M-x codepage-setup RET' before trying all of this.
> Interestingly loading and writing the file worked fine if I used
> windows-1252 instead of cp1252.

Well, there you go.  Emacs 22.0 supports windows-1252, and Emacs 21.4
only supports cp850.

> * Kevin Rodgers (2005-12-15) writes:
>>One other detail: that entry only sets the coding system if the euro
>>is immediately preceded by an ASCII character.  Is that the case in
>>your file?
>
> No.  On emacs-pretest-bug I already explained that the original (test)
> file doesn't include the A circumflex, that means the euro is preceded
> by a newline.  (Maybe it would be better to continue the discussion in
> the thread on emacs-pretest-bug in order to avoid repetition?)

Ah.  The regexp only matches the [\200-\237] bytes when they follow a
non-control ASCII character.  So [\040-\177] needs to be expanded, at
least to [\t\n\r\040-\177] to include tab, newline, and carriage
return, but maybe [\t\n\r\v\f\040-\177] to include vertical tab and
formfeed, or even [\000-\177] to include all ASCII characters.
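
To make the difference concrete outside Emacs, here is a minimal
Python sketch of the narrow and widened character classes applied to
raw bytes.  It is not the actual auto-coding-regexp-alist entry; the
pattern names and the sample data are made up for illustration, and
the octal ranges are written in hex.

```python
import re

# Hex equivalents of the octal classes discussed above:
# [\040-\177] is \x20-\x7f and [\200-\237] is \x80-\x9f.
NARROW = re.compile(rb"[\x20-\x7f][\x80-\x9f]")
WIDER = re.compile(rb"[\t\n\r\x20-\x7f][\x80-\x9f]")
ALL_ASCII = re.compile(rb"[\x00-\x7f][\x80-\x9f]")

# Hypothetical sample data: 0x80 is the Euro sign in windows-1252.
after_newline = b"price:\n\x80 100\n"
after_space = b"price: \x80 100\n"


def matches(pattern, data):
    """Return True if the pattern occurs anywhere in the byte string."""
    return pattern.search(data) is not None


for label, data in (("after newline", after_newline),
                    ("after space", after_space)):
    print(label,
          matches(NARROW, data),
          matches(WIDER, data),
          matches(ALL_ASCII, data))
```

In this sketch the narrow class misses a Euro byte that follows a
newline, while both widened classes catch it.  Note that all three
patterns require a preceding byte, so none of them matches when the
byte in question is the very first one in the data.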

(I don't subscribe to emacs-pretest-bug; I read the gnu.emacs.devel
newsgroup on gmane.org, which is gatewayed to and from the
address@hidden mailing list.  If you followed up to both mailing
lists/newsgroups, that should solve the problem.)

> If I insert a space or a random ASCII character before the Euro sign
> and evaluate the form above (using windows-1252 for the encoding) the
> encoding is being identified correctly and both the u umlaut and the
> Euro sign are being displayed correctly.

Good!

...

>>How is Emacs supposed to infer the coding system from the contents of
>>that file?  If you can come up with a suitable customization, perhaps
>>it will be incorporated into Emacs as the default behavior.
>
> If I knew how to do that I would have sent a patch already.  My naive
> approach would be to look for the presence of bytes which are
> characteristic for Windows codepages in order to identify the encoding
> as a Windows codepage.

Right, but a single byte is not enough information to identify the
character encoding.  Even a pattern is not enough, since coding systems
may differ only in what characters are assigned to the same byte
sequence: sometimes you need "out of band" information.
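
As an illustration of that ambiguity, the following Python sketch
decodes the same single byte under several encodings.  It uses
Python's standard codecs rather than anything in Emacs, and the
helper name `decodings` is made up for this example.

```python
def decodings(byte_value, codecs):
    """Decode one byte under each named codec and collect the results."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in codecs}


# 0x80 is assigned in both cp1252 and cp850, but to different characters.
print(decodings(0x80, ("cp1252", "cp850")))

# 0xA4 is valid in all three of these, so the byte alone says nothing
# about which encoding the author intended.
print(decodings(0xA4, ("iso-8859-1", "iso-8859-15", "cp1252")))
```

The byte 0x80 comes out as the Euro sign under cp1252 and as a capital
C with cedilla under cp850; 0xA4 is the generic currency sign in
ISO 8859-1 and cp1252 but the Euro sign in ISO 8859-15.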

Have you read the Recognize Coding node (aka Recognizing Coding Systems)
of the Emacs manual?

The Emacs implementors are less naive than you and me.  :-)

> Maybe looking at line endings can help to make the right decision.

That would be a very weak heuristic indeed.  As I understand it, Emacs is
very conservative in this regard: if a buffer contains only single \r
sequences, it's mac; if it contains only \n sequences, it's unix; if it
contains only \r\n sequences, it's DOS; but if it contains a mix, it is
indeterminate.
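
The rule as described here can be sketched in a few lines of Python.
This is only a simplified model of the description above, not Emacs's
actual detection code; the function name `classify_eol` and its
return values are assumptions made for the example.

```python
def classify_eol(data: bytes) -> str:
    """Classify line endings: 'dos', 'mac', 'unix', 'indeterminate', or 'none'."""
    crlf = data.count(b"\r\n")
    # Bare CR and bare LF are those not already counted as part of a CRLF pair.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf

    kinds = [name
             for name, count in (("dos", crlf), ("mac", cr), ("unix", lf))
             if count > 0]
    if not kinds:
        return "none"           # no line endings at all
    if len(kinds) == 1:
        return kinds[0]         # exactly one convention is present
    return "indeterminate"      # a mix of conventions


print(classify_eol(b"one\r\ntwo\r\n"))
print(classify_eol(b"one\r\ntwo\n"))
```

Even when the result is 'dos', that only says the file has CRLF line
endings; it does not say which character encoding the bytes are in.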

> After the encoding was identified to be a Windows
> codepage, the exact codepage could be chosen based on the language
> environment.  But this suggestion is just random guesswork from my
> side because I know close to nothing about what processes are involved
> in identifying an encoding.

Me neither; your idea sounds reasonable to me.  But I don't understand
why auto-coding-regexp-alist has such a high priority (over the coding:
tag).

>>Can Notepad display files in anything besides CP850/Windows-1252 and
>>probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
>>8859-2 from ISO 8859-15?
>
> As far as I understood Reiner on emacs-pretest-bug this is impossible
> anyway.

Just as windows-1252 can't be distinguished reliably from any other
coding system that uses bytes [\200-\237].

>>Yes, Windows applications simply assume you're using a proprietary
>>Microsoft character set, and GNU/Linux apps prioritize support for
>>standard character encodings.  Maybe all you need is
>>(prefer-coding-system 'cp850)
>
> Wouldn't that be a bit too restricted as a general solution for Emacs?

Of course.  But we don't know whether this is a general problem for
Emacs or a specific problem for your configuration, nor in either case
whether it's a problem that can be solved.  As a scientist I'd like to
solve the most general case, but as an engineer I'd like to start by
solving the particular problem you've identified.

--
Kevin Rodgers




