emacs-devel

Re: address@hidden: Coding problem with Euro sign]


From: Ralf Angeli
Subject: Re: address@hidden: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 12:55:47 +0100
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/22.0.50 (gnu/linux)

* Kevin Rodgers (2005-12-15) writes:

> Ralf Angeli wrote:
>  > * Kevin Rodgers (2005-12-15) writes:
>  >
>  >>You could try something like this:
>  >>
>  >>(setq auto-coding-regexp-alist
>  >>       (cons '("[\040-\177][\200-\237]" . cp1252)
>  >>             auto-coding-regexp-alist))
>  >
>  > This doesn't seem to work here.  I still see the byte codes of the
>  > 8-bit characters when opening the file after evaluating the above
>  > form.
[...]
> I assume those display problems are because I haven't configured an
> Emacs fontset for the cp850 coding system.  But the
> auto-coding-regexp-alist entry worked as intended, and you're on
> Windows so your fontset should be properly configured for that.

Currently I am on GNU/Linux.  Anyway, with the development version of
Emacs I did not have the problems with cp1252 you described when
loading the file.  But when trying to write the file I got this
warning:

,----
| Warning (:warning): Invalid coding system `cp1252' is specified
| for the current buffer/file by the variable `auto-coding-regexp-alist'.
| It is highly recommended to fix it before writing to a file.
`----

I didn't do `M-x codepage-setup RET' before trying any of this.
Interestingly, both loading and writing the file worked fine when I
used windows-1252 instead of cp1252.
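
For reference, a minimal sketch of the variant that worked here: it is
the form quoted at the top of this message with only the coding system
name changed.  That cp1252 is rejected because `codepage-setup' was
not run is an assumption, not something verified.

```elisp
;; Same entry as suggested earlier in the thread, but naming
;; windows-1252 instead of cp1252.  The regexp is unchanged: an
;; ASCII byte (octal 040-177) followed by a byte in octal 200-237.
(setq auto-coding-regexp-alist
      (cons '("[\040-\177][\200-\237]" . windows-1252)
            auto-coding-regexp-alist))
```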

> One other detail: that entry only sets the coding system if the euro
> is immediately preceded by an ASCII character.  Is that the case in
> your file?

No.  On emacs-pretest-bug I already explained that the original (test)
file doesn't include the A circumflex, that means the euro is preceded
by a newline.  (Maybe it would be better to continue the discussion in
the thread on emacs-pretest-bug in order to avoid repetition?)

If I insert a space or an arbitrary ASCII character before the Euro
sign and evaluate the form above (using windows-1252 as the coding
system), the encoding is identified correctly and both the u umlaut
and the Euro sign are displayed correctly.
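
A possible way to cover the case where the Euro sign directly follows
a newline is to add the newline byte (octal 012) to the first
character class.  This is only a sketch and has not been tested
against the original file; whether the wider pattern misfires on
files in other 8-bit encodings is an open question.

```elisp
;; Hypothetical variant of the entry above: also accept a newline
;; (octal 012) in front of a byte in the range octal 200-237, so a
;; Euro sign at the start of a line is matched as well.
(setq auto-coding-regexp-alist
      (cons '("[\012\040-\177][\200-\237]" . windows-1252)
            auto-coding-regexp-alist))
```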

> What does `C-h C RET' say after visiting the file?

In case the encoding is not identified correctly:

,----
| Coding system for saving this buffer:
|   t -- raw-text-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| 
| Priority order for recognizing coding systems when reading files:
|   1. iso-latin-1 (alias: iso-8859-1 latin-1)
|   2. mule-utf-8 (alias: utf-8)
|   3. mule-utf-16be-with-signature (alias: utf-16be-with-signature mule-utf-16-be utf-16-be)
|   4. mule-utf-16le-with-signature (alias: utf-16le-with-signature mule-utf-16-le utf-16-le)
|   5. iso-2022-jp (alias: junet)
|   6. iso-2022-7bit 
|   7. iso-2022-7bit-lock (alias: iso-2022-int-1)
|   8. iso-2022-8bit-ss2 
|   9. emacs-mule 
|   10. raw-text 
|   11. japanese-shift-jis (alias: shift_jis sjis cp932)
|   12. chinese-big5 (alias: big5 cn-big5 cp950)
|   13. no-conversion 
| 
|   Other coding systems cannot be distinguished automatically
|   from these, and therefore cannot be recognized automatically
|   with the present coding system priorities.
| 
|   The following are decoded correctly but recognized as iso-2022-7bit-lock:
|     iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext
|     iso-2022-jp-2 iso-2022-kr
| [...]
`----

In case the coding is identified correctly:

,----
| Coding system for saving this buffer:
|   * -- windows-1252-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| [...]
`----

> I assume you're running with multibyte characters enabled.

Yes.  The relevant setting should be included in the original bug
report.

>  > And a customization is actually not what I am interested in; I'd like
>  > Emacs to figure this out by itself, out of the box.
>
> How is Emacs supposed to infer the coding system from the contents of
> that file?  If you can come up with a suitable customization, perhaps
> it will be incorporated into Emacs as the default behavior.

If I knew how to do that, I would have sent a patch already.  My naive
approach would be to look for bytes which are characteristic of
Windows codepages in order to identify the encoding as a Windows
codepage.  Maybe looking at line endings can help to make the right
decision.  Once the encoding has been identified as a Windows
codepage, the exact codepage could be chosen based on the language
environment.  But this suggestion is just guesswork on my side because
I know close to nothing about the processes involved in identifying an
encoding.
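
To make the idea concrete, here is a rough sketch of such a two-step
guess.  Everything in it is an assumption for illustration: the
`my-' function and variable names are made up, the mapping table is
incomplete, and Emacs's real detection code works differently.  The
line-ending check is left out to keep the sketch short.

```elisp
;; Hypothetical table mapping a language environment name to the
;; Windows codepage to prefer.  The entries are illustrative only.
(defvar my-codepage-by-language
  '(("Latin-1" . windows-1252)
    ("German"  . windows-1252)
    ("Latin-2" . windows-1250))
  "Illustrative mapping from language environment to codepage.")

(defun my-windows-codepage-bytes-p (bytes)
  "Return non-nil if unibyte string BYTES looks like a Windows codepage.
The test is whether a byte in the range 0x80-0x9F occurs at the very
start or directly after an ASCII byte."
  (let ((prev 0)                        ; treat start of data as ASCII
        (found nil)
        (i 0)
        (n (length bytes)))
    (while (and (not found) (< i n))
      (let ((b (aref bytes i)))
        (when (and (<= #x80 b #x9f) (< prev #x80))
          (setq found t))
        (setq prev b
              i (1+ i))))
    found))

(defun my-guess-codepage (bytes)
  "Return a codepage symbol for BYTES, or nil if none is suggested.
The codepage is looked up by `current-language-environment'."
  (when (my-windows-codepage-bytes-p bytes)
    (cdr (assoc current-language-environment
                my-codepage-by-language))))
```

Calling `(my-guess-codepage "\n\200")' would then return the codepage
listed for the current language environment, or nil if the table has
no entry for it.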

> Can Notepad display files in anything besides CP850/Windows-1252 and
> probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
> 8859-2 from ISO 8859-15?

As far as I understood Reiner on emacs-pretest-bug, this is impossible
anyway.

> Yes, Windows applications simply assumes you're using a proprietary
> Microsoft character set, and GNU/Linux apps prioritize support for
> standard character encodings.  Maybe all you need is
> (prefer-coding-system 'cp850)

Wouldn't that be a bit too restrictive as a general solution for Emacs?

-- 
Ralf




