bug-gnu-emacs

Re: Emacs doesn't write with the encoding it used for reading


From: Eli Zaretskii
Subject: Re: Emacs doesn't write with the encoding it used for reading
Date: Fri, 05 Apr 2002 19:17:12 +0300

> From: Rommerskirchen Heinrich <Heinrich.Rommerskirchen@icn.siemens.de>
> Date: Fri, 5 Apr 2002 14:59:54 +0200
> 
> If a file contains a mixture of German DOS-encoded text (cp850) and
> German Windows-encoded text it is read as latin1 but emacs will not
> write it back as latin1.

This is a known issue.  As surprising as it sounds, it actually makes
sense (IMHO): this is how the user becomes aware that her files have
inconsistent encoding.  Normally, such files are corrupted in some
way.

When Emacs reads a file with random 8-bit bytes that don't fit
Latin-1, it decodes those bytes into a special character set reserved
for decoding binary bytes.  To see what Emacs thinks about those
characters, go to one of them and type "C-u C-x =".  You will see that
they are not Latin-1 characters, as far as Emacs is concerned.
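
As a rough illustration of why such bytes don't fit Latin-1, here is a
small Python sketch (an analogy only, not how Emacs decodes files).  It
assumes the DOS umlaut-u is the byte \201 (0x81), as in the report, and
that the Windows one is 0xFC; the helper name `describe` is made up for
this example.

```python
import unicodedata

# Byte values assumed from the report: \201 (0x81) is umlaut-u in cp850;
# 0xFC is umlaut-u in Latin-1 (and in Windows cp1252).
CP850_UMLAUT = b"\x81"
LATIN1_UMLAUT = b"\xfc"


def describe(byte: bytes, encoding: str) -> str:
    """Decode one byte and report its code point and Unicode category."""
    ch = byte.decode(encoding)
    return f"{byte!r} as {encoding}: U+{ord(ch):04X}, category {unicodedata.category(ch)}"


# Each byte is a letter (category Ll) when read with its own encoding.
print(describe(CP850_UMLAUT, "cp850"))
print(describe(LATIN1_UMLAUT, "latin-1"))

# Read as Latin-1, the DOS byte lands in the C1 control range (category Cc),
# so it is not a Latin-1 letter at all.
print(describe(CP850_UMLAUT, "latin-1"))
```

A file that contains both bytes therefore has no single 8-bit reading
in which both are the same letter.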

It should be possible to allow Emacs to write those bytes when you
save a Latin-1 buffer, but doing so using the existing Emacs machinery
has unpleasant side effects, so it was decided not to do that.

What practical problems do you have with this?  That is, when does it
make sense to have a file that mixes Latin-1 and DOS cp850 encoding?

> Saving with utf8 gives a file with 6 bytes, which emacs again reads
> as latin1

You expect Emacs to guess that the file you saved is in UTF-8.
However, UTF-8 encoding cannot be easily distinguished from
single-byte encodings.  Therefore, Emacs uses a priority list, whereby
it chooses the first encoding from the list that fits.  The default
configuration puts UTF-8 very far from the beginning of the list, so
Emacs guesses wrong.  You should either make UTF-8 your preferred
coding system, or force Emacs to read the file as UTF-8 with "C-x RET c".
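
The effect of the priority order can be sketched in Python.  This is
only an analogy under simplifying assumptions: the function
`guess_encoding` and the sample text are hypothetical, and Emacs's real
detection is more involved than "first candidate that decodes".

```python
from typing import Sequence


def guess_encoding(data: bytes, priority: Sequence[str]) -> str:
    """Return the first encoding in `priority` that decodes `data` without error."""
    for name in priority:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding fits")


# Hypothetical sample: a German word saved as UTF-8 (the umlaut takes two bytes).
saved = "f\u00fcr\n".encode("utf-8")

# Every byte sequence is valid Latin-1, so Latin-1 wins whenever it comes first.
print(guess_encoding(saved, ["latin-1", "utf-8"]))   # latin-1
# With UTF-8 preferred, the same bytes are recognized as UTF-8.
print(guess_encoding(saved, ["utf-8", "latin-1"]))   # utf-8
```

In this sketch the wrong guess is not an error in any strict sense: the
UTF-8 bytes are perfectly legal Latin-1, they just decode to two
unrelated characters instead of one umlaut.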

> Saving the original file (4 bytes) with raw-text gives a file with 5
> bytes which doesn't change anymore on reading and writing (emacs
> uses encoding raw-text-dos), it contains the original bytes plus a
> spurious \201 (Umlaut-u in DOS encoding)

Right.  If you need to edit a file that mixes several encodings, you
should visit it with raw-text.  Then saving it with raw-text will do
what you expect.


