bug-gnu-emacs
bug#13505: Bug#696026: emacs24: file corruption on saving


From: Vincent Lefevre
Subject: bug#13505: Bug#696026: emacs24: file corruption on saving
Date: Tue, 22 Jan 2013 03:35:57 +0100
User-agent: Mutt/1.5.21-6291-vl-r57386 (2013-01-20)

On 2013-01-21 19:55:20 +0200, Eli Zaretskii wrote:
> > Date: Mon, 21 Jan 2013 05:14:10 +0100
> > From: Vincent Lefevre <vincent@vinc17.net>
> > Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
> >     696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
> > 
> > On 2013-01-21 05:48:14 +0200, Eli Zaretskii wrote:
> > > > You said:
> > > > 
> > > > | The original encoded form of the characters as found on disk at
> > > > | visit time _cannot_ be recovered by saving with raw-text, because
> > > > | that encoded form is lost without a trace when the file is _visited_
> > > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > | and decoded into the internal representation.
> > > > 
> > > > This is what lossy is.
> > > 
> > > In that sense, every encoding except no-conversion is lossy.
> > 
> > Even 8-bit encodings such as latin-1?
> 
> Yes.  When latin-1 characters are decoded (as part of visiting a
> file), they are converted to the internal representation, and cease to
> be single 8-bit bytes.

Any example where saving the file without modifying it (see below)
would modify the data (as a sequence of bytes on the disk)?

> > > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs
> > > > seems to handle files with invalid UTF-8 sequences without any loss.
> > > > So, this encoding is safe, even if Emacs wrongly guess the encoding.
> > > 
> > > No, it isn't, although you could get away with it most of the time.
> > 
> > Could you give an example where one loses data with the utf-8 encoding?
> 
> E.g., in your test file, the byte whose value is 0x80 is converted to
> 0x3fff80 when the file is read into a buffer.

No, there are no problems with this example:

$ printf "\x80" > file
$ hd file
00000000  80                                                |.|
00000001
$ emacs -q file

Here Emacs selects utf-8-unix as the coding system. Then I do
  M-: (set-buffer-modified-p t)
to mark the buffer as modified (as in the bug report), then
C-x C-s. Emacs proposes raw-text, which I choose. Then C-x C-c
to quit.

$ hd file
00000000  80                                                |.|
00000001

So, the file has *not* been corrupted.
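
The comparison above can also be scripted, so that it does not depend on reading hex dumps by eye. This is only a minimal sketch: `same_bytes` is a hypothetical helper name, the editing session itself is left as a placeholder comment, and the octal escape `\200` is used for byte 0x80 because not every `printf` accepts `\x80`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Emacs or the bug report):
# succeeds only if the two files are byte-for-byte identical.
same_bytes() {
    cmp -s "$1" "$2"
}

tmp=$(mktemp -d)
printf '\200' > "$tmp/file"        # the single byte 0x80, as in the test above
cp "$tmp/file" "$tmp/file.orig"    # snapshot taken before visiting the file

# ... here: visit "$tmp/file" in the editor, mark the buffer as
# modified, save with the proposed coding system, and quit ...

if same_bytes "$tmp/file.orig" "$tmp/file"; then
    echo "unchanged"
else
    echo "corrupted"
fi
od -An -tx1 "$tmp/file"            # portable hex dump of the saved file
```

Taking the snapshot before the session is what makes the check meaningful: the comparison is against the bytes on disk at visit time, not against the buffer contents.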

> Perhaps by "lossless" you mean "reversible", in the sense that saving
> the same buffer will perform the reverse conversion.

Actually I don't mind what occurs internally. What I mean is: the
saved file is byte-for-byte identical to the initial file when the
buffer has not been edited (as above) and the default coding systems
proposed by Emacs are accepted (both when visiting and when saving).

> In that case, even the in-is13194-devanagari-unix is reversible: if
> you type this encoding when Emacs prompts you to select one of the
> coding systems, then you get the same file on disk with no
> corruption whatsoever.

Then this is what Emacs should propose by default on this example!
I suppose that Emacs is able to remember the encoding used to visit
the file, so that this should be possible...

> > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > clearly say when there can be data loss.
> > > 
> > > At save time, "data loss" is wrt what's in the buffer.  In that sense,
> > > the encodings Emacs suggested don't lose any data.
> > 
> > "data loss" is the difference between the original file and the saved
> > file.
> 
> But what do you want Emacs to do with this?  When you save the buffer,
> the original file might be different or no longer be available (or not
> accessible even in principle, e.g. if the data came from a
> subprocess).

The file may be different, but in general, the encoding should remain
the same. This is particularly true when Emacs is used as the editor
by some application: if the encoding of the file has been changed by
Emacs, the application will be confused.

> These issues should be detected at file visit time, if at all, not
> at buffer save time.

Possibly (this is something that the end user doesn't have to know if
the goal is to modify a file).

> > > > Then Emacs says: "Select one of the safe coding systems listed below
> > > > [...]", but doesn't say that something has already been lost. So, the
> > > > words "safe coding systems" are really misleading.
> > > 
> > > It's misleading because you misunderstand what is "safe" at buffer
> > > save time.
> > 
> > No, it's misleading because Emacs didn't say that data were lost
> > when visiting the file.
> 
> Let's be constructive here.  Please suggest some practical way for
> Emacs to handle this situation better.
> 
> For the record, here are the various alternative ways Emacs supports
> the use case you described, when a file with inconsistent encoding
> needs to be repaired manually:
> 
>  . Visit the file with "M-x find-file-literally RET".  This yields a
>    unibyte buffer, where each byte stands for itself, and which you
>    can edit without risking en-/decoding issues.

Though the above is possible, the user often opens files with
"emacs <file>".

>  . Visit the file normally, then type "M-x hexl-mode RET" (or use 
>    "M-x hexl-find-file RET" to visit it in the first place).  This
>    revisits (or visits) the file in a unibyte buffer, and in addition
>    lets you edit the binary stuff regardless of its graphic
>    representation.

If Emacs notices a potential problem when visiting the file, this
method can be proposed by Emacs, but it shouldn't be the only way,
because the file may contain mostly ASCII characters and hex-editing
is not the best choice in such a case.

>  . After visiting the file normally and noticing that it contains
>    weird characters, or after being prompted to select a coding system
>    when saving the buffer, type "C-x RET r raw-text RET" to revisit
>    the file in raw-text encoding.  Then edit the bytes and save the
>    file.

But Emacs could propose that itself: instead of immediately decoding
the file into the buffer, it could ask the user which coding system
to use.

One drawback of raw-text is that 8-bit characters are completely
unreadable. I think that there should be, for instance, a utf-8
degraded coding system: correct UTF-8 sequences are decoded using
UTF-8, and invalid sequences are left intact. Emacs can already
do this kind of thing, but there should be two differences from
the current behavior:

* When visiting the file, ask the user what to do in case Emacs
  cannot select a clean coding system without any problem. For
  instance, a "Select coding system" prompt. (BTW, couldn't hexl
  be regarded as a special coding system at this point? Perhaps
  "coding system" isn't the right term here, "editing mode" might
  be better.) Other settings in .emacs could override that,
  of course, i.e. this would just be the default.

* In case of UTF-8 degraded coding system, Emacs should save the
  file in the same UTF-8 degraded coding system. This is a way
  for the user to say: "I know that there are invalid sequences,
  just keep them."

UTF-8 is just an example above. The same approach could apply to
other encodings.
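
The visit-time check described in the first point can be approximated outside Emacs, to show what "cannot select a clean coding system" means for UTF-8. This is a rough sketch, not how Emacs detects coding systems: `is_valid_utf8` is a hypothetical name, and the check assumes an iconv(1) that exits with a non-zero status on invalid input, as the glibc and GNU libiconv versions do.

```shell
#!/bin/sh
# Hypothetical visit-time check: is the file strictly valid UTF-8?
# Assumes iconv exits non-zero when it meets an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

dir=$(mktemp -d)
printf 'caf\303\251\n' > "$dir/good"   # "café" as a valid UTF-8 sequence
printf 'caf\200\n'     > "$dir/bad"    # stray 0x80 byte, invalid in UTF-8

for f in "$dir/good" "$dir/bad"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): clean, decode as UTF-8"
    else
        echo "$(basename "$f"): invalid sequences, ask the user"
    fi
done
```

Only the second file would trigger the proposed prompt; a file that decodes cleanly would be visited as it is today.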

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




