bug-gnu-emacs

bug#13505: Bug#696026: emacs24: file corruption on saving


From: Eli Zaretskii
Subject: bug#13505: Bug#696026: emacs24: file corruption on saving
Date: Tue, 22 Jan 2013 09:56:44 +0200

> Date: Tue, 22 Jan 2013 03:35:57 +0100
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
>       696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
> 
> > > > > | The original encoded form of the characters as found on disk at
> > > > > | visit time _cannot_ be recovered by saving with raw-text, because
> > > > > | that encoded form is lost without a trace when the file is _visited_
> > > > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > | and decoded into the internal representation.
> > > > > 
> > > > > This is what lossy is.
> > > > 
> > > > In that sense, every encoding except no-conversion is lossy.
> > > 
> > > Even 8-bit encodings such as latin-1?
> > 
> > Yes.  When latin-1 characters are decoded (as part of visiting a
> > file), they are converted to the internal representation, and cease to
> > be single 8-bit bytes.
> 
> Any example where saving the file without modifying it (see below)
> would modify the data (as a sequence of bytes on the disk)?

See above: I was talking about changes at file-visit time.

> > > > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs
> > > > > seems to handle files with invalid UTF-8 sequences without any loss.
> > > > > So, this encoding is safe, even if Emacs wrongly guess the encoding.
> > > > 
> > > > No, it isn't, although you could get away with it most of the time.
> > > 
> > > Could you give an example where one loses data with the utf-8 encoding?
> > 
> > E.g., in your test file, the byte whose value is 0x80 is converted to
> > 0x3fff80 when the file is read into a buffer.
> 
> No, there are no problems with this example:

Again, because we are talking about two different things.

> > Perhaps by "lossless" you mean "reversible", in the sense that saving
> > the same buffer will perform the reverse conversion.
> 
> Actually I don't mind what occurs internally. What I mean is things
> like: saved file = initial file if it hasn't been modified (as above)
> and with the default encoding(s) proposed by Emacs (when visiting and
> when saving).

That's reversibility.
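
To make the distinction concrete, here is a rough sketch in Python
rather than Emacs Lisp.  It is only an analogy: Python's
"surrogateescape" error handler stands in for the Emacs internal
representation of raw bytes (Python maps the stray byte 0x80 to
U+DC80, not to 0x3fff80), and the name round_trip is made up for
illustration.

```python
def round_trip(data: bytes, encoding: str):
    """Decode bytes (like visiting), then encode the unchanged text (like saving)."""
    # surrogateescape turns each undecodable byte into a lone surrogate
    # code point, so the byte is changed in memory but not forgotten.
    text = data.decode(encoding, errors="surrogateescape")
    saved = text.encode(encoding, errors="surrogateescape")
    return text, saved


original = b"A\x80B"  # a lone 0x80 is not valid UTF-8
text, saved = round_trip(original, "utf-8")

# In memory the byte 0x80 is no longer the single byte 0x80 ...
print([hex(ord(ch)) for ch in text])
# ... yet saving the unmodified text reproduces the original bytes.
print(saved == original)

# The same holds for a valid Latin-1 byte: decoded, it is a character,
# not a byte, but the round trip is still exact.
latin_text, latin_saved = round_trip(b"\xe9", "latin-1")
print(latin_saved == b"\xe9")
```

In this sense the in-memory form always differs from the bytes on
disk, while the visit-then-save round trip can still be exact.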

> > In that case, even the in-is13194-devanagari-unix is reversible: if
> > you type this encoding when Emacs prompts you to select one of the
> > coding systems, then you get the same file on disk with no
> > corruption whatsoever.
> 
> Then this is what Emacs should propose by default on this example!

It can't easily do that.

There are 2 different use cases here:

 1) A file was visited and its encoding was found to be inconsistent.
    Then it is being saved.  This is your use case.

 2) A file was modified by adding to it characters that cannot be
    encoded by the original encoding.  For example, you visit a
    Latin-1 encoded file, then add to it characters that are outside
    the coverage of Latin-1.  Then you save the file.

What Emacs proposes is biased toward the second use case, because it
is by far the more frequent one.  The first use case is supposed to be
handled by other means, namely those I mentioned in my previous mail.

Giving instructions for both use cases is not a good idea, IMO,
because it will confuse users who do not necessarily understand what
is going on and, in particular, don't realize which of the two
situations they are in.

> I suppose that Emacs is able to remember the encoding used to visit
> the file, so that this should be possible...

It does remember.  It actually shows it in the "select safe coding
system" prompt.  The problem is that its use can do the wrong thing in
the second use case above.
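
The second use case can be sketched as follows, again in Python and
only as an analogy.  The helper encodable_with and the candidate list
are hypothetical; they are not how Emacs computes its list of safe
coding systems.

```python
def encodable_with(text: str, candidates: list) -> list:
    """Return the candidate encodings that can encode every character of text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character is outside this encoding's coverage.
            continue
        usable.append(name)
    return usable


visited_as = "latin-1"       # the encoding remembered from visit time
buffer_text = "caf\u00e9"    # every character fits in Latin-1
before = encodable_with(buffer_text, [visited_as])

# The user adds a euro sign, which Latin-1 cannot encode.
buffer_text += " \u20ac"
after = encodable_with(buffer_text, [visited_as, "iso-8859-15", "utf-8"])

print(before)
print(after)
```

Once the remembered encoding drops out of the usable list, reusing it
silently is no longer an option, and some other encoding has to be
chosen.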

> > > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > > clearly say when there can be data loss.
> > > > 
> > > > At save time, "data loss" is wrt what's in the buffer.  In that sense,
> > > > the encodings Emacs suggested don't lose any data.
> > > 
> > > "data loss" is the difference between the original file and the saved
> > > file.
> > 
> > But what do you want Emacs to do with this?  When you save the buffer,
> > the original file might be different or no longer be available (or not
> > accessible even in principle, e.g. if the data came from a
> > subprocess).
> 
> The file may be different, but in general, the encoding should remain
> the same.

That's what Emacs does, as long as it can.  But in this case, that
encoding might produce an inconsistently encoded file, so Emacs doesn't
want to do that silently.  It has no idea that the file was
inconsistently encoded in the first place, nor that you _want_ it to
continue being inconsistently encoded.

> This is particularly true when Emacs is used as the editor by some
> application: if the encoding of the file has been changed by Emacs,
> the application will be confused.

Again, that's what Emacs does normally, if that encoding can do the
job.  Producing inconsistent encoding will certainly confuse those
other programs.

> > These issues should be detected at file visit time, if at all, not
> > at buffer save time.
> 
> Possibly (this is something that the end user doesn't have to know if
> the goal is to modify a file).

This use case proves otherwise.
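
For illustration only, detecting such a problem at visit time rather
than at save time could look like the Python sketch below.  The
function find_invalid_utf8 is hypothetical and says nothing about how
Emacs decodes files; it only shows that a strict decode can report
where the encoding stops being consistent.

```python
def find_invalid_utf8(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")  # strict decoding: any invalid byte raises
    except UnicodeDecodeError as err:
        return err.start      # index of the first offending byte
    return None


clean = "caf\u00e9".encode("utf-8")
mixed = b"abc\x80def"  # mostly ASCII, with one stray byte

print(find_invalid_utf8(clean))
print(find_invalid_utf8(mixed))
```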

> >  . Visit the file with "M-x find-file-literally RET".  This yields a
> >    unibyte buffer, where each byte stands for itself, and which you
> >    can edit without risking en-/decoding issues.
> 
> Though the above is possible, the user often opens files with
> "emacs <file>".

Many users have Emacs up and running for the entire session.

> >  . Visit the file normally, then type "M-x hexl-mode RET" (or use 
> >    "M-x hexl-find-file RET" to visit it in the first place).  This
> >    revisits (or visits) the file in a unibyte buffer, and in addition
> >    lets you edit the binary stuff regardless of its graphic
> >    representation.
> 
> If Emacs notices a potential problem when visiting the file, this
> method can be proposed by Emacs, but it shouldn't be the only way,
> because the file may contain mostly ASCII characters and hex-editing
> is not the best choice in such a case.

??? Hexl Mode shows the printable characters (at the right side of the
display) in addition to the codes.  What exactly is the problem here?

> >  . After visiting the file normally and noticing that it contains
> >    weird characters, or after being prompted to select a coding system
> >    when saving the buffer, type "C-x RET r raw-text RET" to revisit
> >    the file in raw-text encoding.  Then edit the bytes and save the
> >    file.
> 
> But that could be proposed by Emacs directly: instead of decoding the
> file directly in the buffer, Emacs could ask the user which coding
> system he wants to use.

That'd be a nuisance, I think, because more often than not, keeping
the original inconsistent encoding is not what the user wants.

> One drawback of raw-text is that 8-bit characters are completely
> unreadable.

That's why I listed it last.




