emacs-devel

Rmail and the raw-text coding system


From: Mark Lillibridge
Subject: Rmail and the raw-text coding system
Date: Fri, 14 Jan 2011 10:14:06 -0800

[resend of this message again as eggs.gnu.org refused to accept it for
five days]

----

[this is a follow-up to a previous message which appears to be delayed
so it might appear afterwards]

[all the below is emacs version 22; the code fragments here are for
additional information and probably don't need to be read the first
time]


    Rmail uses encoding and decoding somewhat weirdly because it must
mix messages of different encodings in the same file.  It reads in a
file as follows:

  (let* ((file-name (expand-file-name (or file-name-arg rmail-file-name)))
         ...
         ;; Since the file may contain messages of different encodings
         ;; at the tail (non-BYBYL part), we can't decode them at once
         ;; on reading.  So, at first, we read the file without text
         ;; code conversion, then decode the messages one by one by
         ;; rmail-decode-babyl-format or
         ;; rmail-convert-to-babyl-format.
         (coding-system-for-read (and rmail-enable-multibyte 'raw-text))
         run-mail-hook msg-shown)
         ...
      (switch-to-buffer
       (let ((enable-local-variables nil))
         (find-file-noselect file-name))))

That is, it effectively visits the file using the encoding raw-text.
Because of the black magic of raw-text, unlike with most other
encodings, the result is a unibyte buffer.
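
    The effect can be checked with a small sketch like the following.
The helper names my-visit-raw-text and my-buffer-unibyte-p are
hypothetical and not part of rmail.el; only the binding of
coding-system-for-read mirrors the code above.  Note that
find-file-noselect returns an existing buffer unchanged if the file is
already being visited.

```elisp
;; Hypothetical helpers for illustration; not taken from rmail.el.
(defun my-visit-raw-text (file)
  "Visit FILE with `raw-text' decoding and return its buffer."
  (let ((coding-system-for-read 'raw-text)
        (enable-local-variables nil))
    (find-file-noselect file)))

(defun my-buffer-unibyte-p (buffer)
  "Return non-nil if BUFFER is a unibyte buffer."
  (with-current-buffer buffer
    (not enable-multibyte-characters)))
```

Calling (my-buffer-unibyte-p (my-visit-raw-text "some-file")) on a file
that is not yet visited should then return non-nil.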


    It then decodes the BABYL message part:

    (unless (and coding-system
                 (coding-system-p coding-system))
      (setq coding-system
            ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
            ;; earlier versions did that with the current buffer's encoding.
            ;; So we want to favor detection of emacs-mule (whose normal
            ;; priority is quite low), but still allow detection of other
            ;; encodings if emacs-mule won't fit.  The call to
            ;; detect-coding-with-priority below achieves that.
            (car (detect-coding-with-priority
                  from to
                  '((coding-category-emacs-mule . emacs-mule))))))
    (unless (memq coding-system
                  '(undecided undecided-unix))
      (set-buffer-modified-p t)         ; avoid locking when decoding
      (let ((buffer-undo-list t))
        (decode-coding-region from to coding-system))
      (setq coding-system last-coding-system-used))
    (set-buffer-modified-p modifiedp)
    (setq buffer-file-coding-system nil)
    (setq save-buffer-coding-system
          (or coding-system 'undecided))))

This process leaves the buffer as a unibyte buffer.  


    It also decodes each non-BABYL message at the end of the file
separately.  It does this after decoding base64 and quoted-printable
encoded message bodies that have type text or message.  (This is a
different kind of decoding than the coding system one.)

                   (let ((mime-charset
                          (if (and rmail-decode-mime-charset
                                   (save-excursion
                                     (goto-char start)
                                     (search-forward "\n\n" nil t)
                                     (let ((case-fold-search t))
                                       (re-search-backward
                                        rmail-mime-charset-pattern
                                        start t))))
                              (intern (downcase (match-string 1))))))
                     (rmail-decode-region start (point) mime-charset)))


;; Decode the region specified by FROM and TO by CODING.
;; If CODING is nil or an invalid coding system, decode by `undecided'.
(defun rmail-decode-region (from to coding)
  (if (or (not coding) (not (coding-system-p coding)))
      (setq coding 'undecided))
  ;; Use -dos decoding, to remove ^M characters left from base64 or
  ;; rogue qp-encoded text.
  (decode-coding-region from to
                        (coding-system-change-eol-conversion coding 1))
  ;; Don't reveal the fact we used -dos decoding, as users generally
  ;; will not expect the RMAIL buffer to use DOS EOL format.
  (setq buffer-file-coding-system
        (setq last-coding-system-used
              (coding-system-change-eol-conversion coding 0))))

Note that if the headers don't specify a coding system, then we fall
back to undecided.  Finally, Rmail converts the buffer to multibyte.  
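
    To make the fallback and the -dos trick concrete, here is a minimal
sketch that works on strings rather than buffer regions.  The helpers
my-effective-coding and my-decode-dropping-cr are hypothetical names of
mine; they only imitate the first two steps of rmail-decode-region and
do not touch buffer-file-coding-system.

```elisp
;; Hypothetical helper mirroring the fallback at the top of
;; `rmail-decode-region'.
(defun my-effective-coding (coding)
  "Return CODING if it names a coding system, else `undecided'."
  (if (or (not coding) (not (coding-system-p coding)))
      'undecided
    coding))

;; Hypothetical helper: decode with the DOS EOL variant (type 1), so
;; that CR LF pairs left over from base64 or qp decoding become LF.
(defun my-decode-dropping-cr (bytes coding)
  "Decode unibyte string BYTES with the DOS variant of CODING."
  (decode-coding-string
   bytes
   (coding-system-change-eol-conversion (my-effective-coding coding) 1)))
```

For example, (my-decode-dropping-cr "one\r\ntwo\r\n" nil) falls back to
undecided and should return the text with plain newlines.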



    So long as decoding new messages ends up using a coding system other
than raw-text*, this all works correctly.  Unfortunately, it appears
that sometimes decode-coding-region when passed undecided will decide to
use raw-text-unix.  I assume this is due to messages mixing incompatible
encodings (perhaps UTF-8 and Big5?).  I don't know if perfectly valid
messages can cause this problem, but God knows there are enough
badly formatted messages out there that mix formats.
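
    One way to probe which coding system undecided settles on for a
given byte sequence is sketched below.  The helper name
my-coding-chosen-by-undecided is hypothetical, and the sample bytes are
just an assumption of what a mixed message might contain; what comes
back depends on the Emacs version and the coding priorities in effect.

```elisp
;; Hypothetical probe, not part of rmail.el.
(defun my-coding-chosen-by-undecided (bytes)
  "Decode unibyte string BYTES with `undecided'; return the coding system used."
  (let ((last-coding-system-used nil))
    (decode-coding-string bytes 'undecided)
    last-coding-system-used))

;; Sample input: three bytes of valid UTF-8 followed by two bytes that
;; are not valid UTF-8 on their own.
(my-coding-chosen-by-undecided "\344\270\255 \244\244")
```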

    When visiting a strange file, using raw-text* can make sense, since
the resulting buffer will be unibyte, preserving the exact sequence of
bytes in the file.  When written out, the same bytes will be reproduced.
Unfortunately, however, Rmail buffers are always multibyte (excepting
weird cases where the user has requested all buffers be unibyte),
causing problems.

    Rmail takes the perfectly sensible unibyte raw-text* representation
and converts it to multibyte as part of converting its entire buffer to
multibyte.  This does *not* do what you might expect, namely convert
0x80-0xff bytes to raw 8-bit bytes.  Rather, it effectively casts
those bytes directly to Emacs internal representation unchanged.  (This
is true for byte sequences that are valid internal representations; I
believe that malformed internal representation sequences are escaped so
that writing them reproduces the same bytes; in particular,
unaccompanied continuation bytes (160-255) are turned into raw 8-bit
bytes.)
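
    The difference between the two kinds of conversion can be seen in a
small sketch.  It assumes Emacs 23 or later, where the internal
representation is based on UTF-8 and set-buffer-multibyte accepts the
flag `to'; the byte values that count as a valid internal sequence are
different in version 22.  The helper name
my-char-count-after-conversion is hypothetical.

```elisp
;; Hypothetical helper for illustration; assumes Emacs 23 or later.
(defun my-char-count-after-conversion (bytes flag)
  "Put unibyte string BYTES in a unibyte buffer, then make it multibyte.
FLAG is passed to `set-buffer-multibyte'.  Return the character count."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert bytes)
    (set-buffer-multibyte flag)
    (- (point-max) (point-min))))

;; Flag t: the two bytes are reinterpreted as internal representation.
(my-char-count-after-conversion "\303\251" t)
;; Flag `to': each byte is kept as a separate raw 8-bit byte.
(my-char-count-after-conversion "\303\251" 'to)
```

Under those assumptions the first call should count one character and
the second should count two, which is the difference between casting
the bytes and preserving them.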

    Note that raw-text* is very weird because it is the only conversion
that does not necessarily leave a valid internal representation in a
unibyte buffer after decoding.  


    The result of all this is a message in the Rmail buffer containing
possibly arbitrary code points.  Needless to say, this produces a weird
display instead of the \xxx display the user might expect in this case.
When it comes time to save the Rmail file, we may no longer be able to
use emacs-mule because of the weird code points, forcing us to use
raw-text-unix as the encoding of the Rmail file itself!  This makes
unrmail's life much more difficult, especially in version 23 as the
internal buffer representation has changed.  I'll have more to say about
this in a reply to my previous message.

    I do not believe that any of this causes data loss at the byte level
in and of itself (reading raw-text into unibyte, converting to
multibyte, and then writing it out using encoding scheme raw-text leaves
the bytes unchanged, I think, except possibly for the last code point at
the end of the message), but someone who understands these issues better
than I do should think hard about this.  Unrmail, if not rewritten
carefully, *will* cause data loss because of this, though...

    I don't know if we are still actively maintaining version 22 of
Rmail, but if so someone should fix Rmail so it no longer uses raw-text*
for decoding messages; instead a coding system should be used which
converts 0x80-0xff to raw 8-bit bytes.  (Is there already such a system?)

- Mark



