help needed with coding systems (unrmail problems)

emacs-devel

From:	Mark Lillibridge
Subject:	help needed with coding systems (unrmail problems)
Date:	Thu, 13 Jan 2011 15:22:40 -0800

[eggs.gnu.org took 5 days to bounce this; I have already replied to this
with more information...]


    I'm at my wit's end trying to debug a subtle and nasty unrmail bug
where unrmail mangles the character encodings.  I'm many, many hours
down this particular rathole, but let me try to explain the problem
"briefly".  Please ask for clarification or more experiments as needed.


    Ok, I have a Rmail Babyl file whose contents are correctly encoded
via raw-text-unix (V22) -- for those curious, I believe this can be
caused by receiving a MIME e-mail with two parts with different and
incompatible encodings.  One of the messages contains a Latin-1 u with
two dots over it (below from a V22 emacs):

  character: ü (2300, #o4374, #x8fc, U+00FC)
    charset: latin-iso8859-1
             (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x7C
     syntax: w  which means: word
   category: l:Latin
buffer code: #x81 #xFC
  file code: #xC3 #xBC (encoded by coding system mule-utf-8-unix)

I have verified that this character is represented on disk as 81 FC
(hex).  If I also visit that file literally, I see \201\374, which is
octal for 81 FC, as expected.
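
For anyone who wants to reproduce the two byte values above, here is a minimal sketch.  It assumes Emacs 23 or later and uses utf-8-unix where this message says mule-utf-8-unix.  The names my-hex-bytes, my-disk-bytes and my-decoded are hypothetical, chosen only for illustration.  It decodes the on-disk bytes as emacs-mule and re-encodes the result as UTF-8.

```elisp
;; Sketch only; assumes Emacs 23 or later.
;; `my-hex-bytes' is a hypothetical helper, not part of Emacs.
(defun my-hex-bytes (unibyte-str)
  "Return the bytes of UNIBYTE-STR as space-separated hex pairs."
  (mapconcat (lambda (byte) (format "%02X" byte)) unibyte-str " "))

;; The two bytes found on disk, written as octal escapes (a unibyte string).
(defvar my-disk-bytes "\201\374")

;; Decoding them as emacs-mule yields the single character U+00FC.
(defvar my-decoded (decode-coding-string my-disk-bytes 'emacs-mule))

;; Re-encoding that character as UTF-8 gives the "file code" from the
;; character description above, C3 BC.
(my-hex-bytes (encode-coding-string my-decoded 'utf-8-unix))
```

Evaluating the last form in *scratch* prints the hex string, which can be compared against a hex dump of unrmail's output file.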


    When I fire up unrmail on this file, it first reads it in as
"raw-text-unix":

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)


It then decodes the main part of the file containing the messages:

      (unless (and coding-system
                   (coding-system-p coding-system))
        (setq coding-system
              ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
              ;; earlier versions did that with the current buffer's encoding.
              ;; So we want to favor detection of emacs-mule (whose normal
              ;; priority is quite low), but still allow detection of other
              ;; encodings if emacs-mule won't fit.  The call to
              ;; detect-coding-with-priority below achieves that.
              (car (detect-coding-with-priority
                    from to
                    '((coding-category-emacs-mule . emacs-mule))))))
      (message "decoding file with %s" coding-system)
      (unless (memq coding-system
                    '(undecided undecided-unix))
        (set-buffer-modified-p t)       ; avoid locking when decoding
        (let ((buffer-undo-list t))
          (decode-coding-region from to coding-system))
        (setq coding-system last-coding-system-used))
      (message "actual coding system used: %s" coding-system)

I have verified via the inserted message calls above that it is decoding
using raw-text-unix here.


    It then writes out the modified message (after rewriting some
headers and the like; no changes to 8-bit characters) by encoding using
the coding system that message was originally decoded with
(mule-utf-8-unix):

              ;; If the message specifies a coding system, use it.
              (let ((maybe-coding (mail-fetch-field "X-Coding-System")))
                (if maybe-coding
                    (setq coding
                          ;; Force Unix EOLs.
                          (coding-system-change-eol-conversion
                           (intern maybe-coding) 0))
                  ;; If there's no X-Coding-System header, assume the
                  ;; message was never decoded.
                  (setq coding 'raw-text-unix)))
            ...
            ;; Write it to the output file, suitably encoded.
            ;(debug)
            (let ((coding-system-for-write coding))
              (write-region (point-min) (point-max) to-file t
                            'nomsg))
            (message "was %s now %s" coding last-coding-system-used)

Again, I verified via the inserted message call that this is correctly
mule-utf-8-unix.


    In a sane universe, this would result in the message in the output
file containing the UTF-8 for this character, C3 BC.  However, what I
actually get is 81 FC -- the same as we started with!  

    I conjecture that this is caused by the change in Emacs's internal
representation.  Whereas raw-text-unix -> mule-utf-8-unix on V22 is an
encoding change, in V23 it probably is not, at least for sane byte
sequences.  (Remember that we are running unrmail on a V23 emacs.)  Can
anyone verify this conjecture?  Google pretty much returns nothing
useful for information on how emacs' coding systems work.
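
One way to probe this conjecture outside unrmail is sketched below, for Emacs 23 or later.  It makes an assumption that is not verified against unrmail itself: that after the raw-text read, the multibyte buffer holds 81 and FC as raw 8-bit characters.  string-to-multibyte is used here to build exactly that kind of text.  The name my-raw-text is hypothetical, and utf-8-unix stands in for mule-utf-8-unix.

```elisp
;; Sketch only; assumes Emacs 23 or later.
;; Model the two bytes as raw 8-bit characters (an assumption about what
;; the buffer holds after reading 81 FC with raw-text).
(defvar my-raw-text (string-to-multibyte "\201\374"))

;; Both characters belong to the `eight-bit' charset; neither is U+00FC.
(mapcar #'char-charset (string-to-list my-raw-text))

;; Encoding raw 8-bit characters as UTF-8 emits the original bytes
;; unchanged, so this list is (129 252), i.e. 81 FC.
(append (encode-coding-string my-raw-text 'utf-8-unix) nil)
```

If the buffer really does hold raw 8-bit characters at write time, this would account for 81 FC coming back out unchanged, since no Latin-1 character is present for the UTF-8 encoder to convert.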


    Ok, I said, if true, there should be an easy workaround for now: run
unrmail on a V22 emacs instead.  I did so, and the debugging messages
show the same coding system names being used.  However, now the file
contains C2 81 FC, which is still wrong!  More mysteriously, if I read
in that file in V22 using raw-text-unix (being careful to keep Rmail
from starting automatically on the buffer) and then write the file out
using mule-utf-8-unix, I *do* get the expected C3 BC.
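
A note on the bytes C2 81 FC themselves: C2 81 is the UTF-8 encoding of the control character U+0081, and a lone FC is not valid UTF-8.  One possible reading, offered only as a hypothesis, is that \201 was encoded as a character in its own right while \374 was passed through as a byte.  The sketch below (Emacs 23 or later; my-bad-output is a hypothetical name and utf-8-unix stands in for mule-utf-8-unix) checks only the byte arithmetic, not what V22 does internally.

```elisp
;; Sketch only; assumes Emacs 23 or later.
;; U+0081 encoded as UTF-8 is the two bytes C2 81, i.e. (194 129).
(append (encode-coding-string (string #x81) 'utf-8-unix) nil)

;; The three bytes seen in the V22 output file.
(defvar my-bad-output "\302\201\374")

;; Decoded as UTF-8 they give two characters: U+0081, then the stray
;; byte FC kept as a raw 8-bit character because it is not valid UTF-8.
(mapcar #'char-charset
        (string-to-list (decode-coding-string my-bad-output 'utf-8-unix)))
```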


    So something about the exact way that unrmail is doing things is
messing things up.  As a test, I stopped unrmail after it read in the
file but before it decoded it:

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)

If I write out that buffer via write-region using coding system
mule-utf-8-unix, I get the erroneous bytes (C2 81 FC) in the output
file.  The same thing happens if I do this just before the buffer is
set to multibyte.  Mind you, I see the same characters (\201\374) in the
buffer in all three cases before I write it out, so some invisible
property of the buffer must be different.


    So insert-file-contents is doing something differently from just
visiting the file that matters.  Unfortunately, the help documentation
for insert-file-contents gives no help on this.


    Does anyone have any ideas on what might be going on?

- Thanks,
  Mark
