emacs-devel


Re: Need some help with Rmail/mbox


From: Stephen J. Turnbull
Subject: Re: Need some help with Rmail/mbox
Date: Sat, 20 Sep 2008 16:12:33 +0900

Paul Michael Reilly writes:

 > Thanks for stepping up to this.  Your help is very much appreciated!

You're welcome.  Eli and Richard have already responded with some
existing Rmail features, but maybe some background (somewhat
duplicating their comments) would be helpful, too.

 > I first copy the relevant headers to the view buffer by collecting
 > them from the PMAIL buffer into a string and insert the string into
 > the view buffer.

Copying to a string uses memory.  The amount of memory is not a huge
consideration these days, even with a multimegabyte buffer.  But
allocating and deallocating large strings is comparatively expensive:
malloc may need a system call to get more memory, and the data of
deallocated strings has to be compacted, or freed with possibly
another system call for large strings (I forget if Emacs uses direct
allocation for large strings instead of expanding the string data
pool).

Also, strings are read-only.  So to "edit" a string, you actually have
to copy the relevant parts to a new string; if you substitute in the
middle of a string, you have to create a bunch of new strings, one for
each fragment, and then a final string.  Lotsa consing.

 > I used the rmail.el code pretty much as is but instead of copying
 > and hiding I do selective copy and insert (ignoring the case of
 > showing all headers which is trivial).

That's reasonable, I think.

 > Then I basically copy the message body into a string and insert it
 > into the view buffer.

`insert-buffer-substring' is much more efficient.
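
A minimal sketch of the buffer-to-buffer copy, for illustration only.
The function name `my-copy-to-view' and its argument order are made up
here, not anything in Rmail; `insert-buffer-substring' itself is the
standard primitive.

```elisp
;; Hypothetical helper: copy SOURCE[START, END) into VIEW at point,
;; without consing an intermediate Lisp string.
(defun my-copy-to-view (source start end view)
  "Insert the text of buffer SOURCE between START and END into buffer VIEW."
  (with-current-buffer view
    (insert-buffer-substring source start end)))
```

Text properties come along with the copy; if that is unwanted,
`insert-buffer-substring-no-properties' takes the same arguments.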

 > But when I started to work on the decoding it seemed that decoding
 > the string before inserting it seemed like a good idea.

In XEmacs, string decoding is implemented by copying to a temporary
buffer and doing decode-coding-region there.  Emacs is likely the
same.  :-)

 > Are you essentially answering my question above and saying that
 > copying buffer to buffer is faster/better than operating on strings?

Yes.  It's faster and better.  Buffers are designed for editing.
Strings are designed for read-only text to save all the editor
overhead that buffers carry around.  Here's just one reason.  Emacs
strings are *not* arrays of characters, they are arrays of bytes,
which (from Lisp) can only be read at character boundaries.  An ASCII
character takes up 1 byte, a Latin-1 character 2 bytes, a Japanese
character 3 bytes, and (IIRC) certain user-defined characters may take
4 bytes!  This means that if you decide to substitute a Latin-1 LATIN
SMALL LETTER A WITH GRAVE for ASCII LATIN SMALL LETTER A (thus
turning voila into voilà) you can't do it in a string without
allocating a new string.
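
To make the character/byte distinction concrete, here is a small
sketch using the standard `length' and `string-bytes' functions.  The
helper name `my-char-and-byte-counts' is invented for the example, and
the byte counts assume an Emacs whose internal representation stores
Latin-1 characters in 2 bytes, as described above.

```elisp
;; Hypothetical helper: report (CHARACTERS . BYTES) for STRING.
(defun my-char-and-byte-counts (string)
  "Return a cons of the character count and byte count of STRING."
  (cons (length string) (string-bytes string)))

;; "voila" is pure ASCII: one byte per character.
(my-char-and-byte-counts "voila")
;; "voil\u00e0" ends in LATIN SMALL LETTER A WITH GRAVE, so it has the
;; same number of characters but needs one more byte.
(my-char-and-byte-counts "voil\u00e0")
```

The two strings have the same length in characters but not in bytes,
which is why the substitution cannot be done in place.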

 > I do parse out quoted-printable and base64 and apply these to the body
 > before doing the coding system based decoding.

OK.

 > > Identify header and body, add Babyl sentinels if desired
 > 
 > babyl sentinels?  I'm not sure what you mean by this.

Babyl messages are delimited with "^_" IIRC, and the original headers
with "**** BOOH ****" and "**** EOOH ****" or something like that.  I
don't remember whether any code that presents a message uses those
after narrowing (in your implementation, copying), though.  If it's
not used, you don't need them.

 > "yup" and that is what led to my request for help.  Except for the
 > case of quoted-printable and base64 I'm not sure how to parse those
 > two headers (Content-Type and Content-Transfer-Encoding) into a coding
 > system so that I can then do the decoding.

Content-Transfer-Encoding is about how bytes, *not characters*, are
represented.  For practical purposes there are four possibilities:
text is all ASCII (the default, aka 7bit), text is raw unibyte (8bit),
text is encoded as quoted-printable, and text is encoded as BASE64.
So you are done with that.
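
For reference, the four cases can be sketched as a dispatch on the
header value.  `my-decode-transfer-encoding' is a made-up name;
`base64-decode-region' is built in and
`quoted-printable-decode-region' comes from the qp library shipped
with Emacs.  This is a sketch, not what Rmail actually does.

```elisp
(require 'qp)  ; provides `quoted-printable-decode-region'

;; Hypothetical helper: undo the Content-Transfer-Encoding in place.
(defun my-decode-transfer-encoding (start end encoding)
  "Decode the region START..END according to ENCODING.
ENCODING is the Content-Transfer-Encoding value, or nil if the
header is absent.  7bit and 8bit bodies need no work."
  (let ((enc (downcase (or encoding "7bit"))))
    (cond ((string= enc "base64")
           (base64-decode-region start end))
          ((string= enc "quoted-printable")
           (quoted-printable-decode-region start end))
          (t nil))))
```

The result is still bytes; turning those into characters is the job
of the charset parameter, handled separately.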

This is entirely independent of Content-Type or its charset parameter.

 > I'm assuming the coding system guesswork becomes relevant for
 > combinations of the two headers that Rmail does not grok.

No.  If there is no Content-Type header, you "should" assume the RFC
2045 defaults (text/plain; charset=US-ASCII).  Providing commands for
the user to change those on a per message basis would be nice, but not
needed for a first release as the vast majority of non-spam mail is
MIME-conformant these days.

 > And I now see that there is a strong relationship between charset
 > and coding system.

Technically, the *MIME charset* concept is broken, or at least a very
poor name.  A "character set" is an abstract idea that is (AFAIK)
basically unstandardized.  A *coded character set* is an invertible
mapping from a set of non-negative integers to characters.  You can
think of Unicode as a universe of characters, although that's not
quite good enough for some esoteric purposes.  What Emacs calls a
"charset" is basically a coded character set.  An "encoding" is again
an abstract idea which is not really standardized, but it's pretty
close to what Emacs calls a "coding system", which is a pair of
algorithms for decoding an external text into an Emacs buffer, and for
doing the reverse, plus some auxiliary parameters and functions for
specialized purposes (eg, for detecting the encoding of an unknown
text).  As you recognized, this is basically the same thing as a "MIME
charset".

You should not need to deal with Emacs charsets, by the way.  Just
remember that "MIME charset == Emacs coding-system" and you'll do
fine.
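
A rough sketch of that rule of thumb.  Both function names are
invented for illustration, and the assumption that a downcased MIME
charset name is also the name of an Emacs coding system holds for the
common cases (us-ascii, iso-8859-1, utf-8) but not necessarily for
every registered charset.

```elisp
;; Hypothetical helper: map a MIME charset name to a coding system.
(defun my-mime-charset-to-coding-system (charset)
  "Return the coding system named by CHARSET, or nil if Emacs has none.
A nil CHARSET means the parameter was absent, so use the MIME
default of US-ASCII."
  (let ((cs (intern (downcase (or charset "us-ascii")))))
    (and (coding-system-p cs) cs)))

;; Hypothetical helper: decode a body in place, in the buffer.
(defun my-decode-body (start end charset)
  "Decode the region START..END using the MIME CHARSET.
Leave the text alone if the charset is unknown to Emacs."
  (let ((cs (my-mime-charset-to-coding-system charset)))
    (when cs
      (decode-coding-region start end cs))))
```

Returning nil for an unknown charset leaves the policy (raw text,
guessing, or asking the user) to the caller.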

 > OK, this is helpful.  I assume that for all other type/subtype cases
 > we punt for now and use guessing or just raw text?

For text/* types, just use the raw text (there should be a charset
parameter if it is not ASCII).

 > But certainly there are some that we want to process/decode in some
 > fashion, e.g. text/html or text/xml.  Is there another Emacs
 > package/library that you are aware of that provides a good model
 > for where we want to take Rmail so that it handles more
 > type/subtype cases seamlessly in the view buffer?

Gnus, VM, tm (aka "Tools for MIME", obsoleted by SEMI and unsupported),
SEMI (obsolete and unsupported I believe), WEMI (IIRC a C library to
link into Emacs, based on SEMI, obsolete and unsupported I guess),
MH-E, Mew, Wanderlust (these last three I don't know about the
implementations, they may borrow from Gnus).

Both VM and Gnus use the model I suggested of dispatching on type and
subtype.  Some naming convention like `mime-handler-TYPE/SUBTYPE'
could be used.

    (let ((handlers (list (intern (format "mime-handler-%s/%s" type subtype))
                          (intern (format "mime-handler-%s/*" type))
                          'mime-handler-*/*))
          (done nil))
      ;; Try handlers from most to least specific and stop at the
      ;; first one that is actually defined.
      (while (and handlers (not done))
        (if (functionp (car handlers))
            (progn
              (funcall (car handlers) body-start body-end)
              (setq done t))
          (setq handlers (cdr handlers))))
      (unless done
        ;; `warn' may be an XEmacs-ism, sorry
        (warn "no handler defined for %s/%s" type subtype)))

 > Even perhaps audio and video (not pure MIME, i.e. multipart
 > ... yet).

You *need* multipart as quickly as possible.  Too much mail is sent
as multipart.  It's not that hard, you just parse the MIME bodies
recursively, and throw away the bodies you don't know how to handle.
I'm sure Rmail already knows how to do this.
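
To give an idea of the first step, here is a sketch that finds the
parts of one multipart body.  The name `my-mime-split-multipart' is
made up, it assumes the buffer is narrowed to the multipart body and
uses LF line endings, and it is not how Rmail, Gnus or VM do it.
Recursion is just a matter of narrowing to each part and repeating for
parts that are themselves multipart.

```elisp
;; Hypothetical helper: locate the parts between boundary lines.
(defun my-mime-split-multipart (boundary)
  "Return a list of (START . END) regions, one per MIME part.
BOUNDARY is the value of the Content-Type boundary parameter.
The preamble and epilogue are skipped."
  (let ((delim (concat "^--" (regexp-quote boundary) "\\(--\\)?[ \t]*$"))
        (parts nil)
        (start nil))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward delim nil t)
        (let ((delim-start (match-beginning 0))
              (closing (match-beginning 1)))
          ;; The newline just before a delimiter belongs to the
          ;; delimiter, not to the part.
          (when start
            (push (cons start (max start (1- delim-start))) parts))
          (forward-line 1)
          ;; After the closing "--BOUNDARY--" only the epilogue follows.
          (setq start (if closing nil (point))))))
    (nreverse parts)))
```

Each returned region starts at the part's own headers, so the same
header/body split used for the top-level message applies to it.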

You should also provide a way of listing MIME bodies found and saving
their raw bytes to a file.  (That's just a matter of applying the
relevant Content-Transfer-Encoding to the MIME body, and then
write-region.)
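
Something along these lines, as a sketch only.  `my-save-mime-body' is
an invented name; the work is done in a temporary unibyte buffer so
the message itself is not modified, and `coding-system-for-write' is
bound to `binary' so that `write-region' does no conversion.

```elisp
(require 'qp)  ; provides `quoted-printable-decode-region'

;; Hypothetical helper: save one MIME body's raw bytes to FILE.
(defun my-save-mime-body (start end encoding file)
  "Write the MIME body in START..END of the current buffer to FILE.
ENCODING is the body's Content-Transfer-Encoding, or nil."
  (let ((source (current-buffer))
        (enc (downcase (or encoding "7bit")))
        (coding-system-for-write 'binary))
    (with-temp-buffer
      (set-buffer-multibyte nil)
      (insert-buffer-substring source start end)
      (cond ((string= enc "base64")
             (base64-decode-region (point-min) (point-max)))
            ((string= enc "quoted-printable")
             (quoted-printable-decode-region (point-min) (point-max))))
      ;; A non-nil, non-t VISIT argument suppresses the "Wrote" message.
      (write-region (point-min) (point-max) file nil 'silent))))
```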

 > >         Wash header for presentation, eg:
 > >             Hide non-displayed header
 > >             Decode RFC 2047-encoded headers
 > 
 > OK, this is helpful but I would add that non-displayed headers do not
 > need to be in the view buffer at all.  It contains all the headers or
 > just the displayed headers, depending on the User's current desire.

I find being able to toggle display of the full set of headers useful,
and I use it several times every day.  I would find this easier to
implement if the headers are there but hidden.  YMMV, of course.
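
Here is roughly what I mean, as a sketch.  All the `my-' names and the
list of displayed headers are invented for the example; the mechanism
(overlays with an `invisible' property plus
`buffer-invisibility-spec') is standard Emacs.

```elisp
;; Hypothetical: which header fields stay visible.
(defvar my-displayed-headers-regexp "^\\(From\\|To\\|Subject\\|Date\\):")

(defun my-hide-headers ()
  "Cover every non-displayed header field with an invisible overlay."
  (save-excursion
    (goto-char (point-min))
    (let ((header-end (if (re-search-forward "^$" nil t) (point) (point-max))))
      (goto-char (point-min))
      (while (< (point) header-end)
        (let ((start (point))
              (keep (looking-at my-displayed-headers-regexp)))
          (forward-line 1)
          ;; Folded continuation lines belong to the same field.
          (while (and (< (point) header-end) (looking-at "[ \t]"))
            (forward-line 1))
          (unless keep
            (overlay-put (make-overlay start (point))
                         'invisible 'my-hidden-header))))))
  (add-to-invisibility-spec 'my-hidden-header))

(defun my-toggle-headers ()
  "Toggle between the displayed headers and the full set."
  (interactive)
  (if (and (consp buffer-invisibility-spec)
           (memq 'my-hidden-header buffer-invisibility-spec))
      (remove-from-invisibility-spec 'my-hidden-header)
    (add-to-invisibility-spec 'my-hidden-header)))
```

Toggling then never touches the buffer text, only the invisibility
spec, so it is cheap and cannot lose a header.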

 > >         Wash body for presentation, eg:
 > >             Highlight and activate url-like substrings
 > >             Highlight quoted material
 > 
 > I don't believe Rmail does either of these operations now.  Is that
 > your understanding?

I count the interval that I've not used Rmail by decades. :-)  My
contribution is as a standards geek and having gotten my hands dirty
on several MUAs.

URLs are easy, of course:

    (while (re-search-forward url-re nil t)
      (let ((o (make-overlay (match-beginning 0) (match-end 0))))
        (overlay-put o 'face 'url-active-face)
        ;; sorry, this may also be an XEmacs-ism
        (overlay-put o APPROPRIATE-ARGS-TO-ADD-FOLLOW-URL-TO-KEYMAP)))
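
A somewhat more filled-in sketch for GNU Emacs follows.  The `my-'
names, the deliberately crude URL regexp, and the choice of RET as the
key are all assumptions for the example; the `keymap' overlay
property, the `link' face and `browse-url' are standard.

```elisp
;; Hypothetical, intentionally simple pattern for http/https URLs.
(defvar my-url-regexp "\\bhttps?://[^ \t\n<>\"]+")

(defun my-follow-url-at-point ()
  "Visit the URL recorded on the overlay at point, if any."
  (interactive)
  (let ((url (get-char-property (point) 'my-url)))
    (if url
        (browse-url url)
      (message "No URL at point"))))

;; Keymap that is active only while point is on a URL overlay.
(defvar my-url-keymap
  (let ((map (make-sparse-keymap)))
    (define-key map (kbd "RET") 'my-follow-url-at-point)
    map))

(defun my-activate-urls ()
  "Highlight and activate URL-like substrings in the current buffer."
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward my-url-regexp nil t)
      (let ((o (make-overlay (match-beginning 0) (match-end 0))))
        (overlay-put o 'face 'link)
        (overlay-put o 'mouse-face 'highlight)
        (overlay-put o 'keymap my-url-keymap)
        ;; Remember the URL text so the command need not re-parse it.
        (overlay-put o 'my-url (match-string-no-properties 0))))))
```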

Quoting is harder because of the variety of quoting styles.  You might
want to make this easy for users to configure.  Kyle Jones's filladapt
package is quite good at detecting quoting styles and is configurable.
As you know, Kyle is a curmudgeon about assignment, but reading the
docs for ideas about UI is probably OK (but check with FSF legal or
Richard; IANAL nor an FSF spokesperson).



