Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new sc
From: Jon Steinhart
Subject: Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new scan
Date: Sun, 14 Nov 2010 19:00:24 -0800
> On Sun, Nov 14, 2010 at 11:45 AM, Jon Steinhart <address@hidden> wrote:
> > My preference is to say that we'll treat any =?...?= as an encoded word
> > wherever it appears and that we'll decode it. It appears that the authors of
> > RFC2047 expect that everything will be parsed into tokens and examined before
> > looking for encoded words.
>
> You're right. RFC 822 defined the basic tokenization rules,
> and MIME attempts to stay compatible with that. I.e., you have
> a system that knows how to do RFC 822 tokenization, and then
> that token data can be passed to the MIME-aware layer.
> Here is a relevant note from RFC 2047:
>
> IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
> by an RFC 822 parser. As a consequence, unencoded white space
> characters (such as SPACE and HTAB) are FORBIDDEN within an
> 'encoded-word'. For example, the character sequence
>
> =?iso-8859-1?q?this is some text?=
>
> would be parsed as four 'atom's, rather than as a single 'atom' (by
> an RFC 822 parser) or 'encoded-word' (by a parser which understands
> 'encoded-words'). The correct way to encode the string "this is some
> text" is to encode the SPACE characters as well, e.g.
>
> =?iso-8859-1?q?this=20is=20some=20text?=
Well sure, that's in the RFC but it doesn't really make a lot of sense to me.
It would be way more sensible in my opinion to decode everything and then parse
it, as that would eliminate a zillion special cases in RFC-land. And the fact
that you can't have an encoded word for H next to an encoded word for I to make
HI just leads to the RFC2231 ugliness. In any case, they chose to do it the
overly complex way. But my question is really what to do when somebody sends
me this:
=?iso-8859-1?q?this is some text?=
Seems more sensible to treat the whole thing as an encoded word and to decode
it.
Are you suggesting that I should just treat it as text and not decode it?
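To make the "decode it anyway" option concrete, here is a minimal Python sketch of a lenient decoder. It is only an illustration under stated assumptions, not nmh code: the names `ENCODED_WORD`, `decode_q`, `decode_word` and `lenient_decode` are made up for this example, and the pattern deliberately accepts spaces inside the payload, which RFC 2047 forbids.

```python
import base64
import re

# Lenient pattern (an assumption of this sketch): unlike RFC 2047,
# the payload between the last two "?" may contain white space.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?(.*?)\?=")


def decode_q(payload):
    """Decode "Q" text: "_" stands for a space, "=XX" is a hex byte."""
    raw = payload.replace("_", " ").encode("latin-1", "replace")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def decode_word(match):
    """Decode one =?charset?encoding?payload?= match to text."""
    charset, encoding, payload = match.groups()
    try:
        if encoding.lower() == "q":
            data = decode_q(payload)
        else:
            data = base64.b64decode(payload)
        return data.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or undecodable payload: keep the original text
        # rather than raising an error.
        return match.group(0)


def lenient_decode(text):
    """Decode every =?...?= sequence found anywhere in text."""
    return ENCODED_WORD.sub(decode_word, text)


print(lenient_decode("=?iso-8859-1?q?this is some text?="))
```

With this approach the improperly encoded example and the properly encoded `=?iso-8859-1?q?this=20is=20some=20text?=` both come out as the same plain text, and anything that cannot be decoded is passed through untouched.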
> As for space between encoded words, such space should be
> collapsed. I.e. Two adjacent encoded words should be
> concatenated together after decoding, with no space between
> them.
Where in what RFC do you find this? RFC2047 section 5, (1) says that encoded
words must be separated from each other by linear white space but doesn't say
that that white space is later removed. Collapsing out the space doesn't seem
to make sense since there seems to be some intent to treat encoded words as
normal text words which one would normally separate by spaces.
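The two readings can be put side by side in a short Python sketch. This is an illustration only: `decode_header_text` and its `collapse` flag are hypothetical names, and the pattern here is the strict one (no spaces inside an encoded word). For what it's worth, the passage usually cited for the collapsing behavior is RFC 2047 section 6.2, which talks about ignoring the white space between adjacent encoded words when a header is displayed.

```python
import base64
import re

# Strict encoded-word pattern: no white space allowed inside the word.
WORD = r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?="
WORD_PARTS = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")
# White space that sits between two adjacent encoded words.
GAP = re.compile(r"(" + WORD + r")[ \t\r\n]+(?=" + WORD + r")")


def decode_word(match):
    """Decode one encoded word; leave it unchanged if that fails."""
    charset, encoding, payload = match.groups()
    try:
        if encoding.lower() == "q":
            raw = payload.replace("_", " ").encode("ascii")
            data = re.sub(rb"=([0-9A-Fa-f]{2})",
                          lambda m: bytes([int(m.group(1), 16)]), raw)
        else:
            data = base64.b64decode(payload)
        return data.decode(charset)
    except (LookupError, ValueError):
        return match.group(0)


def decode_header_text(text, collapse=True):
    """Decode encoded words, optionally dropping the gaps between them."""
    if collapse:
        # Remove white space only where both neighbours are encoded words.
        text = GAP.sub(r"\1", text)
    return WORD_PARTS.sub(decode_word, text)


sample = "=?iso-8859-1?q?H?= =?iso-8859-1?q?I?="
print(decode_header_text(sample, collapse=True))
print(decode_header_text(sample, collapse=False))
```

Note that only the gap between two encoded words is touched; white space between an encoded word and ordinary text is kept in either mode.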
> Note, it is a mistake to blindly assume that all sequences
> of =?...?= should be decoded, which has led to some erroneous
> uses by some software. For example, using =?...?= inside
> parameter values vs using RFC 2184 (now RFC 2231).
Hmm. Where in what RFC is this prohibited? I'll agree that it doesn't make
a whole lot of sense to have so many mechanisms that do the same thing, but what
harm would come from this if it was decoded properly?
So once again, I'm not asking what is proper when encoding a message. I'm asking
for guidance on sensible behavior when decoding an improperly encoded message.
I'm unaware of any cases where the character sequences for encoded words would
appear in any properly formatted items such as dates or addresses. So it seems
to me that no harm would be done if I decoded such illegal stuff anyway as the
alternative is an error message.
I'm trying to design a simple piece of code that will reasonably process
everything. Of course, it can't be that simple since there are two incompatible
Q encodings and other such cruft. But I really don't want to have to parse every
single type of header because it's pretty much all text from the scan point of
view.
Jon