Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new sc
From: Earl Hood
Subject: Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new scan
Date: Sun, 14 Nov 2010 22:20:55 -0600
On Sun, Nov 14, 2010 at 9:00 PM, Jon Steinhart wrote:
>> IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
>> by an RFC 822 parser. As a consequence, unencoded white space
>> characters (such as SPACE and HTAB) are FORBIDDEN within an
>> 'encoded-word'. For example, the character sequence
>>
>> =?iso-8859-1?q?this is some text?=
>>
>> would be parsed as four 'atom's, rather than as a single 'atom' (by
>> an RFC 822 parser) or 'encoded-word' (by a parser which understands
>> 'encoded-words'). The correct way to encode the string "this is some
>> text" is to encode the SPACE characters as well, e.g.
>>
>> =?iso-8859-1?q?this=20is=20some=20text?=
>
> Well sure, that's in the RFC but it doesn't really make a lot of sense to me.
It does, from a historical perspective of email. RFC 822 was the de facto
standard, so the MIME specification attempted not to break 822 and to play
nice with systems that were not MIME-aware. Also, the MIME spec writers
saw MIME as something that could be added on to existing systems w/o
requiring re-implementation of mail parsing code.
Therefore, as the note states, spaces cannot appear inside encoded words:
a MIME implementation sitting on top of an existing 822 parsing system
would never see the full encoded string, because the 822 tokenizer would
already have broken it up into separate tokens at each space.
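
To illustrate, here is a minimal Python sketch (not nmh code; the helper
name atoms() is made up for this example). It approximates an 822
tokenizer by splitting on linear white space only. A real 822 lexer
also has to handle specials, quoted strings, and comments.

```python
import re


def atoms(field_body: str) -> list[str]:
    """Crude stand-in for an RFC 822 tokenizer: split on white space only."""
    return re.split(r"[ \t\r\n]+", field_body.strip())


bad = "=?iso-8859-1?q?this is some text?="
good = "=?iso-8859-1?q?this=20is=20some=20text?="

print(atoms(bad))   # the unencoded spaces split this into four tokens
print(atoms(good))  # stays a single token that a MIME layer can recognize
```

A MIME layer that only ever receives the tokens from the first case has
no single token that begins with "=?" and ends with "?=".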
> Would be way more sensible in my opinion to decode everything and then
> parse it as it would eliminate a zillion special cases in RFC-land.
In general, non-ASCII encoded data only occurs in a few header fields,
mainly those that are entered/edited by the user, like subject and
recipient fields.
> And, the fact that
> you can't have an encoded word for H next to an encoded word for I to make HI
> just leads to to the RFC2231 ugliness. In any case, they chose to do it the
> overly complex way. But my question is really what do do when somebody sends
> me this:
>
> =?iso-8859-1?q?this is some text?=
A MIME compliant system would leave it untouched since it is not
a valid encoded word.
If nmh's core parsing engine were an RFC 822 (or 2822) tokenizer, and
the MIME parsing worked against 822 tokens (which was the expected way
to implement MIME), then the MIME layer would never see the full, bad,
encoded sequence (since it was broken up).
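
As a rough sketch of that layering (Python rather than nmh's C; the
names ENCODED_WORD and decode_token() are hypothetical), a MIME pass
that works token by token might decode only tokens that are complete,
well-formed encoded words and hand everything else back untouched:

```python
import base64
import quopri
import re

# =?charset?encoding?payload?= with no white space or "?" inside any part
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_token(token: str) -> str:
    """Decode one 822 token if it is a valid encoded word, else return it as is."""
    m = ENCODED_WORD.fullmatch(token)
    if m is None:
        return token
    charset, encoding, payload = m.groups()
    try:
        if encoding.lower() == "b":
            raw = base64.b64decode(payload, validate=True)
        else:
            # header=True makes "_" decode to a space, as the Q encoding requires
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or undecodable payload: leave the token alone
        return token


print(decode_token("=?iso-8859-1?q?this=20is=20some=20text?="))
print(decode_token("=?iso-8859-1?q?this"))  # fragment of the bad example
```

With this structure the fragments of the bad example fall through the
"m is None" branch, so no special case is needed for them.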
> Seems more sensible to treat the whole thing as an encoded word and to decode
> it.
> Are you suggesting that I should just treat it as text and not decode it?
The answer somewhat depends on how you implement things. If you implement
things as an 822 parsing pass followed by a MIME parsing pass, the
above invalid string would not get decoded. If you are short-cutting
full 822 parsing, then the decision is not as clear. Of course such
short-cutting may make your parsing not deal with other things that
are 822 legal. IIRC, full 822 parsing is kind of ugly. I think 2822
attempted to simplify some things, but I have not looked into
2822 in depth.
>> As for space between encoded word, such space should be
>> collapsed. I.e. Two adjacent encoded words should be
>> concatenated together after decoding, with no space between
>> them.
>
> Where in what RFC do you find this. RFC2047 section 5, (1) says that encoded
> words must be separated from each other by linear white space but doesn't say
> that that white space is later removed.
The RFC says that encoded words must be limited in length (75 chars,
including the encoding meta-chars, e.g. =?...?=). If a given piece of
text is very long, then it must be split into multiple encoded words that
are folded (CRLF SP). Hence, during the 822 parsing pass, each split word
would become its own token (since there is a CRLF SP separator), so in
order to correctly reconstruct the original, unencoded text, the two
tokens must be concatenated together after decoding.
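
RFC 2047 section 6.2 also says that, when displaying a header field,
linear white space separating a pair of adjacent encoded words is
ignored. A minimal Python sketch of that reassembly follows; the
function names are hypothetical, it assumes every encoded word it meets
is well formed (no error handling), and it does not unfold white space
around ordinary text.

```python
import base64
import quopri
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_word(m: re.Match) -> str:
    """Decode one matched encoded word (assumed valid)."""
    charset, encoding, payload = m.groups()
    if encoding.lower() == "b":
        raw = base64.b64decode(payload)
    else:
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset)


def decode_unstructured(value: str) -> str:
    """Decode encoded words, dropping white space between adjacent ones."""
    # Capturing split: tokens land at even indexes, white space at odd ones
    parts = re.split(r"([ \t\r\n]+)", value)
    out = []
    prev_encoded = False
    pending_ws = ""
    for i, part in enumerate(parts):
        if i % 2 == 1:
            pending_ws = part
            continue
        m = ENCODED_WORD.fullmatch(part)
        # Keep the separator unless it sits between two encoded words
        if not (m and prev_encoded):
            out.append(pending_ws)
        out.append(decode_word(m) if m else part)
        prev_encoded = m is not None
        pending_ws = ""
    return "".join(out)


print(decode_unstructured("=?iso-8859-1?q?H?=\r\n =?iso-8859-1?q?I?= there"))
```

Each word is decoded on its own before joining, which matches the
token-at-a-time model described above.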
> Hmm. Where in what RFC is this prohibited? I'll agree that it doesn't make
> a whole lot of sense to have so many mechanisms that do the same thing, but
> what harm would come from this if it was decoded properly.
Section 5 of the RFC enumerates where encoded words are allowed.
> So once again, I'm not asking what is proper when encoding a message. I'm
> asking for guidance on sensible behavior when decoding an improperly
> encoded message.
Ah, the "be liberal in what you accept" mantra. IMO, it is a bad policy,
as time and experience have shown us, especially from a computer security
perspective. It also allowed bad implementations to get away with things
that should have been corrected from day one.
The ultimate answer depends on how often such bad data occurs in the real
world. If it is rare, I do not see that it is worth the effort to
complicate the parsing code to deal with it, especially when not dealing
with it will not really break anything (i.e. the mail message can still
be read, nmh will not crash, etc).
> I'm unaware of any cases where the character sequences for encoded words
> would appear in any properly formatted items such as dates or addresses.
> So it seems to me that no harm would be done if I decoded such illegal
> stuff anyway as the alternative is an error message.
Actually, non-ASCII encoded words can occur in address fields, mainly
in the comments, since comments in address fields are commonly used to
show the human names of the associated addresses.
Of course, they are not allowed in the address itself.
> I'm trying to design a simple piece of code that will reasonably process
> everything. Of course, it can't be that simple since there are two
> incompatible Q encodings and other such cruft. But I really don't want to
> have to parse every single type of header because it's pretty much all
> text from the scan point of view.
If you have an 822 tokenizer, then section 5 of RFC 2047 tells you
in which token types non-ASCII decoding can be done.
Section 5 even shows the modified ABNF grammar for the 822 rules in
which an encoded word can appear.
I know I'm probably not giving you the answer you need, but the
answer ultimately depends on what kind of parser you plan on
implementing, and on how frequently bad data that nmh should deal
with actually occurs in the wild.
--ewh
P.S. As a point of experience, in my Perl program that parses
mail, including MIME mail, I initially took the easier route
of parsing header fields vs full tokenization. However, over
the years, the simpler parsing did not work for all legitimate
cases, so I ended up incorporating more robust 822 parsing on
select header fields to ensure correct behavior in my program.
There is a risk when trying to take shortcuts with header field
parsing. You may be able to handle most cases just fine, but the
minority cases may cause problems, and then you end up trying
to hack/patch the parser to deal with them, which can get
ugly, fast.