nmh-workers

Re: [Nmh-workers] What should nhm do with busted Subject: lines?


From: Ken Hornstein
Subject: Re: [Nmh-workers] What should nhm do with busted Subject: lines?
Date: Wed, 04 Nov 2015 19:08:38 -0500

>So I got an e-mail from an Outlook abuser that had some UTF-8 smart
>quote characters in the Subject: line - sans RFC2047 encoding, just
>bare UTF-8 characters, naked as the day they were typed, plonked in the
>middle of the line.
>
>What *should* nmh do here (given that we don't have a way to tell it
>was UTF-8 versus an ISO8859-N or 2022 or what-have-you)?

Technically ... those are legal nowadays.  See RFC 6532, which allows
raw UTF-8 in header fields.  That's a message/global message.

What should we do?  We should deal with it.  I think we might not do so
well right now.  Okay, fine, what does 'deal with it' mean?  Well ...
technically the only valid 'raw' 8-bit characters in headers are UTF-8.
But I am aware that some busticated MUAs still send raw 8-bit data in
other character sets.

I see two possible ways to deal with it better:

1) Assume any unencoded 8-bit characters in email headers are UTF-8.  Treat
   them as UTF-8, which means converting to the local character set if
   necessary.  If it turns out those bytes are not UTF-8, then either they'll
   fail character conversion or end up as mojibake on a user's terminal (well,
   they'll probably end up as the Unicode replacement character, U+FFFD).

2) Do 1), except check first to see if all of the 8-bit sequences are
   valid UTF-8 encoding (it's possible for a sequence of 8-bit bytes in
   some other character set to also be valid UTF-8, but very unlikely).
   If it's all valid, treat as 1).  Otherwise use substitution characters
   for everything 8-bit.
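
To make the difference between the two options concrete, here is a rough
sketch in Python (not nmh code, and not a proposal for how nmh should
implement it).  The function names and the "?" substitution character are
made up for illustration, and the conversion to the local character set is
left out.

```python
def decode_assume_utf8(raw: bytes) -> str:
    """Option 1 (hypothetical helper): treat all 8-bit data as UTF-8.

    Byte sequences that are not valid UTF-8 come out as U+FFFD,
    the Unicode replacement character.
    """
    return raw.decode("utf-8", errors="replace")


def decode_validate_first(raw: bytes, substitute: str = "?") -> str:
    """Option 2 (hypothetical helper): validate the whole field first.

    If every 8-bit sequence is valid UTF-8, decode as in option 1.
    Otherwise keep the 7-bit ASCII and substitute every 8-bit byte.
    """
    try:
        # A strict decode raises on the first invalid UTF-8 sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return "".join(chr(b) if b < 0x80 else substitute for b in raw)
```

With a field body that mixes a Latin-1 byte and real UTF-8 smart quotes,
option 1 keeps the quotes and marks only the bad byte, while option 2 gives
up on all of the 8-bit data in that field.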

--Ken


