Re: [Pan-users] Re: How to set character encoding for the Header pane?


From: Per Hedeland
Subject: Re: [Pan-users] Re: How to set character encoding for the Header pane?
Date: Sat, 24 Feb 2007 01:24:25 +0100 (CET)

Duncan <address@hidden> wrote:
>
>Internet message headers as used in both mail and news were originally
>specified as 7-bit US-ASCII/ANSI, and for backward compatibility and
>interoperability, that remains the case at the "raw" level.  The problem is
>that while the standards were later adapted to allow other charsets in the
>message body, with the MIME standards (see RFCs 2041-2045)

I'm afraid your numbers are a bit off - e.g. 2041 is "Mobile Network
Tracing" etc. The MIME RFCs are 2045 ("Multipurpose Internet Mail
Extensions (MIME) Part One") to 2049 ("Multipurpose Internet Mail
Extensions (MIME) Part Five"), though 2045-2047 are the actual format
standards. The purpose of this apparent nitpicking will be obvious in a
moment.:-)

>There is however a newer (I believe) and less formal (I'm not sure how far
>it has gotten in the RFC standardization process) workaround.  I've never
>found the details useful enough to take the necessary time to grok, but in
>somewhat vague and likely not entirely correct terms, a header can include
>an inline ISO-standard charset reference and then be encoded into it.

I'm not sure why you think this is of any lesser "weight" than the
others. It is in fact one of the sequence that you *should* have given
above, namely 2047. It goes back to the original pair of MIME RFCs (it
was 1342 in the set 1341+1342, then 1522 in the set 1521+1522), and it's
certainly as formal as any standards-track RFC. FWIW, all of 2045-2047
and 2049 are "DRAFT STANDARD".

But anyway, the header is not "encoded into the charset", it *is* in the
charset given by the reference, and encoded into Quoted-"Printable" or
Base64, just like the body may be (there are some minor differences in
the encoding rules vs those that are used for the body). In effect it's
the equivalent of the Content-Type + Content-Transfer-Encoding headers +
the body, but for a single header line, and all contained in that single
header line. The main difference is that a header with non-ASCII content
*must* (according to the RFCs:-) be encoded to fulfill the 7-bit-only
requirement, while the body may in some cases be unencoded 8-bit,
labeled via Content-Type plus a 'Content-Transfer-Encoding: 8bit' header.
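
To make the structure concrete: an RFC 2047 encoded-word has the form
=?charset?encoding?encoded-text?=, where the encoding is Q (the header
variant of quoted-printable) or B (Base64). What follows is only a
minimal Python sketch of decoding one such word by hand. The function
name decode_encoded_word and the regular expression are illustrative
assumptions, not taken from Pan or from the RFC text, and a real client
would need more error handling than this.

```python
import base64
import quopri
import re

# =?charset?encoding?encoded-text?=  (encoding is Q or B, case-insensitive).
# This pattern is a simplification for illustration, not the RFC grammar.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def decode_encoded_word(word: str) -> str:
    """Decode a single encoded-word into text (hypothetical helper)."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"not an encoded-word: {word!r}")
    charset, encoding, payload = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    # The bytes *are* in the named charset; only the transport was encoded.
    return raw.decode(charset)


# The same five iso-8859-1 bytes, once Q-encoded and once Base64-encoded.
print(decode_encoded_word("=?iso-8859-1?Q?sk=E4rm?="))
print(decode_encoded_word("=?iso-8859-1?B?c2vkcm0=?="))
```

Both calls should give the same word, which is the point: the charset
label says what the bytes mean, and Q or B only says how they were made
7-bit clean.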
 
>The problem with implementing this is that because it's newer and less
>formal, various implementations have minor or not so minor differences, and
>aren't necessarily entirely interoperable and compatible with one another.

Hm, well, it's a very simple standard, though I think I recall seeing
some mention of some client getting the white-space/word-break handling
wrong - that's just a plain bug though.
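
For reference, the white-space rule in question is that white space
between two adjacent encoded-words is dropped on display, while white
space between an encoded-word and ordinary text is kept. A small sketch
using the email.header module from Python's standard library shows the
expected behaviour. The helper name display is made up for this example,
and this says nothing about how Pan or any other client implements it.

```python
from email.header import decode_header, make_header


def display(value: str) -> str:
    """Return the text a reader should show for a raw header value."""
    return str(make_header(decode_header(value)))


# White space between two adjacent encoded-words is ignored, so a word
# split across two encoded-words is put back together.
print(display("=?iso-8859-1?q?sk=E4?= =?iso-8859-1?q?rm?="))   # skärm

# White space between an encoded-word and plain text is significant.
print(display("=?iso-8859-1?q?sk=E4rm?= 1440x900"))            # skärm 1440x900
print(display("1440x900 =?iso-8859-1?q?sk=E4rm?="))            # 1440x900 skärm
```

A client that keeps the space in the first case, or drops it in the
other two, has the kind of bug mentioned above.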

>In fact, according to what I've read (being I've never had reason to
>personally find out), it's not uncommon at all for various "foreign"
>newsgroups and mailing lists to omit the ISO-xxxx or whatever bit entirely,
>and simply assume that all headers are encoded in the "native" charset.

True, and not only that, we may even have "native" charset in the body
without proper Content-Type and Content-Transfer-Encoding headers - the
horror!:-) But I think both this and the common non-2047-compliance of
headers in News is due to a) (as you mention) it's largely superfluous
since the language used in the newsgroup is an implicit indication of
what charset to expect, b) no significant part of Usenet has (or had)
any problems with 8-bit headers or content, and last but not least c)
there has never been a formal standardization on the use of MIME for
Usenet messages (it could probably be argued that there has never been a
formal standardization of the format of Usenet messages at all...). Item
c) might change (see (currently) draft-ietf-usefor-usefor-12.txt), but
I'm not holding my breath - the working group can soon celebrate its
10th anniversary without having managed to produce an RFC.:-)
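
Since unlabelled 8-bit headers are a fact of life, a reader has to guess.
One common heuristic, sketched here in Python purely as an assumption (it
is not claimed to be what Pan does), is to try UTF-8 first, because
non-ASCII text in an 8-bit charset is rarely valid UTF-8 by accident, and
otherwise fall back to a charset chosen by the user or implied by the
newsgroup. The function name decode_unlabelled and the iso-8859-1 default
are hypothetical.

```python
def decode_unlabelled(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess the text of a raw 8-bit header that carries no charset label."""
    try:
        # Strict decoding: any invalid UTF-8 sequence raises an error.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the "native" charset; never fail on odd bytes.
        return raw.decode(fallback, errors="replace")


print(decode_unlabelled(b"HJ\xc4LP"))     # not valid UTF-8, read as iso-8859-1
print(decode_unlabelled(b"p\xc3\xa5"))    # valid UTF-8
print(decode_unlabelled(b"Fedora Core5")) # plain ASCII is the same either way
```

The fallback parameter is where a per-group or per-server character
encoding setting would plug in.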

>With that background, it's not unexpected that pan would have problems in
>the area.  Even the best clients will have occasional problems in this area,
>due as I said to the immature/incomplete standardization and resulting
>partially incompatible implementations.

Per above, I would say that it's not a matter of immature/incomplete
standardization of the "MIME header format" (RFC 2047), but of
immature/incomplete standardization of the Usenet message format in
general. I.e. you can't really claim that a News reader is buggy for not
being MIME-compliant - but on the other hand a good reader (like Pan) is
expected to be able to deal with all kinds of non-standard encodings for
"binary" message bodies, so it might as well do the comparatively
straightforward MIME thing too (even though it's quite unusual for
"binaries" in News).

Anyway, FWIW, Pan (0.122) seems to have no problems decoding headers
such as

Subject:
 =?iso-8859-1?q?Re:_19"_sk=E4rm_1440x900_p=E5_Linux_Fedora_Core5_-_HJ=C4LP!!!?=

(from <address@hidden>), without me
telling it anything in particular. This is just "simple" 8-bit
iso-8859/x though, things may be different with multibyte charsets (I
can't determine that, since I have no way of knowing what the result of
the decoding is supposed to look like). Well, utf-8, which uses two bytes
per character for the non-ascii part of iso-8859/x, seems to work fine
too, but of course those are still characters that fit in a single byte
in iso-8859/x. And there we get nice stuff like

Subject: Re: =?utf-8?Q?Skapa_filsystem_f?=
 =?utf-8?Q?=C3=B6r_DVD_med_filer_p?=
 =?utf-8?Q?=C3=A5_4_GB?=
From: =?utf-8?Q?Daniel_Sp=C3=A5ngberg?= <address@hidden>

(from <address@hidden>) -
quoted-*printable* is really a joke at this point, though admittedly
that Subject line demonstrates a particularly moronic way of doing the
encoding.
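
For anyone who wants to check what the sample headers above are supposed
to look like, they can be run through the email.header module in Python's
standard library. This is only a convenience sketch: the helper name
decode_value is made up, the header values are copied from the two
articles quoted above, and it says nothing about how Pan decodes them.

```python
from email.header import decode_header, make_header


def decode_value(value: str) -> str:
    """Decode every encoded-word in a raw header value."""
    return str(make_header(decode_header(value)))


latin1_subject = (
    '=?iso-8859-1?q?Re:_19"_sk=E4rm_1440x900_p=E5_Linux_Fedora_Core5'
    '_-_HJ=C4LP!!!?='
)
# Folded header: each continuation line starts with a space.
utf8_subject = (
    "Re: =?utf-8?Q?Skapa_filsystem_f?=\n"
    " =?utf-8?Q?=C3=B6r_DVD_med_filer_p?=\n"
    " =?utf-8?Q?=C3=A5_4_GB?="
)
utf8_from = "=?utf-8?Q?Daniel_Sp=C3=A5ngberg?= <address@hidden>"

print(decode_value(latin1_subject))
print(decode_value(utf8_subject))
print(decode_value(utf8_from))

# The same letter costs one byte in iso-8859-1 and two in utf-8, hence
# "=E5" in the first header and "=C3=A5" in the others.
a_ring = "\u00e5"
print(len(a_ring.encode("iso-8859-1")), len(a_ring.encode("utf-8")))
```

Note that the three encoded-words of the second Subject are joined
without any space between them, which is what puts the words split
mid-way ("f" + "ör", "p" + "å") back together.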

--Per Hedeland



