Re: undecided vs utf-8
From: Lars Magne Ingebrigtsen
Subject: Re: undecided vs utf-8
Date: Fri, 05 Nov 2010 03:32:02 +0100
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/24.0.50 (gnu/linux)
Kenichi Handa <address@hidden> writes:
> It's perhaps because you are in some of iso-8859-1 locale.
I don't think I am, but I might be wrong. There are so many locale
variables, but I always try to put my machines into "C" locale.
> I don't want to add such a heuristic in
> decode-coding-string/region (the lowest functions available
> from Lisp). Please note that above sequence is also valid
> as Big5. If people are in Big5 locale, it's hard to answer
> which of utf-8 or big5 is preferred unless we implement NLP
> system.
I don't know what the big5 encoding looks like, but when it comes to
iso-8859-1 vs utf-8, then there are many utf-8 strings that are valid
iso-8859-1 strings, but there are few iso-8859-1 strings that are valid
utf-8 strings. Therefore it seems to make sense to prefer utf-8 over
iso-8859-1. Perhaps.
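
To illustrate the asymmetry, here is a minimal sketch. The helper name
`my-valid-utf-8-p' is hypothetical and not part of Emacs; it assumes
that bytes which cannot be decoded as utf-8 come back as raw-byte
characters in the `eight-bit' charset.

```elisp
(defun my-valid-utf-8-p (bytes)
  "Return non-nil if the unibyte string BYTES decodes cleanly as UTF-8.
Hypothetical helper for illustration only."
  ;; Invalid UTF-8 sequences survive decoding as raw bytes, which
  ;; belong to the `eight-bit' charset.
  (not (memq 'eight-bit
             (find-charset-string
              (decode-coding-string bytes 'utf-8)))))

;; The same text ("blåbær") encoded two ways.
(defvar my-utf-8-bytes
  (encode-coding-string "bl\u00e5b\u00e6r" 'utf-8))
(defvar my-latin-1-bytes
  (encode-coding-string "bl\u00e5b\u00e6r" 'iso-8859-1))

(my-valid-utf-8-p my-utf-8-bytes)    ; expected to be non-nil
(my-valid-utf-8-p my-latin-1-bytes)  ; expected to be nil
```

The iso-8859-1 bytes fail because 0xE5 starts a three-byte utf-8
sequence, and the byte that follows is not a continuation byte.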
> Perhaps making an upper layer function that will accept a
> list of preferred coding systems will be good; something
> like this.
>
> (defun detect-and-decode-coding-string (str preferred)
>   (let ((detected (detect-coding-string str))
>         decided)
>     (while (and preferred (not decided))
>       (if (memq (car preferred) detected)
>           (setq decided (car preferred))
>         (setq preferred (cdr preferred))))
>     (decode-coding-string str (or decided (car detected)))))
Well, this is about `undecided', and the C layer does DWIM-ish
processing when you ask it to decode `undecided', doesn't it?
The use case that made me look into this -- erc -- is somewhat special.
The irc protocol does no charset tagging, and some clients send some
charsets, and some send others, which is why erc uses `undecided' as the
default coding system. Typically on a channel you'll see somebody using
a local (iso-8859-* is popular) charset, and others using utf-8.
Perhaps the fix here isn't to do anything with `undecided' per se, but
just fix erc. It's trivial enough -- just have the default be, say,
`undecided-or-utf-8', and then handle that by running
`detect-coding-string' over it, see whether it's utf-8, and then either
use that or pass `undecided' down into the decoding functions.
I don't know. What do you think?
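
The `undecided-or-utf-8' idea could be sketched roughly as follows.
This is only an illustration: the function name is made up, nothing
like it exists in erc at this point, and it assumes that
`detect-coding-string' lists utf-8 (possibly with an eol suffix) among
its candidates for valid utf-8 input.

```elisp
(defun my-decode-undecided-or-utf-8 (bytes)
  "Decode BYTES as utf-8 if detection allows it, else as `undecided'.
Hypothetical helper, not part of erc."
  (let ((candidates
         ;; Strip eol variants such as `utf-8-unix' before comparing.
         (mapcar #'coding-system-base
                 (detect-coding-string bytes))))
    (decode-coding-string bytes
                          (if (memq 'utf-8 candidates)
                              'utf-8
                            'undecided))))

;; Example call on utf-8 encoded input.
(my-decode-undecided-or-utf-8
 (encode-coding-string "bl\u00e5b\u00e6r" 'utf-8))
```

Anything that is not detected as utf-8 still goes through the usual
`undecided' processing, so the behaviour for other charsets should not
change.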
--
(domestic pets only, the antidote for overdose, milk.)
address@hidden * Lars Magne Ingebrigtsen