emacs-devel

Re: ".*utf\\(-?8\\)\\>" versus ".*[._]utf" versus ".*@euro\\>"


From: Paul Eggert
Subject: Re: ".*utf\\(-?8\\)\\>" versus ".*[._]utf" versus ".*@euro\\>"
Date: Fri, 21 Dec 2001 15:48:03 -0800 (PST)

> From: Dave Love <address@hidden>
> Date: 21 Dec 2001 15:15:25 +0000
> 
> >>>>> Paul Eggert writes:
> 
>  > The regular expression ".*utf\\(-?8\\)\\>" in
>  > locale-charset-language-names seems to be inconsistent with the
>  > regular expression ".*[._]utf" in locale-preferred-coding-systems.
>  > Shouldn't one or the other regular expression (or both) be changed?
> 
>  > I think the regular expression should contain [._]; I'm not so sure
>  > about the \\(-?8\\)\\> part.
> 
> The utf-8 part is consistent with the other entries, isn't it?

No, the preceding entry ".*@euro\\>" has a delimiter, and the other
entries (e.g. ".*8859[-_]?1\\>") are special cases because ISO 8859
locale names in practice could have almost anything before the '8859'.

> I assume it's appropriate to match a specification of simply `utf-8'
> to set up the generic utf-8 language environment, like `iso-8859-1' & al.

I've never seen a locale by that name, and I doubt whether we'll run
into one.  Locale names like 'iso_8859_1' are still around for
backward compatibility reasons, but modern locale names give more than
just the character encoding.

> I've seen suggestions that `utf' is sometimes used as a synonym for
> `utf-8'; obviously I should have noted the source.

Likewise.

>  > Also, locale-charset-language-names ends with this:
> 
>  >      (".*@euro\\>" . "Latin-9")
>  >      (".*utf\\(-?8\\)\\>" . "UTF-8")))
> 
>  > Shouldn't the UTF-8 pattern come before the euro pattern and the other
>  > patterns?  It seems to me that the current order mishandles locales
>  > like "address@hidden", which are present on Solaris 8.
> 
> I guess so.  I think I just added it to the end without considering
> the issue.  You're the expert.

Not really, but thanks for the compliment.  How about the following
patch?  It regularizes this stuff.

Should this patch be applied to both the development version and to
the Emacs 21 branch?
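
To illustrate the ordering problem, here is a minimal sketch rather than
the actual lookup code in mule-cmds.el.  The helper and the two
abbreviated alists are hypothetical, the locale name "de_de.utf-8@euro"
is an assumed example of a downcased UTF-8 locale carrying the @euro
modifier, and the euro pattern is assumed to be ".*@euro\\>".

```elisp
;; Hypothetical helper, not part of mule-cmds.el: return the value of the
;; first ALIST entry whose regexp matches at the start of LOCALE.
(defun sketch-first-locale-match (locale alist)
  (let ((case-fold-search nil)
        (result nil))
    (while (and alist (null result))
      ;; string-match returns the leftmost match start, so 0 means the
      ;; regexp matches at the start of the locale name.
      (when (eq (string-match (car (car alist)) locale) 0)
        (setq result (cdr (car alist))))
      (setq alist (cdr alist)))
    result))

;; Abbreviated copy of the old ordering: UTF-8 pattern last.
(defconst sketch-old-order
  '((".*8859[-_]?15\\>" . "Latin-9")
    (".*@euro\\>" . "Latin-9")
    (".*utf\\(-?8\\)\\>" . "UTF-8")))

;; Abbreviated copy of the proposed ordering: UTF-8 pattern first.
(defconst sketch-new-order
  '((".*utf[-_]?8\\>" . "UTF-8")
    (".*8859[-_]?15\\>" . "Latin-9")
    (".*@euro\\>" . "Latin-9")))

(sketch-first-locale-match "de_de.utf-8@euro" sketch-old-order) ; => "Latin-9"
(sketch-first-locale-match "de_de.utf-8@euro" sketch-new-order) ; => "UTF-8"
```

With the old order the euro pattern wins before the UTF-8 pattern is
ever tried; with the UTF-8 pattern first, the same locale maps to UTF-8.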

2001-12-12  Paul Eggert  <address@hidden>

        * international/mule-cmds.el (locale-charset-language-names):
        Put UTF-8 pattern first, so that it dominates the @euro pattern.
        (locale-preferred-coding-systems): For consistency with
        locale-charset-language-names, put UTF-8 first and use the
        same UTF-8 pattern.  For internal consistency, append \> to
        all patterns.

--- mule-cmds.el        Wed Dec 19 15:21:44 2001
+++ /tmp/mule-cmds.el   Fri Dec 21 15:46:08 2001
@@ -1854,15 +1854,15 @@ If the language name is nil, there is no
 
 (defconst locale-charset-language-names
   (purecopy
-   '((".*8859[-_]?1\\>" . "Latin-1")
+   '((".*utf[-_]?8\\>" . "UTF-8")
+     (".*8859[-_]?1\\>" . "Latin-1")
      (".*8859[-_]?2\\>" . "Latin-2")
      (".*8859[-_]?3\\>" . "Latin-3")
      (".*8859[-_]?4\\>" . "Latin-4")
      (".*8859[-_]?9\\>" . "Latin-5")
      (".*8859[-_]?14\\>" . "Latin-8")
      (".*8859[-_]?15\\>" . "Latin-9")
-     (".*@euro\\>" . "Latin-9")
-     (".*utf\\(-?8\\)\\>" . "UTF-8")))
+     (".*@euro\\>" . "Latin-9")))
   "List of pairs of locale regexps and charset language names.
 The first element whose locale regexp matches the start of a downcased locale
 specifies the language name whose charsets corresponds to that locale.
@@ -1871,11 +1871,11 @@ the language name that would otherwise b
 
 (defconst locale-preferred-coding-systems
   (purecopy
-   '(("ja.*[._]euc" . japanese-iso-8bit)
-     ("ja.*[._]jis7" . iso-2022-jp)
-     ("ja.*[._]pck" . japanese-shift-jis)
-     ("ja.*[._]sjis" . japanese-shift-jis)
-     (".*[._]utf" . utf-8)))
+   '((".*[._]utf[-_]?8\\>" . utf-8)
+     ("ja.*[._]euc\\>" . japanese-iso-8bit)
+     ("ja.*[._]jis7\\>" . iso-2022-jp)
+     ("ja.*[._]pck\\>" . japanese-shift-jis)
+     ("ja.*[._]sjis\\>" . japanese-shift-jis)))
   "List of pairs of locale regexps and preferred coding systems.
 The first element whose locale regexp matches the start of a downcased locale
 specifies the coding system to prefer when using that locale.")
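
As a rough check of what the tightened UTF-8 pattern in
locale-preferred-coding-systems accepts, the sketch below compares the
old and new regexps.  The helper and variable names are hypothetical,
and the locale names are assumed examples, not names taken from any
particular system.

```elisp
;; Old and new UTF-8 patterns from locale-preferred-coding-systems.
(defconst sketch-old-utf ".*[._]utf")
(defconst sketch-new-utf ".*[._]utf[-_]?8\\>")

;; Hypothetical helper: non-nil if REGEXP matches at the start of LOCALE.
(defun sketch-matches-start-p (regexp locale)
  (let ((case-fold-search nil))
    (eq (string-match regexp locale) 0)))

;; Each element is (LOCALE OLD-RESULT NEW-RESULT).
(defun sketch-compare-utf-patterns (locales)
  (mapcar (lambda (locale)
            (list locale
                  (sketch-matches-start-p sketch-old-utf locale)
                  (sketch-matches-start-p sketch-new-utf locale)))
          locales))

(sketch-compare-utf-patterns
 '("en_us.utf-8" "en_us.utf8" "en_us.utf-16" "utf-8"))
;; => (("en_us.utf-8" t t)
;;     ("en_us.utf8" t t)
;;     ("en_us.utf-16" t nil)
;;     ("utf-8" nil nil))
```

The new pattern still accepts both spellings of UTF-8 after a "." or "_"
delimiter, rejects other encodings whose names merely begin with "utf",
and, like the old one, does not match a bare "utf-8".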


