bug-gnu-emacs

bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps


From: YAMAMOTO Mitsuharu
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
Date: Fri, 24 Jul 2009 10:08:11 +0900
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8) MULE/5.0 (SAKAKI)

>>>>> On Mon, 29 Jun 2009 10:47:30 +0200, Stefan Monnier 
>>>>> <monnier@iro.umontreal.ca> said:

>> It seemed to be too obvious to explain and I hesitated to do that.
>> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
>> character C has no special meaning in either context.

> Yes, it's pretty obvious, thank you.  I haven't had time to look
> deeper, but that part of the code is pretty nasty because it tries
> to be clever about the fact that values between 128-256 can be
> either latin-1 chars and eight-bit-bytes and it tries to be lenient
> about confusion between the two.

Are there any written specifications explaining how the leniency is
supposed to work?

As for documentation, the description below in the elisp info
(Special Characters in Regular Expressions) probably needs to be
updated.

     The beginning and end of a range of multibyte characters must be in
     the same character set (*note Character Sets::).  Thus,
     `"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with
     grave accent) is in the Emacs character set for Latin-1 but the
     character 0x97c (`u' with diaeresis) is in the Emacs character set
     for Latin-2.  (We use Lisp string syntax to write that example,
     and a few others in the next few paragraphs, in order to include
     hex escape sequences in them.)

     If a range starts with a unibyte character C and ends with a
     multibyte character C2, the range is divided into two parts: one
     is `C..?\377', the other is `C1..C2', where C1 is the first
     character of the charset to which C2 belongs.

     You cannot always match all non-ASCII characters with the regular
     expression `"[\200-\377]"'.  This works when searching a unibyte
     buffer or string (*note Text Representations::), but not in a
     multibyte buffer or string, because many non-ASCII characters have
     codes above octal 0377.  However, the regular expression
     `"[^\000-\177]"' does match all non-ASCII characters (see below
     regarding `^'), in both multibyte and unibyte representations,
     because only the ASCII characters are excluded.
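
The last paragraph of that description can be checked with a small
sketch along the following lines.  It assumes an Emacs with Unicode
internals (23 or later) and uses only `string-match'; the sample
strings are made up for illustration and are not taken from the bug
report.  The first form exercises the "C" versus "[C]" assumption for
an ordinary ASCII character only.  Nothing here shows what either form
does for eight-bit bytes, which is the open question in this bug.

```elisp
;; Illustrative only: the sample strings are hypothetical.

;; "C" and "[C]" agree for an ordinary ASCII character.
(list (string-match "a" "xay")
      (string-match "[a]" "xay"))          ; => (1 1)

;; The negated ASCII range finds the first non-ASCII character,
;; here HIRAGANA LETTER A (U+3042), whose code is above octal 0377.
(string-match "[^\000-\177]" "abc\u3042")  ; => 3
```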

                                     YAMAMOTO Mitsuharu
                                mituharu@math.s.chiba-u.ac.jp




