bug-gnu-emacs
bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps


From: YAMAMOTO Mitsuharu
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
Date: Sat, 27 Jun 2009 10:30:10 +0900
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8) MULE/5.0 (SAKAKI)

>>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said:

>> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
>> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
>> Cc: 
>> 
>> The following results look inconsistent:
>> 
>> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>> => 0
>> (string-match (string-to-multibyte "\x80") "\x80")
>> => nil
>> 
>> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>> => nil
>> (string-match (string-to-multibyte "[\x80]") "\x80")
>> => 0

> Please tell why you think they are inconsistent.

I thought there's no room for argument about their inconsistency with
respect to the specification of "[...]" in regexps.

> More importantly, please show real-life examples of code or
> situations where this gets in your way.

If you decode some data containing invalid (undecodable) byte
sequences using a coding system such as utf-8, then such sequences are
embedded in the decoded result as eight-bit characters in multibyte
form.  You can detect particular such sequences by searching for a
"character alternative" regexp (or its multibyte form) in the decoded
result, provided such a search works.
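
As a rough illustration of that use case, here is a minimal sketch that
locates such a byte without relying on a character alternative.  It
assumes that decoding with utf-8 keeps the undecodable byte as a
character of the eight-bit charset, as described above.  The helper
name `my-find-raw-byte' is hypothetical and not part of Emacs.

```elisp
;; Sketch only: `my-find-raw-byte' is a hypothetical helper, not an
;; Emacs function.  It expects a multibyte string, such as the result
;; of `decode-coding-string'.
(defun my-find-raw-byte (string)
  "Return the index of the first eight-bit character in STRING, or nil."
  (let ((i 0)
        (len (length string))
        found)
    (while (and (not found) (< i len))
      ;; Undecodable bytes are assumed to belong to the `eight-bit' charset.
      (if (eq (char-charset (aref string i)) 'eight-bit)
          (setq found i)
        (setq i (1+ i))))
    found))

;; A lone "\x80" is not valid UTF-8, so decoding should keep it as a
;; raw byte following the three ASCII characters.
(my-find-raw-byte (decode-coding-string "abc\x80" 'utf-8))
```

A working "[\x80-\xbf]" search in multibyte form would express the same
check more compactly, which is why the results below matter.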

Further examples that look inconsistent:

  (string-match (string-to-multibyte "[\x80\x81]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xbf]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\x80"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xbf"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xc0"))
  => nil

> This area is full of subtleties and gotchas, and in general the
> current code does what it does because it needs to cater to many
> different practical situations.

> There could still be bugs, of course.

Yeah.  I found another suspected bug in this area:

  (string-match "[[:unibyte:]]" "\x80")
  => nil
  (string-match "[[:unibyte:]]" (string-to-multibyte "\x80"))
  => nil

                                     YAMAMOTO Mitsuharu
                                mituharu@math.s.chiba-u.ac.jp




