bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps

bug-gnu-emacs

From:	YAMAMOTO Mitsuharu
Subject:	bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
Date:	Sat, 27 Jun 2009 10:30:10 +0900
User-agent:	Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8) MULE/5.0 (SAKAKI)

>>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said:

>> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
>> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
>> Cc: 
>> 
>> The following results look inconsistent:
>> 
>> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>> => 0
>> (string-match (string-to-multibyte "\x80") "\x80")
>> => nil
>> 
>> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>> => nil
>> (string-match (string-to-multibyte "[\x80]") "\x80")
>> => 0

> Please tell why you think they are inconsistent.

I thought there was no room for argument about their inconsistency with
respect to the specification of "[...]" in regexps: "[c]" should match
exactly what "c" matches.

> More importantly, please show real-life examples of code or
> situations where this gets in your way.

If you decode some data containing invalid (undecodable) byte
sequences using a coding system such as utf-8, then such sequences are
embedded in the decoded result as eight-bit characters in multibyte
form.  You could detect particular such sequences by searching the
decoded result for a "character alternative" regexp (or its multibyte
form), if that worked.
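
As a rough illustration of that situation, here is a minimal sketch that
decodes a byte string as utf-8 and then locates the first eight-bit
character in the result.  It deliberately avoids regexps and scans with
`char-charset' instead, since the regexp behavior is the very thing in
question.  The function name is hypothetical and is not part of Emacs.

```elisp
;; Hypothetical helper for illustration only; not part of Emacs.
(defun my-first-raw-byte-position (bytes)
  "Decode unibyte string BYTES as utf-8.
Return the index of the first eight-bit (undecodable) character in the
decoded result, or nil if there is none."
  (let ((decoded (decode-coding-string bytes 'utf-8))
        (pos 0)
        (found nil))
    ;; Undecodable bytes end up as characters of the `eight-bit' charset.
    (while (and (not found) (< pos (length decoded)))
      (if (eq (char-charset (aref decoded pos)) 'eight-bit)
          (setq found pos)
        (setq pos (1+ pos))))
    found))

;; \x80 is a lone continuation byte, so it is not valid utf-8.
(my-first-raw-byte-position "a\x80z")
```

A regexp search would be preferable when looking for particular
sequences rather than any eight-bit character, which is where the
character alternative matters.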

Further examples that look inconsistent:

  (string-match (string-to-multibyte "[\x80\x81]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xbf]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\x80"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xbf"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xc0"))
  => nil

> This area is full of subtleties and gotchas, and in general the
> current code does what it does because it needs to cater to many
> different practical situations.

> There could still be bugs, of course.

Yeah.  I found another suspected bug in this area:

  (string-match "[[:unibyte:]]" "\x80")
  => nil
  (string-match "[[:unibyte:]]" (string-to-multibyte "\x80"))
  => nil

                                     YAMAMOTO Mitsuharu
                                mituharu@math.s.chiba-u.ac.jp
