bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Norihiro Tanaka
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Fri, 19 Dec 2014 23:00:38 +0900

On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <address@hidden> wrote:
> > 
> >   0xE0 0xC2 0xFF
> >   0xED 0xA0 0xFF
> >   0xF0 0xBF 0xFF 0xFF
> 
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).

I took them from pcre_valid_utf8(), but I made some mistakes.  The
correct sequences are as follows.

  0xE0 0x9F 0xBF
  0xED 0xA0 0xBF
  0xF0 0x8F 0xBF 0xBF

By the way, they correspond to the following code in pcre_valid_utf8().

    if (c == 0xe0 && (d & 0x20) == 0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR14;
      }

    ........

    if (c == 0xf0 && (d & 0x30) == 0)
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR13;
      }
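
To make the effect of these tests easier to see, here is a minimal
standalone sketch that applies only the four quoted conditions to a lead
byte c and its first continuation byte d.  The function name
check_lead_pair and the enum labels are hypothetical and are not part of
PCRE; the real function also tracks offsets and handles the elided cases.

```c
#include <assert.h>

/* Hypothetical labels for this sketch.  In the excerpt above the same
   cases are reported as PCRE_UTF8_ERR16, ERR14, ERR17 and ERR13.  */
enum lead_check
{
  LEAD_OK = 0,
  LEAD_OVERLONG_3,  /* 0xE0 followed by 0x80..0x9F */
  LEAD_SURROGATE,   /* 0xED followed by 0xA0..0xBF */
  LEAD_OVERLONG_4,  /* 0xF0 followed by 0x80..0x8F */
  LEAD_TOO_LARGE    /* lead above 0xF4, or 0xF4 followed by 0x90..0xBF */
};

/* Apply only the quoted tests.  Assumes the caller has already checked
   that D is a continuation byte (0x80..0xBF).  */
enum lead_check
check_lead_pair (unsigned char c, unsigned char d)
{
  assert ((d & 0xc0) == 0x80);

  if (c == 0xe0 && (d & 0x20) == 0)
    return LEAD_OVERLONG_3;
  if (c == 0xed && d >= 0xa0)
    return LEAD_SURROGATE;
  if (c == 0xf0 && (d & 0x30) == 0)
    return LEAD_OVERLONG_4;
  if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
    return LEAD_TOO_LARGE;
  return LEAD_OK;
}
```

With this sketch, check_lead_pair (0xed, 0xa0) yields LEAD_SURROGATE and
check_lead_pair (0xf0, 0x8f) yields LEAD_OVERLONG_4, while
check_lead_pair (0xf0, 0x90) yields LEAD_OK because 0x90 & 0x30 is nonzero.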

> BTW,
> 
>   printf "\xF4\xBF\xBF\xBF\n" | grep .
> 
> finds a match, and this appears to be a bug (grep should follow
> the current standard).

I agree that this is a bug, as you say.  mbrlen() in glibc returns
(size_t) -1 for that sequence.
