bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Vincent Lefevre
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Thu, 18 Dec 2014 14:45:58 +0100
User-agent:	Mutt/1.5.23-6371-vl-r75100 (2014-11-04)

Sorry for the late reply.

On 2014-11-29 11:58:48 +0900, Norihiro Tanaka wrote:
> On Fri, 28 Nov 2014 16:50:29 +0100
> Vincent Lefevre <address@hidden> wrote:
> > What matters is whether a sequence corresponds to a valid UTF-8
> > encoded Unicode character. My patch ensures that pcre_exec is called
> > on a string with only such characters, which implies that this is
> > also valid UTF-8 for PCRE (whether Unicode validity is also considered
> > in valid_utf8() or not). So, there's no valid reason why grep would
> > crash under such a condition.
> 
> It seems that PCRE treats e.g. the following characters as invalid.  It means
> we should not pass these characters into pcre_exec with the PCRE_NO_UTF8_CHECK
> option.
> 
>   0xE0 0xC2 0xFF
>   0xED 0xA0 0xFF
>   0xF0 0xBF 0xFF 0xFF

If I'm not mistaken, these first three are also treated as invalid by
my patch (and should be treated as invalid by any tool).

>   0xF4 0xBF 0xBF 0xBF

(corresponding to U+13FFFF).
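
As a quick check of that value, here is a minimal Python sketch of the arithmetic. The helper name decode_4byte is made up for illustration. It takes 3 payload bits from the lead byte and 6 from each of the three continuation bytes, and does no validation at all.

```python
def decode_4byte(seq: bytes) -> int:
    """Extract the payload bits of a 4-byte UTF-8-style sequence.

    Illustrative helper only: it does not check that the bytes are
    valid UTF-8, it just reassembles the 3 + 6 + 6 + 6 payload bits.
    """
    if len(seq) != 4:
        raise ValueError("expected exactly 4 bytes")
    return (
        ((seq[0] & 0x07) << 18)    # lead byte 11110xxx: low 3 bits
        | ((seq[1] & 0x3F) << 12)  # continuation 10xxxxxx: low 6 bits
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )


value = decode_4byte(b"\xF4\xBF\xBF\xBF")
print(f"U+{value:X}")          # the value quoted above
print(value > 0x10FFFF)        # beyond the last Unicode code point
```

The result is above U+10FFFF, which is why the sequence has to be rejected under RFC 3629 even though its bit layout looks like a regular 4-byte form.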

Well, I followed some comment in the grep source, which is currently
incorrect.

pcreunicode(3) specifies that it follows RFC 3629, and that only
values in the range U+0 to U+10FFFF, excluding the surrogate area,
are allowed. I'll try to update my patch. But IMHO, it would be
better to get PCRE improved, and I had opened a bug:

  http://bugs.exim.org/show_bug.cgi?id=1554
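
To make the RFC 3629 rule concrete, here is a rough Python sketch of a validator that follows the byte-range syntax given in that RFC. It is neither the patch discussed in this thread nor PCRE's valid_utf8(). The function name is hypothetical and the code is only meant to show which of the four sequences quoted above get rejected, and why.

```python
def is_valid_utf8_rfc3629(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 per the RFC 3629 syntax.

    Sketch only. The allowed range of the first continuation byte depends
    on the lead byte, which is what excludes overlong forms, the surrogate
    area (U+D800..U+DFFF) and anything above U+10FFFF.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                      # single-byte (ASCII)
            i += 1
            continue
        # (number of continuation bytes, allowed range of the first one)
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF    # excludes overlong 3-byte forms
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F    # excludes surrogates
        elif 0xE1 <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF    # excludes overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F    # excludes values above U+10FFFF
        else:
            return False                    # 0x80..0xC1 and 0xF5..0xFF
        if i + need >= n:
            return False                    # truncated sequence
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):        # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True


samples = [
    b"\xE0\xC2\xFF",
    b"\xED\xA0\xFF",
    b"\xF0\xBF\xFF\xFF",
    b"\xF4\xBF\xBF\xBF",
    b"\xF4\x8F\xBF\xBF",   # U+10FFFF, the last valid code point
]
for s in samples:
    print(s.hex(" "), is_valid_utf8_rfc3629(s))
```

Only the last sample should be accepted. Python's own strict decoder (bytes.decode("utf-8")) applies the same limits, so it can be used as a cross-check.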

BTW,

  printf "\xF4\xBF\xBF\xBF\n" | grep .

finds a match, and this appears to be a bug (grep should follow
the current standard).

-- 
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
