bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

From:	Vincent Lefevre
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Fri, 12 Sep 2014 03:24:49 +0200
User-agent:	Mutt/1.5.23-6361-vl-r59709 (2014-07-25)

With the patch that fixes bug 18266, grep -P works again on binary
files (with invalid UTF-8 sequences), but it is now significantly
slower than old versions (which could yield undefined behavior).

Timings with the Debian packages on my personal svn working copy
(binary + text files):

2.18-2   0.9s with -P, 0.4s without -P
2.20-3  11.6s with -P, 0.4s without -P

On this example, that's a 13x slowdown! Though the performance issue
would better be fixed in libpcre3, I suppose that it is not so simple
and won't occur any time soon. Things could be done in grep:

1. Ignore -P when the pattern would have the same meaning without -P
   (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
   at least for the simplest cases).

2. Call PCRE in the C locale when this is equivalent.

3. Transform invalid bytes to null bytes in-place before the PCRE
   call. This changes the current semantic, but:
   * the semantic on invalid bytes has never been specified, AFAIK;
   * the best *practical* behavior may not be the current one
     (I personally prefer to be able to match invalid bytes, just
     like one can match top-bit-set characters in the C locale, and
     seeing such invalid bytes as equivalent to null bytes would
     not be a problem for most users, IMHO -- things can also be
     configurable).

-- 
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

[Prev in Thread]

Current Thread

[Next in Thread]

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Vincent Lefevre <=
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/11
  - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Vincent Lefevre, 2014/09/12
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/12
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Vincent Lefevre, 2014/09/12
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/12
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/16
  - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Jim Meyering, 2014/09/17
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/17
  - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Norihiro Tanaka, 2014/09/17
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Eric Blake, 2014/09/17

Prev by Date: bug#18266: handling bytes not part of the charset, and other garbage
Next by Date: bug#18266: handling bytes not part of the charset, and other garbage
Previous by thread: bug#17494: "GREP_OPTIONS" allows a user to break shell script
Next by thread: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Index(es):
- Date
- Thread