bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

From:	Vincent Lefevre
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Fri, 12 Sep 2014 10:20:08 +0200
User-agent:	Mutt/1.5.23-6361-vl-r59709 (2014-07-25)

On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >Things could be done in grep:
> >
> >1. Ignore -P when the pattern would have the same meaning without -P
> >    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> >    at least for the simplest cases).
> >
> >2. Call PCRE in the C locale when this is equivalent.
> 
> I had already considered these ideas along with several others, but they
> would require grep to parse and analyze the Perl regular expression.  I
> don't know the PCRE syntax and it would take some time to write a parser.
> And even if I wrote one, the next PCRE release would likely change the
> syntax.  It sounds very painful to maintain.

I think that (1) is rather simple, even though optimization could
be missed on some patterns: ERE and PCRE have a large equivalent
subclass. The pattern could be examined left to right and would
consist of:
  - Normal characters.
  - ".", "^" at the beginning, "$" (alone) at the end.
  - [] with normal characters inside.
  - "*", "+", "?", "{...}" form not followed by one of "*+?{".
  - "|" and "(" not followed by one of these 4 characters.
  - "\" followed by one of ".^$[*+?{".
  - Some "\" + letter sequences could be recognised as well.

Something like that (I haven't checked carefully). There could be
another option to allow such an optimization or not.

> >3. Transform invalid bytes to null bytes in-place before the PCRE
> >    call. This changes the current semantic, but:
> >    * the semantic on invalid bytes has never been specified, AFAIK;
> >    * the best *practical* behavior may not be the current one
> 
> As we've already discussed, this would be incompatible with how invalid
> bytes are treated by other matchers.

The same thing could be done with other matchers (in an optional way).

> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish). It's a real kludge that will
> bite users.

But this is already the case:

$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches

The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".

> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)

It would be a bug in PCRE. Parsing UTF-8 is standard. This is
sumarized by:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(from the Linux utf-8(7) man page), everything else being invalid.

Note that the pcre_exec(3) man page even say:

    PCRE_NO_UTF8_CHECK  Do not check the subject for UTF-8 validity

assuming that the check can be done on the user's side, i.e. in a
standard way.

> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.

I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.

-- 
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

[Prev in Thread]

Current Thread

[Next in Thread]

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Vincent Lefevre, 2014/09/11
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/11
  - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Vincent Lefevre <=
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/12
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Vincent Lefevre, 2014/09/12
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/12
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/16
  - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Jim Meyering, 2014/09/17
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/17
  - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Norihiro Tanaka, 2014/09/17
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Eric Blake, 2014/09/17
    - bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/17
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/18

Prev by Date: bug#18455: grep 2.20 perl-regexp: invalid UTF-8 byte sequence in input
Next by Date: bug#18266: handling bytes not part of the charset, and other garbage
Previous by thread: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Next by thread: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Index(es):
- Date
- Thread