[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: |
Vincent Lefevre |
Subject: |
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales |
Date: |
Fri, 12 Sep 2014 10:20:08 +0200 |
User-agent: |
Mutt/1.5.23-6361-vl-r59709 (2014-07-25) |
On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >Things could be done in grep:
> >
> >1. Ignore -P when the pattern would have the same meaning without -P
> > (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> > at least for the simplest cases).
> >
> >2. Call PCRE in the C locale when this is equivalent.
>
> I had already considered these ideas along with several others, but they
> would require grep to parse and analyze the Perl regular expression. I
> don't know the PCRE syntax and it would take some time to write a parser.
> And even if I wrote one, the next PCRE release would likely change the
> syntax. It sounds very painful to maintain.
I think that (1) is rather simple, even though optimization could
be missed on some patterns: ERE and PCRE have a large equivalent
subclass. The pattern could be examined left to right and would
consist of:
- Normal characters.
- ".", "^" at the beginning, "$" (alone) at the end.
- [] with normal characters inside.
- "*", "+", "?", "{...}" form not followed by one of "*+?{".
- "|" and "(" not followed by one of these 4 characters.
- "\" followed by one of ".^$[*+?{".
- Some "\" + letter sequences could be recognised as well.
Something like that (I haven't checked carefully). There could be
another option to allow such an optimization or not.
> >3. Transform invalid bytes to null bytes in-place before the PCRE
> > call. This changes the current semantic, but:
> > * the semantic on invalid bytes has never been specified, AFAIK;
> > * the best *practical* behavior may not be the current one
>
> As we've already discussed, this would be incompatible with how invalid
> bytes are treated by other matchers.
The same thing could be done with other matchers (in an optional way).
> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish). It's a real kludge that will
> bite users.
But this is already the case:
$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches
The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".
> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)
It would be a bug in PCRE. Parsing UTF-8 is standard. This is
sumarized by:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
(from the Linux utf-8(7) man page), everything else being invalid.
Note that the pcre_exec(3) man page even say:
PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8 validity
assuming that the check can be done on the user's side, i.e. in a
standard way.
> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.
I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.
--
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/16
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/18