From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 25 Sep 2014 18:19:20 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2

Zoltán, thanks for your comments on this subject. Some thoughts and suggestions:

> - what should you do if you encounter an invalid UTF-8 opcode

Do whatever plain 'grep' does, which is what the glibc regular expression matcher does. If I recall correctly, an encoding error in the pattern matches the same encoding error in the string. It shouldn't be that complicated.
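For instance, here is a quick way to check that with the glibc matcher (this assumes the en_US.UTF-8 locale is installed; if the recollection above is right, it prints "match", because the stray 0xC3 byte in the pattern matches the same stray byte in the subject):

  #include <locale.h>
  #include <regex.h>
  #include <stdio.h>

  int
  main (void)
  {
    if (!setlocale (LC_ALL, "en_US.UTF-8"))
      return 2;

    /* Both the pattern and the subject contain a lone 0xC3 byte,
       a UTF-8 lead byte with no continuation byte, i.e. an
       encoding error.  */
    regex_t re;
    if (regcomp (&re, "a\xC3z", 0) != 0)
      return 2;
    int rc = regexec (&re, "a\xC3z", 0, NULL, 0);
    printf ("%s\n", rc == 0 ? "match" : "no match");
    regfree (&re);
    return 0;
  }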

> Everybody has different opinion about handling invalid UTF opcodes

I doubt whether users would care all that much, so long as the default is reasonable. We don't get complaints about it with 'grep', anyway. But if it's a real problem in the PCRE world, you could provide compile-time or run-time options to satisfy the different opinions.

> everybody would suffer this performance regression, including those, who pass
> valid UTF strings.

I don't see why. libpcre can continue with its current implementation, for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; that's not a problem. The problem is the case where users pass possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK. libpcre has a slow implementation for this case, and this slow implementation's performance should be improvable without affecting the performance for the PCRE_NO_UTF8_CHECK case.
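To make the two cases concrete, here is a rough sketch using the classic PCRE 8.x API (the pattern and subject are placeholders):

  #include <pcre.h>
  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    const char *err;
    int erroffset, ovector[30];
    pcre *re = pcre_compile ("foobar", PCRE_UTF8, &err, &erroffset, NULL);
    if (!re)
      return 2;

    const char *subject = "binary data that might contain foobar";
    int len = strlen (subject);

    /* Caller guarantees valid UTF-8, so libpcre skips its validity
       scan.  This path is already fast and must not regress.  */
    int rc1 = pcre_exec (re, NULL, subject, len, 0,
                         PCRE_NO_UTF8_CHECK, ovector, 30);

    /* No guarantee from the caller, so libpcre scans the whole
       subject for UTF-8 validity on every call.  This is the slow
       path that should be improvable on its own.  */
    int rc2 = pcre_exec (re, NULL, subject, len, 0, 0, ovector, 30);

    printf ("%d %d\n", rc1, rc2);
    pcre_free (re);
    return 0;
  }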

> * The best solution is multi-threaded grepping

That would chew up CPU resources unnecessarily by requiring two passes over the input: one for checking UTF-8, the other for doing the actual match. Granted, it might be faster in real time than what we have now, but overall it'd probably be more expensive (e.g., more energy consumption), so this doesn't sound promising.

> * The other solution is improving PCRE survivability: if the buffer passed to
> PCRE has at least one zero character code before the invalid input buffer, and
> maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the
> buffer, we could guarantee that PCRE does not crash and PCRE does not enter
> infinite loops. Nothing else is guaranteed
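For concreteness, that scheme would look roughly like the sketch below (pad_subject is a hypothetical helper, not an existing libpcre feature; the byte counts follow the figures quoted above):

  #include <stdlib.h>
  #include <string.h>

  /* Copy a possibly-invalid subject into a buffer with one zero byte
     before it and six zero bytes after it, so a matcher that walks
     off a truncated multi-byte sequence reads a zero byte instead of
     unmapped memory.  The caller would pass buf + 1 and len to
     pcre_exec with PCRE_NO_UTF8_CHECK.  */
  static char *
  pad_subject (const char *data, size_t len)
  {
    char *buf = malloc (1 + len + 6);
    if (!buf)
      return NULL;
    buf[0] = 0;                      /* one zero byte before */
    memcpy (buf + 1, data, len);
    memset (buf + 1 + len, 0, 6);    /* six zero bytes after */
    return buf;                      /* subject starts at buf + 1 */
  }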

That doesn't sound like a win, I'm afraid. The use case that prompted this bug report is someone using 'grep -r' to search for strings like 'foobar' in binary data, and this use case would not work with this suggested solution.


I'm hoping that the recent set of changes to 'grep' lessens the urgency of improving libpcre. On my platform (Fedora 20 x86-64), Jim Meyering's benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep -P is 6.4x slower than plain grep, and that with the latest experimental grep (including the patches I just posted in <http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep. So it's plausible that the latest set of fixes is good enough, in the sense that, sure, PCRE is slower, but it's always been slower, and if that used to be good enough, it should still be good enough.




