From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 25 Sep 2014 18:19:20 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2

Zoltán, thanks for your comments on this subject. Some thoughts and suggestions:

> - what should you do if you encounter an invalid UTF-8 opcode

Do whatever plain 'grep' does, which is what the glibc regular expression matcher does. If I recall correctly, an encoding error in the pattern matches the same encoding error in the string. It shouldn't be that complicated.
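For instance, here is a quick way to check that with the glibc matcher (this assumes the en_US.UTF-8 locale is installed; if the recollection above is right, it prints "match", because the stray 0xC3 byte in the pattern matches the same stray byte in the subject):

  #include <locale.h>
  #include <regex.h>
  #include <stdio.h>

  int
  main (void)
  {
    if (!setlocale (LC_ALL, "en_US.UTF-8"))
      return 2;

    /* Both the pattern and the subject contain a lone 0xC3 byte,
       a UTF-8 lead byte with no continuation byte, i.e. an
       encoding error.  */
    regex_t re;
    if (regcomp (&re, "a\xC3z", 0) != 0)
      return 2;
    int rc = regexec (&re, "a\xC3z", 0, NULL, 0);
    printf ("%s\n", rc == 0 ? "match" : "no match");
    regfree (&re);
    return 0;
  }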

> Everybody has different opinion about handling invalid UTF opcodes

I doubt whether users would care all that much, so long as the default is reasonable. We don't get complaints about it with 'grep', anyway. But if it's a real problem in the PCRE world, you could provide compile-time or run-time options to satisfy the different opinions.

> everybody would suffer this performance regression, including those, who pass
> valid UTF strings.

I don't see why. libpcre can continue with its current implementation, for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; that's not a problem. The problem is the case where users pass possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK. libpcre has a slow implementation for this case, and this slow implementation's performance should be improvable without affecting the performance for the PCRE_NO_UTF8_CHECK case.
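To make the two cases concrete, here is a rough sketch using the classic PCRE 8.x API (the pattern and subject are placeholders):

  #include <pcre.h>
  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    const char *err;
    int erroffset, ovector[30];
    pcre *re = pcre_compile ("foobar", PCRE_UTF8, &err, &erroffset, NULL);
    if (!re)
      return 2;

    const char *subject = "binary data that might contain foobar";
    int len = strlen (subject);

    /* Caller guarantees valid UTF-8, so libpcre skips its validity
       scan.  This path is already fast and must not regress.  */
    int rc1 = pcre_exec (re, NULL, subject, len, 0,
                         PCRE_NO_UTF8_CHECK, ovector, 30);

    /* No guarantee from the caller, so libpcre scans the whole
       subject for UTF-8 validity on every call.  This is the slow
       path that should be improvable on its own.  */
    int rc2 = pcre_exec (re, NULL, subject, len, 0, 0, ovector, 30);

    printf ("%d %d\n", rc1, rc2);
    pcre_free (re);
    return 0;
  }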

> * The best solution is multi-threaded grepping

That would chew up CPU resources unnecessarily by requiring two passes over the input: one for checking UTF-8, the other for doing the actual match. Granted, it might be faster in real time than what we have now, but overall it'd probably be more expensive (e.g., more energy consumption), so this doesn't sound promising.

> * The other solution is improving PCRE survivability: if the buffer passed to
> PCRE has at least one zero character code before the invalid input buffer, and
> maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the
> buffer, we could guarantee that PCRE does not crash and PCRE does not enter
> infinite loops. Nothing else is guaranteed
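For concreteness, that scheme would look roughly like the sketch below (pad_subject is a hypothetical helper, not an existing libpcre feature; the byte counts follow the figures quoted above):

  #include <stdlib.h>
  #include <string.h>

  /* Copy a possibly-invalid subject into a buffer with one zero byte
     before it and six zero bytes after it, so a matcher that walks
     off a truncated multi-byte sequence reads a zero byte instead of
     unmapped memory.  The caller would pass buf + 1 and len to
     pcre_exec with PCRE_NO_UTF8_CHECK.  */
  static char *
  pad_subject (const char *data, size_t len)
  {
    char *buf = malloc (1 + len + 6);
    if (!buf)
      return NULL;
    buf[0] = 0;                      /* one zero byte before */
    memcpy (buf + 1, data, len);
    memset (buf + 1 + len, 0, 6);    /* six zero bytes after */
    return buf;                      /* subject starts at buf + 1 */
  }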

That doesn't sound like a win, I'm afraid. The use case that prompted this bug report is someone using 'grep -r' to search for strings like 'foobar' in binary data, and this use case would not work with this suggested solution.


I'm hoping that the recent set of changes to 'grep' lessens the urgency of improving libpcre. On my platform (Fedora 20 x86-64), Jim Meyering's benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep -P is 6.4x slower than plain grep, and that with the latest experimental grep (including the patches I just posted in <http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep. So it's plausible that the latest set of fixes is good enough, in the sense that, sure, PCRE is slower, but it's always been slower, and if that used to be good enough, it should still be good enough.




