bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Zoltán Herczeg
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 21 Sep 2014 08:46:39 +0200 (CEST)
Hi,
I am the developer of the JIT compiler in PCRE. I regularly follow the
discussions about PCRE and found this comment here on address@hidden:
> There's another way: fix libpcre so that it works on arbitrary binary data,
> without the need for prescreening the data. That's the fundamental problem
> here.
This would require too much effort for no benefit. Reasons:
- What should you do when you encounter an invalid UTF-8 sequence: ignore it?
Decode it to some arbitrary value? For example, what should happen if you find
a stray 0xe9? Does it match \xe9? Everybody has a different opinion about
handling invalid UTF sequences, and this would lead to never-ending arguments
on pcre-dev.
- The bigger problem is performance. Handling invalid UTF sequences requires a
lot of extra checks and kills many optimizations. For example, when we
encounter a 0xc5, we know that the input buffer has at least one more byte, so
we do not check the input buffer size. We also assume that the highest 2 bits
of the second byte are 10, and we do not check this when we decode the
character. Such checks would also kill other optimizations, like the
Boyer-Moore-like search in the JIT. The major problem is that everybody would
suffer this performance regression, including those who pass valid UTF strings.
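To make the cost concrete, here is a minimal sketch of decoding a two-byte
UTF-8 sequence with and without validation. It is not PCRE's code; the
function names decode2_unchecked and decode2_checked are hypothetical and are
only meant to show which checks a decoder for untrusted input would have to
add on every character.

```python
def decode2_unchecked(buf: bytes, i: int) -> int:
    """Decode a 2-byte UTF-8 sequence at buf[i], trusting the input.

    No buffer-length check and no check that the second byte is 10xxxxxx.
    (In C an out-of-range read here would be undefined behaviour; in Python
    it raises IndexError instead.)
    """
    return ((buf[i] & 0x1F) << 6) | (buf[i + 1] & 0x3F)


def decode2_checked(buf: bytes, i: int):
    """Same decode, with the extra checks invalid input would require.

    Returns the code point, or None if the sequence is not a valid
    2-byte UTF-8 sequence.
    """
    if i + 1 >= len(buf):
        return None  # truncated: no second byte in the buffer
    if (buf[i] & 0xE0) != 0xC0:
        return None  # first byte is not a 2-byte lead (110xxxxx)
    if (buf[i + 1] & 0xC0) != 0x80:
        return None  # second byte is not a continuation byte (10xxxxxx)
    cp = ((buf[i] & 0x1F) << 6) | (buf[i + 1] & 0x3F)
    if cp < 0x80:
        return None  # overlong encoding of an ASCII character
    return cp


valid = b"\xc5\x91"   # U+0151
broken = b"\xc5A"     # 0xc5 followed by a non-continuation byte

print(hex(decode2_unchecked(valid, 0)))   # 0x151
print(hex(decode2_unchecked(broken, 0)))  # 0x141, silently wrong
print(decode2_checked(broken, 0))         # None
```

The unchecked version is a single expression; the checked one adds four
branches per character, and that is only for the two-byte case.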
For these reasons, such a change will never happen.
But there are alternatives.
* The best solution is multi-threaded grepping: one thread reads the file data
and replaces or removes invalid UTF-8 sequences, leaving something valid. The
other thread runs PCRE on the filtered data. Alternatively, you can convert
everything to UTF-32 and use pcre32.
* The other solution is improving PCRE survivability: if the buffer passed to
PCRE has at least one zero character code before the invalid input buffer, and
maximum UTF character length - 1 (6 in UTF-8, 1 in UTF-16) zeroes after the
buffer, we could guarantee that PCRE does not crash and does not enter an
infinite loop. Nothing else is guaranteed: for example, if you search for /ab/
and the invalid UTF sequence contains ab, it might not be found (or it might be
found by the interpreter but not by the JIT, or vice versa). If you use pcre32,
there is no need for any extra padding.
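The first alternative can be sketched as a two-thread pipeline. This is only
an illustration under stated assumptions: Python's re module stands in for
PCRE, invalid sequences are replaced with U+FFFD (one possible policy among
many), input is handled line by line, and the names sanitize, reader, matcher
and grep_lines are hypothetical.

```python
import queue
import re
import threading

_DONE = object()  # sentinel that marks the end of the input


def sanitize(line: bytes) -> bytes:
    """Replace every invalid UTF-8 sequence with U+FFFD, giving valid UTF-8."""
    return line.decode("utf-8", errors="replace").encode("utf-8")


def reader(lines, out_q):
    """First thread: read raw lines and hand over only valid UTF-8."""
    for line in lines:
        out_q.put(sanitize(line))
    out_q.put(_DONE)


def matcher(pattern, in_q, hits):
    """Second thread: run the regex engine on the filtered data."""
    rx = re.compile(pattern)  # stand-in for a PCRE pattern compiled in UTF mode
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        # Safe to decode strictly: the reader guarantees valid UTF-8.
        if rx.search(item.decode("utf-8")):
            hits.append(item)


def grep_lines(pattern, lines):
    """Run reader and matcher concurrently; return the matching lines."""
    q = queue.Queue(maxsize=64)  # bounded so the reader cannot run far ahead
    hits = []
    t_read = threading.Thread(target=reader, args=(lines, q))
    t_match = threading.Thread(target=matcher, args=(pattern, q, hits))
    t_read.start()
    t_match.start()
    t_read.join()
    t_match.join()
    return hits


data = [b"xab\n", b"\xe9 stray\n", b"a\xffb\n"]
print(grep_lines("ab", data))  # [b'xab\n']
```

Note that the third line does not match /ab/ after filtering, because the
invalid byte between a and b is replaced rather than removed; removing it
instead would make the line match. That choice is exactly the kind of policy
question raised above.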
Regards,
Zoltan