bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Zoltán Herczeg
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 21 Sep 2014 08:46:39 +0200 (CEST)

Hi,

I am the developer of the JIT compiler in PCRE. I frequently check the
discussions about PCRE, and found this comment here on address@hidden:

> There's another way: fix libpcre so that it works on arbitrary binary data, 
> without the need for prescreening
> the data. That's the fundamental problem here. 

This would require too much effort for no benefit, for two reasons:

- What should you do when you encounter an invalid UTF-8 byte sequence: ignore
it? Decode it to some arbitrary value? For example, what should happen if you
find a stray 0xE9? Does it match \xe9? Everybody has a different opinion about
handling invalid UTF sequences, and this would lead to never-ending arguments
on pcre-dev.
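To make the ambiguity concrete, here is a small sketch (using Python's decoder
as a stand-in for any UTF-8 consumer) showing three common but mutually
incompatible policies for the same stray 0xE9 byte:

```python
# A stray 0xE9 byte: it is "é" in Latin-1, but in UTF-8 it is a lead byte
# that promises two continuation bytes which never arrive.
data = b"caf\xe9"

# Policy 1: reject the input outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Policy 2: substitute U+FFFD (the replacement character) for the bad sequence.
replaced = data.decode("utf-8", errors="replace")

# Policy 3: silently drop the bad byte.
ignored = data.decode("utf-8", errors="ignore")
```

Each policy gives a different answer to "does this subject contain \xe9?",
which is exactly the disagreement described above.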

- The bigger problem is performance. Handling invalid UTF sequences requires a
lot of extra checks and kills many optimizations. For example, when we
encounter a 0xC5 lead byte, we know that the input buffer holds at least one
more byte, so we do not check the buffer size. We also assume that the highest
two bits of the second byte are 10, and do not verify this when decoding the
character. Supporting invalid input would also kill other optimizations, such
as the Boyer-Moore-like search in the JIT. The biggest problem is that
everybody would suffer this performance regression, including those who pass
valid UTF strings.
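The cost of those extra checks can be seen in a minimal sketch of a two-byte
UTF-8 decode (illustrative only; the real PCRE decoders are C macros, and the
function names here are made up). The trusting version is what an engine can
do when input is guaranteed valid; the validating version shows the checks it
must add otherwise:

```python
def decode_2byte_trusting(buf: bytes, i: int):
    # Assumes buf[i] is a valid 2-byte lead (0xC2..0xDF) and that a
    # continuation byte follows: no length check, no 10xxxxxx check.
    return ((buf[i] & 0x1F) << 6) | (buf[i + 1] & 0x3F), i + 2

def decode_2byte_validating(buf: bytes, i: int):
    # The extra work needed once invalid input must be survived.
    if i + 1 >= len(buf):
        raise ValueError("truncated UTF-8 sequence")
    b2 = buf[i + 1]
    if b2 & 0xC0 != 0x80:           # top two bits must be 10
        raise ValueError("bad continuation byte")
    return ((buf[i] & 0x1F) << 6) | (b2 & 0x3F), i + 2
```

On b"\xc5\x91" both return the code point U+0151 ("ő"), but the validating
path adds a bounds check and a bit test on every character decoded.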

For these reasons, such a change will never happen.

But there are alternatives.

* The best solution is multi-threaded grepping: one thread reads the file data
and replaces or removes invalid UTF-8 sequences, and the other thread runs
PCRE on the filtered data. Alternatively, you can convert everything to
UTF-32 and use pcre32.
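The filtering step can be sketched as follows (a minimal sketch: Python's
built-in decoder does the sanitizing, and Python's re module stands in for
PCRE; a real grep would hand the cleaned buffer to PCRE in a worker thread):

```python
import re

def sanitize(chunk: bytes) -> bytes:
    # Round-trip through a validating decoder, replacing every invalid
    # sequence with U+FFFD, so the matcher only ever sees well-formed UTF-8.
    return chunk.decode("utf-8", errors="replace").encode("utf-8")

raw = b"needle \xc5 haystack"   # 0xC5 with no continuation byte: invalid
clean = sanitize(raw)
match = re.search(rb"needle", clean)
```

The matcher then needs no invalid-input handling at all, which is the point
of the two-thread design: the sanitizing cost is paid once, off the hot path.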

* The other solution is improving PCRE's survivability: if the buffer passed
to PCRE has at least one zero character code before the invalid input buffer,
and maximum UTF character length - 1 (6 in UTF-8, 1 in UTF-16) zeroes after
the buffer, we could guarantee that PCRE does not crash and does not enter
infinite loops. Nothing else is guaranteed: if you search for /ab/ and the
invalid UTF sequence contains "ab", it might not be found (or might be found
by the interpreter but not the JIT, or vice versa). If you use pcre32, there
is no need for any extra byte extension.

Regards,
Zoltan





