bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 28 Sep 2014 08:09:33 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2

Zoltán Herczeg wrote:

For me the question is whether binary search needs to be supported on PCRE level.

It's purely a performance question. GNU grep already uses libpcre to search binary data, and it works now. It's just slow, that's all. I'm willing to live with this, and tell users "Sorry, but libpcre is not designed to search binary data quickly; if you want speed then don't use grep's -P option." If you're willing to live with this too, we're done.

removing a lot of optimizations.

You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case. Keep them all. It should be just as fast as before. The idea is to have one matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it goes). The former matcher will be just as fast as now, and the latter matcher will be faster than what libpcre has now. I readily concede that this will require some nontrivial coding, but I don't concede that it will remove optimizations or make libpcre slower. It should make libpcre faster; that's the point.
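To make "checks validity as it goes" concrete, here is a minimal sketch of a per-character check that a validating matcher might call each time it is about to consume a character, instead of validating the whole buffer up front. The helper name utf8_char_len and its signature are hypothetical; nothing here is taken from libpcre.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libpcre: return the length (1..4) of
   the valid UTF-8 character starting at buf[pos], or 0 if the bytes
   there are not a valid encoding.  Requires pos < len.  */
static size_t
utf8_char_len (const unsigned char *buf, size_t len, size_t pos)
{
  assert (pos < len);
  unsigned char c = buf[pos];
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
  size_t n;

  if (c < 0x80)
    return 1;                           /* ASCII */
  else if (0xC2 <= c && c <= 0xDF)
    n = 2;
  else if (0xE0 <= c && c <= 0xEF)
    {
      n = 3;
      if (c == 0xE0) lo = 0xA0;         /* reject overlong forms */
      if (c == 0xED) hi = 0x9F;         /* reject UTF-16 surrogates */
    }
  else if (0xF0 <= c && c <= 0xF4)
    {
      n = 4;
      if (c == 0xF0) lo = 0x90;         /* reject overlong forms */
      if (c == 0xF4) hi = 0x8F;         /* reject values above U+10FFFF */
    }
  else
    return 0;                           /* stray continuation or bad lead */

  if (len - pos < n)
    return 0;                           /* truncated at end of buffer */
  if (buf[pos + 1] < lo || hi < buf[pos + 1])
    return 0;
  for (size_t i = 2; i < n; i++)
    if ((buf[pos + i] & 0xC0) != 0x80)
      return 0;
  return n;
}
```

The cost is paid only for bytes the matcher actually visits, which is the property the second matcher would rely on; a PCRE_NO_UTF8_CHECK matcher would simply never call such a function.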

You have a 100 byte long buffer, and you start matching from byte 50.

Grep already does that sort of thing. And it's smart enough to start matching only at character boundaries. It's not libpcre's job to worry about this; the caller can worry about it.
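As an illustration of a caller taking care of character boundaries itself, the following sketch moves a proposed start offset forward past any UTF-8 continuation bytes (those of the form 10xxxxxx). The function name next_char_boundary is hypothetical and this is not grep's actual code; it only shows the kind of adjustment a caller can make before handing an offset to the matcher.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not grep's code: return the smallest offset
   >= pos that is not in the middle of a UTF-8 character, i.e. that does
   not point at a continuation byte.  Requires pos <= len.  */
static size_t
next_char_boundary (const unsigned char *buf, size_t len, size_t pos)
{
  assert (pos <= len);
  while (pos < len && (buf[pos] & 0xC0) == 0x80)
    pos++;
  return pos;
}
```

For the example in the quoted text, a caller holding a 100-byte buffer and wanting to start near byte 50 would pass next_char_boundary (buf, 100, 50) as the start offset. In valid UTF-8 the loop runs at most three times.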

For me this is way too many checks, and affects compiler optimizations too much.

The code you posted could be made faster than that; among other things there should not be an unbounded backward scan. And even the code you posted would often be faster than what's in libpcre now. That early UTF-8 validity prepass is a killer.
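A bounded backward scan can be sketched as follows. Since a UTF-8 character is at most four bytes long, finding the start of the character that contains a given byte never requires looking back more than three bytes. The name char_start is hypothetical, and this is not the code under discussion in the thread, just one way to cap the scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the offset of the first byte of the
   character containing buf[pos], stepping back over at most three
   continuation bytes.  On invalid input (a longer run of continuation
   bytes) it stops after three steps rather than scanning to the
   start of the buffer.  Requires pos < len.  */
static size_t
char_start (const unsigned char *buf, size_t len, size_t pos)
{
  assert (pos < len);
  size_t back = 0;
  while (back < 3 && back < pos && (buf[pos - back] & 0xC0) == 0x80)
    back++;
  return pos - back;
}
```

The result is only a candidate start; whether the bytes from there form a valid character still has to be checked by whatever validation the matcher performs as it goes.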





