bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 26 Sep 2014 01:48:00 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2

Zoltán Herczeg wrote:

> Just consider these two examples, where \x9c is an incorrectly encoded Unicode
> codepoint:
>
> /(?<=\x9c)#/
>
> Does it match \xd5\x9c# starting from #?

No, because the input does not contain a \x9c encoding error. Encoding errors match only themselves, not parts of other characters. That is how the glibc matchers behave, and it's what users expect.
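
To make the lookbehind case concrete, here is a rough Python sketch. It is not libpcre or glibc code; it assumes Python's "surrogateescape" error handler as a stand-in for "encoding errors match only themselves" (each invalid byte becomes its own lone-surrogate character), and the function name is made up for illustration.

```python
import re

def lookbehind_matches(data: bytes) -> bool:
    """Report whether '#' preceded by the stray byte \\x9c occurs in data."""
    # Assumption of this sketch: invalid byte 0xNN decodes to U+DCNN,
    # so a stray \x9c becomes the single character U+DC9C.
    text = data.decode("utf-8", errors="surrogateescape")
    return re.search("(?<=\udc9c)#", text) is not None
```

With this, b"\xd5\x9c#" does not match, because \xd5\x9c decodes to the valid character U+055C and no stray \x9c is left over, whereas b"\x9c#" does match.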

> Noticing errors during a backward scan is complicated.

It's doable, and it's the right thing to do.

> /[\x9c-\x{ffff}]/
>
> What does this range define exactly?

Range expressions have implementation-defined semantics in POSIX. For PCRE you can do what you like. I suggest mapping encoding-error bytes into characters outside the Unicode range; that's what Emacs does, I think, and it's simple and easy to explain to users. It's not a big deal either way.

> What kind of invalid and valid UTF byte sequences are inside (and outside) the
> bounds?

Just treat encoding-error bytes like everything else. In effect, extend the encoding to allow any byte sequence, and add a few "characters" outside the Unicode range, one for each invalid UTF-8 byte.
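
A minimal sketch of that extended encoding, in Python rather than C and not taken from any existing library: valid sequences decode to their usual code points, and each invalid byte b decodes to a value just past the Unicode range. The name decode_extended and the base 0x110000 are assumptions chosen for illustration.

```python
ERROR_BASE = 0x110000  # hypothetical: first value past the Unicode range

def decode_extended(data: bytes) -> list[int]:
    """Decode UTF-8, mapping each invalid byte b to ERROR_BASE + b."""
    out: list[int] = []
    i = 0
    while i < len(data):
        try:
            out.extend(ord(c) for c in data[i:].decode("utf-8"))
            break  # the rest of the buffer was valid
        except UnicodeDecodeError as e:
            # e.start is relative to the slice data[i:].
            # Everything before it is valid, so decode it normally.
            out.extend(ord(c) for c in data[i:i + e.start].decode("utf-8"))
            # The offending byte becomes one out-of-range "character".
            out.append(ERROR_BASE + data[i + e.start])
            i += e.start + 1
    return out
```

For instance, decode_extended(b"\xd5\x9c#") gives [0x55C, 0x23], while decode_extended(b"\x9c#") gives [0x11009C, 0x23]. The re-slicing keeps the sketch short at the cost of speed; a real decoder would walk the buffer once.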

> Caseless matching is also another question: does /\xe9/ match \xc3\x89 or
> the invalid UTF byte sequence \xc9?

Sorry, I don't quite follow, but encoding errors aren't letters and don't have case. They match only themselves.

> What unicode properties does an invalid codepoint have?

The minimal ones.

> depending on their needs, everybody has different answers to these questions.

That's fine. Just implement reasonable defaults, and provide options if people have needs that differ from the defaults. That's easier for libpcre than for grep, since libpcre users (who are programmers) can reasonably be expected to be more sophisticated about this sort of thing than grep users (who are not necessarily programmers).

> Imagine if you would need to add buffer end and other bit checks.

Of course it will be more expensive to check for UTF-8 as you go, than to assume the input is valid UTF-8. But again, we're not talking about the PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the input is valid UTF-8, and currently does so inefficiently. In the non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go, than to have a prepass that checks for UTF-8. This is because the prepass must be stupid (it must check the entire input buffer) whereas the matcher can be smart (it often can do its work without checking the entire input buffer). This is one reason libpcre is slower than the glibc matchers.
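
A toy illustration of the cost difference, in Python and not modeled on libpcre internals: both functions look for a literal, and each also reports how many bytes it had to validate. All names are hypothetical, and the needle is assumed to be valid UTF-8.

```python
def char_len(buf: bytes, i: int) -> int:
    """Length of the valid UTF-8 character starting at buf[i], or 0 if none."""
    for n in range(1, 5):
        try:
            buf[i:i + n].decode("utf-8")
            return n
        except UnicodeDecodeError:
            continue
    return 0

def search_with_prepass(buf: bytes, needle: bytes) -> tuple[int, int]:
    """Validate the whole buffer, then search: (offset, bytes_validated)."""
    buf.decode("utf-8")  # raises UnicodeDecodeError on any encoding error
    return buf.find(needle), len(buf)

def search_checking_as_you_go(buf: bytes, needle: bytes) -> tuple[int, int]:
    """Validate only what the scan walks over: (offset, bytes_validated)."""
    i = 0
    while i < len(buf):
        if buf.startswith(needle, i):
            return i, i  # bytes past the match were never inspected
        # An encoding error is stepped over as a one-byte "character".
        i += char_len(buf, i) or 1
    return -1, i
```

On a buffer with 'foobar' six bytes in, followed by 2000 more bytes, the prepass version validates all 2012 bytes before it starts matching, while the other validates only the 6 bytes it walked over. The second version also keeps working when the buffer contains encoding errors, which the strict prepass in this sketch does not.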

Obviously it would be some work to build a libpcre that runs faster in the non-PCRE_NO_UTF8_CHECK case, without hurting performance in the PCRE_NO_UTF8_CHECK case. But it could be done, if someone had the time to do it.

> The question is, who would be willing to do this work.

Not me.  :-)

>> That would chew up CPU resources unnecessarily
>
> Yeah but you could add a flag to enable this :)

I'm not sure it'd be popular to add a --drain-battery option to grep. :)

>> The use case that prompted
>> this bug report is someone using 'grep -r' to search for strings like
>> 'foobar' in binary data, and this use case would not work with this
>> suggested solution.
>
> In this case, I would simply disable UTF-8 decoding.

I suggested that already, but the user (e.g., see the last paragraph of <http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated UTF-8 patterns in binary data. For example, I expect the user wants the pattern 'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he can't simply use unibyte processing.
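
Here is a small Python sketch of why unibyte processing falls short for that pattern. It uses Python's re module and the "surrogateescape" error handler purely as stand-ins; it says nothing about how grep or libpcre would implement this, and the function names are invented.

```python
import re

def unibyte_search(data: bytes):
    """Match byte by byte: '.' consumes exactly one byte."""
    return re.search(rb"Lef.vre", data)

def utf8_search(data: bytes):
    """Match character by character, tolerating encoding errors."""
    # Assumption of this sketch: each invalid byte is decoded to its own
    # lone surrogate (U+DC80..U+DCFF) rather than rejecting the buffer.
    text = data.decode("utf-8", errors="surrogateescape")
    return re.search(r"Lef.vre", text)
```

Given binary junk around the UTF-8 bytes of 'Lefèvre', unibyte_search finds nothing, because 'è' is two bytes and '.' covers only one of them, while utf8_search finds the name despite the surrounding encoding errors.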




