bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 17:59:41 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:

> This is still better than no optimization at all.

We'd have to see; not every optimization is worth the trouble.

> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.

It'd be better if there weren't a new grep option simply to avoid a libpcre performance bug.

> Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity of UTF-8 string". My impression is that the actual history of behavior changes is more complicated than what that simple summary would suggest.

> This doesn't introduce undefined behavior, just a different
> behavior

Again, it'd be better if grep Just Worked.

> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, many often consisting of a single
> byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can contain encoding errors. This does not mean "the problem is on the grep side".
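
To illustrate the retry pattern described in the quoted text, here is a rough Python sketch of prescreening a buffer for UTF-8 validity and splitting it at encoding errors. It is not the code in pcresearch.c; the function name `valid_utf8_spans` is hypothetical, and Python's decoder stands in for whatever validity check the caller uses. Each span it yields would need its own call to the matcher, which is where the cost comes from on binary data.

```python
def valid_utf8_spans(buf: bytes):
    """Yield (start, end) for each maximal run of valid UTF-8 in buf.

    Hypothetical helper for illustration only. Re-slicing the buffer on
    every error keeps the sketch short but makes it quadratic in the
    worst case.
    """
    pos = 0
    while pos < len(buf):
        try:
            buf[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice buf[pos:].
            bad_start = pos + err.start
            if bad_start > pos:
                yield (pos, bad_start)
            pos += err.end  # skip the invalid bytes and retry
        else:
            yield (pos, len(buf))
            return


if __name__ == "__main__":
    # Mostly-text input yields a few long spans.
    print(list(valid_utf8_spans(b"ab\xffcd")))
    # Binary-like input yields short spans, often a single byte.
    print(list(valid_utf8_spans(b"\xff\xfea\xff")))
```

On mostly valid text the loop runs only a few times; on binary data nearly every byte can end a span, so the matcher is restarted many times on tiny inputs.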

> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening the data. That's the fundamental problem here.

> I often want to take binary files into account

In those cases I suggest using a unibyte C locale.

> I still want "." to match a single (valid) UTF-8 character.

How about this idea instead? Use a unibyte C locale, and write a unibyte regular expression C that matches a single valid UTF-8 character (using whatever definition you like for UTF-8). Then, you can use . to match single bytes and C to match characters. This gives you all the power you need, without the slowdown due to UTF-8 processing, a slowdown that will be inevitable no matter how we change grep or libpcre.
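
As a minimal sketch of what such an expression C might look like, the following uses Python's `re` module on `bytes` as a stand-in for a unibyte matcher. It is not grep or libpcre itself. The byte ranges assume the well-formed UTF-8 definition of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF), and a different definition would change the ranges. The names `UTF8_CHAR` and `find_chars` are hypothetical.

```python
import re

# One valid UTF-8 character, written as a byte-level pattern.
# Assumes the RFC 3629 definition of well-formed UTF-8.
# Must be compiled with re.VERBOSE because of the layout whitespace.
UTF8_CHAR = rb"""(?:
    [\x00-\x7F]                          # 1 byte: ASCII
  | [\xC2-\xDF][\x80-\xBF]               # 2 bytes, no overlong forms
  | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, no overlong forms
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, general case
  | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, no overlong forms
  | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, general case
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, up to U+10FFFF
)"""


def find_chars(data: bytes):
    """Return every valid UTF-8 character found in data, skipping bad bytes."""
    return [m.group() for m in re.finditer(UTF8_CHAR, data, re.VERBOSE)]


# "." matches one byte; UTF8_CHAR matches one character.
BYTE_BETWEEN = re.compile(rb"x.y", re.DOTALL)
CHAR_BETWEEN = re.compile(rb"x" + UTF8_CHAR + rb"y", re.VERBOSE)

if __name__ == "__main__":
    sample = "x\u00e9y".encode("utf-8")           # b"x\xc3\xa9y"
    print(bool(BYTE_BETWEEN.fullmatch(sample)))   # two bytes between x and y
    print(bool(CHAR_BETWEEN.fullmatch(sample)))   # one character between x and y
    print(find_chars(b"a\xffb"))
```

Because the pattern works on raw bytes, invalid sequences simply fail to match instead of aborting the search, so no separate validity pass over the data is needed.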
