bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 17:59:41 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
Vincent Lefevre wrote:
> This is still better than no optimization at all.
We'd have to see; not every optimization is worth the trouble.
> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.
It'd be better if there weren't a new grep option simply to avoid a
libpcre performance bug.
> Could you give some reference?
The pcreunicode man page mentions some of this issue under "Validity of
UTF-8 string". My impression is that the actual history of behavior
changes is more complicated than what that simple summary would suggest.
> This doesn't introduce undefined behavior, just a different
> behavior
Again, it'd be better if grep Just Worked.
> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, many often consisting of a single
> byte), i.e. the problem is on the grep side.
libpcre is not giving 'grep' an efficient way to search data that can
contain encoding errors. This does not mean "the problem is on the grep
side".
> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.
There's another way: fix libpcre so that it works on arbitrary binary
data, without the need for prescreening the data. That's the
fundamental problem here.
> I often want to take binary files into account
In those cases I suggest using a unibyte C locale.
> I still want "." to match a single (valid) UTF-8 character.
How about this idea instead? Use a unibyte C locale, and write a
unibyte regular expression C that matches a single valid UTF-8 character
(using whatever definition you like for UTF-8). Then, you can use . to
match single bytes and C to match characters. This gives you all the
power you need, without the slowdown due to UTF-8 processing, a slowdown
that will be inevitable no matter how we change grep or libpcre.
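
As a rough illustration of that idea, here is a minimal sketch in Python, whose bytes-mode `re` patterns behave much like a unibyte locale: `.` matches one byte. The name `UTF8_CHAR` is made up for this example and plays the role of C. The byte ranges assume the Unicode standard's definition of well-formed UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF); a different definition would need different ranges. This is not grep or libpcre code, only a demonstration of the pattern.

```python
import re

# Hypothetical name for the expression "C": one well-formed UTF-8
# character, spelled out byte by byte (assumed definition: the Unicode
# standard's table of well-formed UTF-8 byte sequences).
UTF8_CHAR = (
    rb"(?:[\x00-\x7F]"                        # 1 byte:  U+0000..U+007F
    rb"|[\xC2-\xDF][\x80-\xBF]"               # 2 bytes: U+0080..U+07FF
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"           # 3 bytes, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"    # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"           # 3 bytes, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"        # 4 bytes, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"            # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"       # 4 bytes, up to U+10FFFF
)

# "a", then one valid UTF-8 character, then "b".
a_char_b = re.compile(b"a" + UTF8_CHAR + b"b")

# "a", then exactly one byte (other than newline), then "b".
a_byte_b = re.compile(rb"a.b")

valid = b"a\xc3\xa9b"   # "a", U+00E9 encoded as two bytes, "b"
broken = b"a\xffb"      # 0xFF can never appear in well-formed UTF-8

print(bool(a_char_b.search(valid)), bool(a_char_b.search(broken)))
print(bool(a_byte_b.search(valid)), bool(a_byte_b.search(broken)))
```

With this split, the character-level pattern accepts the two-byte character and rejects the stray 0xFF byte, while the byte-level pattern does the opposite, so both kinds of match are available in one unibyte search.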