bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Paul Eggert
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Fri, 12 Sep 2014 17:59:41 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:

This is still better than no optimization at all.


We'd have to see; not every optimization is worth the trouble.

if the behavior is chosen by an option, the user would be aware
of the meaning of the output, so that this won't really matter.

It'd be better if there wasn't a new grep option simply to avoid a libpcre performance bug.

Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity of UTF-8 string". My impression is that the actual history of behavior changes is more complicated than what that simple summary would suggest.

This doesn't introduce undefined behavior, just a different
behavior


Again, it'd be better if grep Just Worked.

I suppose that this is due
to the many retries from the pcresearch.c code on binary files (the
line is split into many sublines, many often consisting of a single
byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can contain encoding errors. This does not mean "the problem is on the grep side".
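To illustrate the kind of prescreening being discussed, here is a rough Python sketch that splits a buffer into its valid UTF-8 runs, skipping the bytes that are encoding errors. It is not the pcresearch.c code; the function name valid_utf8_runs is hypothetical, and the sketch only shows why a buffer full of encoding errors breaks up into many short pieces, each of which would need its own matcher call.

```python
def valid_utf8_runs(data: bytes):
    """Yield (start, end) spans of `data` that decode cleanly as UTF-8.

    Illustrative only: each span stands for one "subline" that a caller
    would have to hand to the matcher separately.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice data[pos:].
            bad_start = pos + err.start
            if bad_start > pos:
                yield (pos, bad_start)
            pos += err.end  # skip the offending byte(s) and retry
        else:
            yield (pos, len(data))
            return


# Binary-looking input: every invalid byte forces a new split.
print(list(valid_utf8_runs(b"ab\xffcd\xfe\xfee")))
```

On mostly binary input nearly every byte can start a new span, which is the "many retries" effect described in the quoted text.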

I don't see how this
could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening the data. That's the fundamental problem here.

I often want to take binary files into account


In those cases I suggest using a unibyte C locale.


I still want "." to match a single (valid) UTF-8 character.

How about this idea instead? Use a unibyte C locale, and write a unibyte regular expression C that matches a single valid UTF-8 character (using whatever definition you like for UTF-8). Then, you can use . to match single bytes and C to match characters. This gives you all the power you need, without the slowdown due to UTF-8 processing, a slowdown that will be inevitable no matter how we change grep or libpcre.
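As a minimal sketch of that idea, the byte-level pattern below plays the role of the expression called C above. It is written with Python's re module on bytes rather than with grep, and it assumes one particular definition of "valid UTF-8 character": the well-formed byte sequences of the Unicode standard, which exclude overlong forms and surrogates. The name UTF8_CHAR is made up for this example.

```python
import re

# One well-formed UTF-8 character, expressed purely in terms of bytes.
UTF8_CHAR = (
    rb"(?:[\x00-\x7F]"                      # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2 bytes (no overlong C0/C1)
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3 bytes, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4 bytes, excluding overlongs
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"     # 4 bytes, up to U+10FFFF
)

any_byte = re.compile(rb"a.b")                      # "." is a single byte
any_char = re.compile(rb"a" + UTF8_CHAR + rb"b")    # C is a single character

e_acute = b"a\xc3\xa9b"   # "a", U+00E9 encoded as two bytes, "b"
garbage = b"a\xffb"       # "a", an invalid byte, "b"

print(bool(any_char.search(e_acute)))  # True: C spans the two-byte character
print(bool(any_byte.search(e_acute)))  # False: "." covers only one byte
print(bool(any_byte.search(garbage)))  # True: "." accepts any single byte
print(bool(any_char.search(garbage)))  # False: 0xFF is not a valid character
```

With grep itself the same pattern would be spelled in the chosen regex syntax and run under a unibyte locale such as LC_ALL=C; the exact spelling depends on the matcher in use.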
