From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 09:48:08 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
Vincent Lefevre wrote:
> I think that (1) is rather simple
You may think it simple for the REs you're interested in, but someone else might say "hey! that doesn't cover the REs *I'm* interested in!". Solving the problem in general is nontrivial.
> But this is already the case:
I was assuming the case where the input data contains an encoding error (not a null byte) that is transformed to a null byte before the user sees it.
Really, this null-byte-replacement business would be just too weird. I don't see it as a viable general-purpose solution.
> Parsing UTF-8 is standard.
It's a standard that keeps evolving, different releases of libpcre have done it differently, and I expect things to continue to evolve. It's not something I would want to maintain separately from libpcre itself.
Have you investigated why libpcre is so *slow* when doing UTF-8 checking? Why would libpcre be 10x slower than grep's checking by hand?!? I don't get it. Surely there's a simple fix on the libpcre side.
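To make concrete what "checking by hand" can look like, here is a minimal single-pass UTF-8 validity check in C. It is only a sketch, not grep's or libpcre's actual code, but it shows why such a check can be done cheaply in one linear scan over the buffer:

    #include <stdbool.h>
    #include <stddef.h>

    /* A minimal single-pass UTF-8 validity check -- a sketch, not
       grep's or libpcre's actual code.  It accepts 1- to 4-byte
       sequences and rejects overlong forms, UTF-16 surrogates, and
       code points above U+10FFFF.  */
    static bool
    utf8_valid (unsigned char const *s, size_t n)
    {
      for (size_t i = 0; i < n; )
        {
          unsigned char c = s[i];
          size_t len;
          unsigned char lo = 0x80, hi = 0xBF;  /* bounds for 2nd byte */

          if (c < 0x80)
            { i++; continue; }                 /* ASCII */
          else if (c < 0xC2)
            return false;                      /* stray continuation or overlong */
          else if (c < 0xE0)
            len = 2;
          else if (c < 0xF0)
            {
              len = 3;
              if (c == 0xE0) lo = 0xA0;        /* reject overlong */
              if (c == 0xED) hi = 0x9F;        /* reject surrogates */
            }
          else if (c < 0xF5)
            {
              len = 4;
              if (c == 0xF0) lo = 0x90;        /* reject overlong */
              if (c == 0xF4) hi = 0x8F;        /* reject > U+10FFFF */
            }
          else
            return false;

          if (n - i < len || s[i + 1] < lo || hi < s[i + 1])
            return false;
          for (size_t k = 2; k < len; k++)
            if ((s[i + k] & 0xC0) != 0x80)
              return false;
          i += len;
        }
      return true;
    }

Something along these lines touches each byte once with a couple of comparisons, so it is hard to see where a 10x gap against an equivalent check inside libpcre would come from.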
> I often want to take binary files into account
In those cases I suggest using a unibyte C locale. This should solve the performance problem. Really, unibyte is the way to go here; it's going to be faster for large binary scanning no matter what is done about this UTF-8 business.
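In practice that means invoking grep as something like `LC_ALL=C grep -P ...`. The short C program below illustrates why the unibyte locale sidesteps the problem; it is only a sketch, and the `en_US.UTF-8` locale name is just an example that may not be installed on every system:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main (void)
    {
      /* In the C locale every byte is its own character
         (MB_CUR_MAX == 1), so no multibyte decoding or UTF-8
         validation is needed before matching.  */
      setlocale (LC_ALL, "C");
      printf ("C locale:     MB_CUR_MAX = %d\n", (int) MB_CUR_MAX);

      /* In a UTF-8 locale a character may span up to 4 bytes, so the
         input has to be decoded and its encoding checked.  The locale
         name here is just an example and may not be installed.  */
      if (setlocale (LC_ALL, "en_US.UTF-8"))
        printf ("UTF-8 locale: MB_CUR_MAX = %d\n", (int) MB_CUR_MAX);
      return 0;
    }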