bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Paul Eggert
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Tue, 30 Sep 2014 12:39:17 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 09/30/2014 11:10 AM, Zoltán Herczeg wrote:

>> Grep already does that sort of thing.  And it's smart enough to start matching
>> only at character boundaries.  It's not libpcre's job to worry about this; the
>> caller can worry about it.
>
> Thank you for bringing this up. I don't see any point of reimplementing what is
> already there.

Sorry, it sounds like my earlier comment was unclear. GNU grep is smart enough to start matching at character boundaries without checking the validity of the input data. This helps it run faster. However, because libpcre requires a validity prepass, grep -P must slow down and do the validity check one way or another. Grep does this only when libpcre is used, and that's one reason grep -P is slower than plain grep.

It's not a question of duplicating code: grep already has code to validate binary data. It's a question of performance. Requiring a prepass for validity checking is typically slower (or takes more energy, or whatever) than checking validity on the fly. And in many cases going multithreaded would just make matters worse.
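
To make the distinction concrete, here is a rough sketch in Python. It is not taken from grep or libpcre; the function names are made up for illustration, and Python's built-in bytes search stands in for a real matcher. The first search only checks that a candidate match starts on a character boundary and never validates the buffer. The second does a full validity pass over every byte before searching at all, which is the kind of extra work a prepass requirement implies.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def find_at_boundaries(haystack: bytes, needle: bytes) -> int:
    """Return the first offset where needle occurs starting on a character
    boundary, or -1. The haystack is never checked for validity."""
    if not needle:
        return 0
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos < 0:
            return -1
        # Reject candidates that begin in the middle of a multibyte character.
        if not is_continuation(haystack[pos]):
            return pos
        start = pos + 1


def find_with_prepass(haystack: bytes, needle: bytes) -> int:
    """Validate the whole buffer first (an extra full pass), then search."""
    try:
        haystack.decode("utf-8")  # validity prepass: touches every byte
    except UnicodeDecodeError:
        raise ValueError("invalid UTF-8 in input")
    return find_at_boundaries(haystack, needle)
```

On valid input both functions return the same offset, but the second has read the buffer twice. On input containing a stray byte such as 0xFF, the first still finds matches elsewhere in the buffer, while the second rejects the input before matching starts. Timings in Python say nothing about grep itself; the sketch only shows where the extra pass sits.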

I can understand that you don't want to take on the burden of making a nontrivial libpcre performance improvement. Also, I hope 'grep -P' performance, though not great, is good enough now to satisfy most users. So perhaps we should just give the topic a rest.
