From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 09:48:08 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
Vincent Lefevre wrote:
> I think that (1) is rather simple
You may think it simple for the REs you're interested in, but someone else might say "hey! that doesn't cover the REs *I'm* interested in!". Solving the problem in general is nontrivial.
> But this is already the case:
I was assuming the case where the input data contains an encoding error (not a null byte) that is transformed to a null byte before the user sees it.
Really, this null-byte-replacement business would be just too weird. I don't see it as a viable general-purpose solution.
> Parsing UTF-8 is standard.
It's a standard that keeps evolving, different releases of libpcre have done it differently, and I expect things to continue to evolve. It's not something I would want to maintain separately from libpcre itself.
Have you investigated why libpcre is so *slow* when doing UTF-8 checking? Why would libpcre be 10x slower than grep's checking by hand?!? I don't get it. Surely there's a simple fix on the libpcre side.
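To make concrete what "checking by hand" can look like, here is a minimal single-pass UTF-8 validity check in C. It is only a sketch, not grep's or libpcre's actual code, but it shows why such a check can be done cheaply in one linear scan over the buffer:

    #include <stdbool.h>
    #include <stddef.h>

    /* A minimal single-pass UTF-8 validity check -- a sketch, not
       grep's or libpcre's actual code.  It accepts 1- to 4-byte
       sequences and rejects overlong forms, UTF-16 surrogates, and
       code points above U+10FFFF.  */
    static bool
    utf8_valid (unsigned char const *s, size_t n)
    {
      for (size_t i = 0; i < n; )
        {
          unsigned char c = s[i];
          size_t len;
          unsigned char lo = 0x80, hi = 0xBF;  /* bounds for 2nd byte */

          if (c < 0x80)
            { i++; continue; }                 /* ASCII */
          else if (c < 0xC2)
            return false;                      /* stray continuation or overlong */
          else if (c < 0xE0)
            len = 2;
          else if (c < 0xF0)
            {
              len = 3;
              if (c == 0xE0) lo = 0xA0;        /* reject overlong */
              if (c == 0xED) hi = 0x9F;        /* reject surrogates */
            }
          else if (c < 0xF5)
            {
              len = 4;
              if (c == 0xF0) lo = 0x90;        /* reject overlong */
              if (c == 0xF4) hi = 0x8F;        /* reject > U+10FFFF */
            }
          else
            return false;

          if (n - i < len || s[i + 1] < lo || hi < s[i + 1])
            return false;
          for (size_t k = 2; k < len; k++)
            if ((s[i + k] & 0xC0) != 0x80)
              return false;
          i += len;
        }
      return true;
    }

Something along these lines touches each byte once with a couple of comparisons, so it is hard to see where a 10x gap against an equivalent check inside libpcre would come from.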
> I often want to take binary files into account
In those cases I suggest using a unibyte C locale. This should solve the performance problem. Really, unibyte is the way to go here; it's going to be faster for large binary scanning no matter what is done about this UTF-8 business.
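In practice that means invoking grep as something like `LC_ALL=C grep -P ...`. The short C program below illustrates why the unibyte locale sidesteps the problem; it is only a sketch, and the `en_US.UTF-8` locale name is just an example that may not be installed on every system:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main (void)
    {
      /* In the C locale every byte is its own character
         (MB_CUR_MAX == 1), so no multibyte decoding or UTF-8
         validation is needed before matching.  */
      setlocale (LC_ALL, "C");
      printf ("C locale:     MB_CUR_MAX = %d\n", (int) MB_CUR_MAX);

      /* In a UTF-8 locale a character may span up to 4 bytes, so the
         input has to be decoded and its encoding checked.  The locale
         name here is just an example and may not be installed.  */
      if (setlocale (LC_ALL, "en_US.UTF-8"))
        printf ("UTF-8 locale: MB_CUR_MAX = %d\n", (int) MB_CUR_MAX);
      return 0;
    }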