From: | Paul Eggert |
Subject: | bug#20526: BUG: text file is detected as binary |
Date: | Tue, 12 May 2015 17:08:42 -0700 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 |
Eric Blake wrote:
> I'm still a bit worried that encoding errors encountered on input, even though they don't match for output, may still cause issues for some patterns (we've had cases of encoding errors causing 'grep -P' to go into an infinite loop, for example);
Yes, that's right. We can't go back to the old way of doing things. Encoding errors in the data must not be matched by any regular expression (not even "."). 'grep -P' won't loop if we never pass encoding errors to the PCRE matcher, so that's what we gotta do.
> but yes, as the behavior is undefined, we are still justified in adopting those heuristics, if someone is willing to contribute a patch along those lines.
The hard part about it (and the reason I haven't written up a patch yet) is making sure the above property holds, while continuing to have good performance in the typical case where the input is validly encoded. I suppose it's OK, though, if the change hurts performance only for the -P case, since -P is so slow anyway.
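
To make the property above concrete, here is a rough sketch in Python of never handing encoding errors to the matcher. It is not grep's code and not a proposed patch. The names valid_utf8_segments and search_valid_only are made up for illustration, UTF-8 is assumed as the encoding, and Python's re module stands in for the PCRE matcher. The line is split into validly encoded runs and the pattern is tried on each run separately, so no match can include or span an invalid byte, not even via ".".

```python
import re


def valid_utf8_segments(data: bytes):
    """Yield (byte_offset, text) for each validly encoded UTF-8 run in data.

    Hypothetical helper: invalid bytes are skipped and never yielded.
    """
    pos = 0
    while pos < len(data):
        try:
            # Typical case: the rest of the line is valid, so one decode suffices.
            yield pos, data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:].
            if err.start > 0:
                yield pos, data[pos:pos + err.start].decode("utf-8")
            # Step over the invalid byte sequence.
            pos += err.end


def search_valid_only(pattern: str, line: bytes):
    """Return (segment_byte_offset, matched_text) for the first match, or None.

    The matcher only ever sees validly encoded text.
    """
    rx = re.compile(pattern)
    for offset, text in valid_utf8_segments(line):
        m = rx.search(text)
        if m:
            return offset, m.group(0)
    return None


if __name__ == "__main__":
    line = b"ab\xffcd"
    print(search_valid_only("b.c", line))   # "." must not match the bad byte
    print(search_valid_only("c.", line))
```

When the input is validly encoded, the first decode succeeds and the matcher runs once over the whole line, which is the case that has to stay fast. One limitation of this sketch: anchors and lookbehind are evaluated per run, so a pattern such as "^c" would match at the start of a run rather than only at the start of the line. A real patch would have to deal with that.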