bug-grep

bug#18266: grep -P and invalid exits with error


From: Paul Eggert
Subject: bug#18266: grep -P and invalid exits with error
Date: Tue, 09 Sep 2014 12:59:27 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0

Norihiro Tanaka wrote:
> I'm worried that to re-run for invalid UTF-8 makes slowness for searching
> of the large number of binary files.

Yes, that could be a problem, but even so it's better for grep to report matches than to give up and fail. Perhaps someone could optimize this later but, to be honest, given how flaky libpcre is, we're probably better off spending our scarce development resources elsewhere.

Santiago's latest patch still had some problems, unfortunately. It could mishandle '^' by having it match just past an encoding error. It was less efficient than it could be, as it checked all valid bytes for UTF-8 validity twice. If I understand PCRE correctly (which quite possibly I don't), it also appeared to mishandle matches that contain nested subexpressions. But the worst part was that the code was too complicated (and this was true even before Santiago's patch was applied). So I rewrote it and installed the attached patch instead. Please give it a try.
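
To make the '^' issue concrete, here is a rough Python sketch of the general idea of treating encoding errors as non-matching. It is not the attached patch (which is C code written against libpcre), and the names valid_utf8_segments and line_matches are made up for this illustration. The sketch splits a line into its maximal valid UTF-8 runs, searches each run separately, and lets '^' match only in a run that begins at the real start of the line. PCRE has the PCRE_NOTBOL option for that purpose; Python's re module does not, so the sketch imitates it by prepending a placeholder character and starting the search at index 1. That workaround is an assumption of the sketch and has a known flaw: lookbehind assertions and \b can see the placeholder. The sketch also ignores the matching problem for '$' just before an encoding error.

```python
import re
from typing import Iterator, Tuple


def valid_utf8_segments(line: bytes) -> Iterator[Tuple[int, int, bool]]:
    """Yield (start, end, at_line_start) for each maximal valid UTF-8 run."""
    if not line:
        # An empty line is one empty, valid segment.
        yield 0, 0, True
        return
    pos = 0
    while pos < len(line):
        try:
            line[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice line[pos:].
            bad_start = pos + err.start
            if bad_start > pos:
                yield pos, bad_start, pos == 0
            pos += err.end  # resume just past the invalid bytes
        else:
            yield pos, len(line), pos == 0
            return


def line_matches(pattern: str, line: bytes) -> bool:
    """Return True if pattern matches inside some valid UTF-8 run of line."""
    regex = re.compile(pattern)
    for start, end, at_line_start in valid_utf8_segments(line):
        text = line[start:end].decode("utf-8")
        if at_line_start:
            found = regex.search(text)
        else:
            # Imitate "not beginning of line": with a start index of 1,
            # '^' cannot match at the first character of this run.
            found = regex.search("\ufffd" + text, 1)
        if found:
            return True
    return False
```

For example, with the line b"ab\xffcd" the pattern "cd" matches, but "^cd" does not, because "cd" only follows an encoding error rather than starting the line. A pattern such as "b.c" does not match either, since no match may span the invalid byte.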

Attachment: 0001-grep-P-now-treats-invalid-UTF-8-input-as-non-matchin.patch
Description: Text document

