bug#18266: grep -P and invalid exits with error
From: Vincent Lefevre
Subject: bug#18266: grep -P and invalid exits with error
Date: Mon, 1 Sep 2014 10:18:22 +0200
User-agent: Mutt/1.5.23-6361-vl-r59709 (2014-07-25)
On 2014-08-29 06:43:45 -0700, Paul Eggert wrote:
> Thanks, but that patch seems to depend on libpcre internals, in that it
> "knows" that pcre_exec cannot possibly succeed without first checking its
> entire input buffer for invalid UTF-8 bytes. Even if that's true now, it
> reflects a performance bug that might be fixed in a future libpcre version.
If I understand correctly, I don't think that's an internal detail.
The pcreapi(3) man page says about PCRE_NO_UTF8_CHECK:
    [...] Note that this option can also be passed to pcre_exec()
    and pcre_dfa_exec(), to suppress the validity checking of
    subject strings only. If the same string is being matched
    many times, the option can be safely set for the second and
    subsequent matchings to improve performance.
The last sentence implies that the UTF-8 validity check is done on the
whole subject string before any matching is done.
> Also, I don't see why grep needs to copy the buffer when there's an encoding
> error. Why not simply rerun the matcher on the initial prefix that doesn't
> have an encoding-error byte, and then (if that doesn't find a match), try
> matching the suffix after the encoding-error byte? This approach would not
> only avoid the buffer copy, it would avoid knowledge of libpcre internals.
If there are many invalid UTF-8 bytes, this would be slow, IMHO, since
each invalid byte would cost another call to the matcher (it could be
worth a try, though).
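To make the cost of the prefix/suffix approach concrete, here is a minimal
sketch of it in C. It does not use libpcre: the matcher is a hypothetical
callback (matcher_fn), and utf8_seq_len, valid_prefix_len,
match_valid_segments and contains_literal are names made up for this
illustration, not grep or libpcre functions. The point is only that the
matcher is called once per run of valid bytes, so a line with many invalid
bytes means many calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the well-formed UTF-8 sequence starting at p, or 0 if the
   bytes at p do not start one (bad lead byte, truncated sequence,
   overlong form, surrogate, or code point above U+10FFFF). */
size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned long cp, min;
    size_t n, i;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; min = 0x10000; }
    else
        return 0;
    if (avail < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return n;
}

/* Offset of the first encoding-error byte in buf, or len if none. */
size_t valid_prefix_len(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0)
            break;
        i += n;
    }
    return i;
}

/* Hypothetical matcher interface: nonzero means "segment matches". */
typedef int (*matcher_fn)(const unsigned char *seg, size_t seg_len, void *ctx);

/* Run the matcher on each maximal run of valid UTF-8, skipping one
   encoding-error byte at a time.  No copy of the buffer is made.
   Empty runs are skipped, so this sketch ignores empty matches. */
int match_valid_segments(const unsigned char *buf, size_t len,
                         matcher_fn match, void *ctx)
{
    size_t start = 0;

    assert(match != NULL);
    while (start < len) {
        size_t good = valid_prefix_len(buf + start, len - start);
        if (good > 0 && match(buf + start, good, ctx))
            return 1;
        start += good + 1;   /* step over the encoding-error byte */
    }
    return 0;
}

/* Stand-in matcher for the sketch: plain substring search. */
int contains_literal(const unsigned char *seg, size_t seg_len, void *ctx)
{
    const char *needle = ctx;
    size_t n = strlen(needle), i;

    for (i = 0; i + n <= seg_len; i++)
        if (memcmp(seg + i, needle, n) == 0)
            return 1;
    return 0;
}
```

With a real regex matcher in place of contains_literal, a pattern can never
match across an encoding-error byte in this scheme, which may or may not be
the desired semantics.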
But is the copy of the buffer really needed? Couldn't the invalid
UTF8 sequences just be replaced by null bytes?
Note that in case of invalid UTF-8 bytes, in some (many?) cases, the
cause is a binary file (possibly with some text in it), where lines
can be very long. So wouldn't copying the buffer take significantly
more memory?
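As an illustration of the in-place alternative asked about above, here is a
minimal sketch in C that overwrites each invalid byte with a null byte
without copying the buffer. The names utf8_seq_len and zero_invalid_utf8
are made up for this example and are not grep or libpcre functions; how the
result would then be passed to pcre_exec is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence starting at p, or 0 if the
   bytes at p do not start one (bad lead byte, truncated sequence,
   overlong form, surrogate, or code point above U+10FFFF). */
size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned long cp, min;
    size_t n, i;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; min = 0x10000; }
    else
        return 0;
    if (avail < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return n;
}

/* Overwrite every byte that is not part of a well-formed UTF-8
   sequence with a null byte, in place.  Returns the number of bytes
   replaced.  Needs no extra memory, but it modifies the buffer. */
size_t zero_invalid_utf8(unsigned char *buf, size_t len)
{
    size_t replaced = 0;
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0) {
            buf[i] = 0;      /* encoding-error byte becomes NUL */
            replaced++;
            i++;
        } else {
            i += n;          /* keep the valid sequence as is */
        }
    }
    return replaced;
}
```

The trade-off is that the original bytes are lost, so a caller that later
prints the matching line would have to remember and restore them, and a
pattern that can match a null byte would now match where the invalid bytes
were.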
--
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)