bug#18266: handling bytes not part of the charset, and other garbage

bug-grep

From:	Vincent Lefevre
Subject:	bug#18266: handling bytes not part of the charset, and other garbage
Date:	Sat, 13 Sep 2014 00:40:33 +0200
User-agent:	Mutt/1.5.23-6361-vl-r59709 (2014-07-25)

On 2014-09-12 14:39:35 -0700, Paul Eggert wrote:
> On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
> >an option to control what happens on encoding errors would be
> >better and sufficient.
> 
> It might suffice for your use cases, but it's more complicated and less
> flexible than being able to match bytes within the regular expression.

But IMHO, some solutions I proposed would be faster.

I wonder whether anyone is interested in matching individual bytes
in a file regarded as UTF-8 encoded. This seems weird.
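To make the two approaches under discussion concrete, here is a minimal
sketch in Python. It is not grep's code. The function names
(grep_with_policy, grep_bytes) and the sample data are hypothetical and
chosen only for illustration. The first function applies a fixed policy
to encoding errors before matching text; the second matches raw bytes
directly.

```python
import re

# Sample input (made up): line 1 is valid UTF-8, line 2 contains bytes
# that are not valid UTF-8 (0xff, 0xfe, and a Latin-1 style 0xe9).
DATA = b"caf\xc3\xa9 ok\n\xff\xfe garbage caf\xe9\n"


def grep_with_policy(data: bytes, pattern: str, errors: str = "replace"):
    """Decode each line with one error policy, then match it as text.

    `errors` is a standard Python codec error handler such as
    "replace" (substitute U+FFFD) or "ignore" (drop the bad bytes).
    """
    rx = re.compile(pattern)
    matches = []
    for raw in data.splitlines():
        text = raw.decode("utf-8", errors=errors)
        if rx.search(text):
            matches.append(text)
    return matches


def grep_bytes(data: bytes, pattern: bytes):
    """Match the pattern against raw bytes, with no decoding at all."""
    rx = re.compile(pattern)
    return [raw for raw in data.splitlines() if rx.search(raw)]


if __name__ == "__main__":
    print(grep_with_policy(DATA, "caf", errors="ignore"))
    print(grep_bytes(DATA, rb"caf\xe9"))
```

With an error policy, the pattern itself stays purely textual; with byte
matching, the pattern can name a specific invalid byte such as 0xe9, at
the cost of mixing two levels in one expression.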

> Speaking of hairy, why doesn't grep use PCRE_MULTILINE?  Using
> PCRE_MULTILINE shouldn't be that hard, and should boost performance
> quite a bit in typical usage.  Or am I being too optimistic here?

Perhaps for text files. For binary files, with the current solution,
I don't think this matters, as failures due to invalid bytes
typically occur several times per line.
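As a rough illustration of that last point (a sketch only, not how grep
or PCRE is implemented), the following Python function counts how many
times a UTF-8 decoder would fail on a single line if it were restarted
just past each invalid sequence. The name count_invalid_sequences is
hypothetical.

```python
def count_invalid_sequences(line: bytes) -> int:
    """Count decoder failures when restarting after each invalid sequence."""
    count = 0
    pos = 0
    while pos < len(line):
        try:
            # Try to decode the remainder of the line in one go.
            line[pos:].decode("utf-8")
            break  # the rest is valid: no more failures
        except UnicodeDecodeError as exc:
            count += 1
            # exc.end is relative to the slice; skip past the bad bytes.
            pos += exc.end
    return count


if __name__ == "__main__":
    print(count_invalid_sequences(b"plain ascii"))
    print(count_invalid_sequences(b"a\xffb\xffc"))
```

Each failure corresponds to one restart of the matcher, so a line of
binary data with many invalid bytes costs many restarts no matter how
the line boundaries are handled.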

-- 
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
