bug#18266: handling bytes not part of the charset, and other garbage


From: Paul Eggert
Subject: bug#18266: handling bytes not part of the charset, and other garbage
Date: Thu, 11 Sep 2014 09:22:49 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0

Vincent Lefevre wrote:

> There's no reason that '.' matches something that doesn't belong to
> the charset in C locale, but doesn't match in a UTF-8 locale.

In the C locale on GNU/Linux, all byte values are members of the charset. That is why it's OK for '.' to accept that byte in the C locale but reject it in a UTF-8 locale.
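To illustrate the difference in charset membership, here is a minimal Python sketch. It is not how grep is implemented. The helper names are hypothetical, and Latin-1 is used only as a stand-in for a single-byte charset in which every byte value is a character, as in the C locale on GNU/Linux.

```python
def valid_in_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_in_single_byte_charset(data: bytes) -> bool:
    """Stand-in for the C locale: Latin-1 maps every byte value to a
    character, so decoding can never fail."""
    data.decode("latin-1")
    return True


# "café" encoded as ISO-8859-1 ends in the single byte 0xE9.
sample = "café".encode("iso-8859-1")

print(valid_in_single_byte_charset(sample))  # every byte is a character
print(valid_in_utf8(sample))                 # 0xE9 alone is not valid UTF-8
```

Under these assumptions the byte 0xE9 is a character in the single-byte charset but an encoding error in UTF-8, which is why '.' can accept it in one locale and reject it in the other.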

> It's annoying that now in UTF-8, one can no longer match ISO-8859-1 text

This has been true for quite some time in 'grep', at least with the standard matchers. It may not have been true for -P but that relied on undefined behavior that could crash grep, and we can't have that.

It would make sense to add a notation to mean "match any character or invalid byte", as an extension. That'd take some work, though.
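As a rough sketch of what "match any character or invalid byte" could mean, the following Python example emulates the idea outside grep. It is not a proposed grep syntax or implementation. The function name is hypothetical, and the approach relies on Python's "surrogateescape" error handler, which turns each undecodable byte into a lone surrogate code point so that a regex '.' can then match it.

```python
import re


def chars_or_invalid_bytes(data: bytes) -> list:
    """Split data into units that are each either one valid UTF-8
    character or one invalid byte (shown as a lone surrogate)."""
    # Invalid bytes 0x80..0xFF become code points U+DC80..U+DCFF.
    text = data.decode("utf-8", errors="surrogateescape")
    # With DOTALL, '.' matches every code point, including newlines
    # and the escaped invalid bytes.
    return re.findall(r".", text, flags=re.DOTALL)


units = chars_or_invalid_bytes(b"caf\xe9")  # ISO-8859-1 bytes read as UTF-8
print(len(units))
```

Here the ISO-8859-1 byte 0xE9 comes back as its own unit rather than causing the match to fail, which is the behavior the extension would provide.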
