bug#18266: handling bytes not part of the charset, and other garbage

bug-grep

From:	Paul Eggert
Subject:	bug#18266: handling bytes not part of the charset, and other garbage
Date:	Fri, 12 Sep 2014 17:57:39 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:

> I wonder whether anyone is interested in matching individual bytes
> in a file regarded as UTF-8 encoded. This seems weird.

It's not weird at all. For example, suppose we invent the notation [[:error:]] to match encoding errors. Then the pattern '[[:error:]]' would match all encoding errors in a file, which could well be a useful thing.

Currently, for example, the tz package <http://www.iana.org/time-zones> has a Make rule 'check_character_set' that verifies that the source files are all properly encoded. It executes this shell command:


! grep -nv '^.*$' file names

This relies on GNU grep's behavior that "." does not match an encoding error. But it's a command that is not obvious. It'd be simpler and clearer to write this:


! grep -n '[[:error:]]' file names

if such a feature were available.
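
To make the existing idiom concrete, here is a minimal sketch of how the 'grep -nv' check might be wrapped in a shell function. It is not the tz Makefile's actual rule text. It assumes GNU grep and an installed UTF-8 locale; the variable UTF8_LOCALE and its default value are hypothetical names chosen for this illustration. It uses only the existing command, since '[[:error:]]' is a proposed notation, not an available feature.

```shell
#!/bin/sh
# Sketch only: wraps the "grep -nv '^.*$'" idiom described above.
# Assumptions: GNU grep; UTF8_LOCALE (a hypothetical variable) names an
# installed UTF-8 locale. If the locale is missing, the C library falls
# back to the C locale and the check may pass everything.

check_character_set() {
    # -v selects lines that do NOT match '^.*$'. In a UTF-8 locale,
    # GNU grep's "." does not match an encoding error, so only lines
    # containing invalid bytes are selected.
    # grep exits 0 when it selects a line; "!" inverts that, so the
    # function succeeds only when no such line is found.
    ! LC_ALL=${UTF8_LOCALE:-en_US.UTF-8} grep -nv '^.*$' "$@"
}
```

Calling 'check_character_set file names' should then succeed for properly encoded files and fail, reporting the offending file, when an encoding error is present. Note that "!" also turns a grep failure (such as an unreadable file) into success, which is one more reason a dedicated pattern would be clearer.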
