bug#18266: handling bytes not part of the charset, and other garbage

bug-grep

From:	Paul Eggert
Subject:	bug#18266: handling bytes not part of the charset, and other garbage
Date:	Fri, 12 Sep 2014 19:08:38 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:

But both of these solutions have the drawback of working only in
UTF-8 locales.

Not at all; '[[:error:]]' would match a single-byte encoding error in the current locale. The tz database is interested in UTF-8 so it sets the LC_ALL environment variable to a UTF-8 locale, but that setting shouldn't be required in general.

Also, the tz database needs grep patterns that iconv doesn't support. For example, one rule is that commentary (which starts with #) can contain UTF-8 characters, but the ordinary data (before the #) is limited to a smaller set. This is captured by the command:


grep -Env '^[ordinarycharset]*(#.*)?$'

where 'ordinarycharset' is the set of ASCII characters in ordinary tzdata. Here it's useful that '.' does not match encoding errors on GNU/Linux.
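
As a rough illustration only (this is not the tz tooling, and not how grep is implemented), the same rule can be sketched in Python. The character set in ORDINARY is a hypothetical stand-in, since the real 'ordinarycharset' is not spelled out here, and the function name bad_lines is likewise made up for the example. Strict UTF-8 decoding plays the role of '.' refusing to match encoding errors: a line with an invalid byte is reported even when the byte sits inside a comment.

```python
import re

# Hypothetical stand-in for 'ordinarycharset'; the real tz set is not given above.
ORDINARY = r"A-Za-z0-9 \t+:/_.-"

# Ordinary data, optionally followed by a comment that may hold any valid character.
LINE_RE = re.compile(rf"[{ORDINARY}]*(#.*)?")


def bad_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that break the rule,
    roughly what `grep -Env` would list in a UTF-8 locale."""
    bad = []
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            # Strict decoding: an encoding error raises instead of matching.
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
            continue
        if not LINE_RE.fullmatch(text):
            bad.append(number)
    return bad


sample = (
    b"Zone Europe/Paris 0:09:21 - LMT\n"   # ordinary data only
    b"# caf\xc3\xa9 comment\n"             # UTF-8 inside a comment: allowed
    b"Zone caf\xc3\xa9\n"                  # UTF-8 in ordinary data: rejected
    b"# bad \xff byte\n"                   # encoding error in a comment: rejected
    b"Rule X # ok \xe2\x82\xac\n"          # data followed by a UTF-8 comment: allowed
)
print(bad_lines(sample))
```

One difference from the grep command worth keeping in mind: the Python version does not depend on the current locale, whereas grep's treatment of '.' and of encoding errors does.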
