From: Paul Eggert
Subject: bug#18266: handling bytes not part of the charset, and other garbage
Date: Fri, 12 Sep 2014 19:08:38 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
Vincent Lefevre wrote:
> But both of these solutions have the drawback of working only in UTF-8 locales.
Not at all; '[[:error:]]' would match a single-byte encoding error in the current locale. The tz database is interested in UTF-8 so it sets the LC_ALL environment variable to a UTF-8 locale, but that setting shouldn't be required in general.
Also, the tz database needs grep patterns that iconv doesn't support. For example, one rule is that commentary (which starts with #) can contain UTF-8 characters, but the ordinary data (before the #) is limited to a smaller set. This is captured by the command:
    grep -Env '^[ordinarycharset]*(#.*)?$'

where 'ordinarycharset' is the set of ASCII characters in ordinary tz data. Here it's useful that '.' does not match encoding errors on GNU/Linux.
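To make the rule concrete, here is a minimal sketch of that check run on a few invented lines. The contents of 'ordinary' are a hypothetical stand-in for 'ordinarycharset' (the real set is not spelled out here), the sample data is made up, and the locale name C.UTF-8 is an assumption about what is installed on the system.

```shell
# Hypothetical stand-in for 'ordinarycharset'; the real set is not given here.
ordinary='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 /_+-'

# Invented sample: line 1 has non-ASCII text only inside the '#' commentary,
# line 2 has a non-ASCII character in the ordinary data, line 3 is all comment.
sample='Zone Europe/Paris 0 - LMT # Île-de-France
Zone Europe/Zürich 0 - LMT
# plain comment'

# -v inverts the match, so only lines that break the rule are printed;
# -n prefixes each printed line with its line number.
# C.UTF-8 is an assumed locale name; substitute any installed UTF-8 locale.
flagged=$(printf '%s\n' "$sample" | LC_ALL=C.UTF-8 grep -Env "^[$ordinary]*(#.*)?\$")
printf '%s\n' "$flagged"
```

With this sample only line 2 should be reported, because its non-ASCII character comes before any '#'. Line 1 passes because the non-ASCII text sits in the commentary, where '.' accepts any validly encoded character.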