bug-grep

bug#18266: handling bytes not part of the charset, and other garbage


From: Paul Eggert
Subject: bug#18266: handling bytes not part of the charset, and other garbage
Date: Fri, 12 Sep 2014 09:16:45 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:
> Glibc regards it as ASCII:

You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work the way that I thought, though. Plus, in GNU regular expressions the pattern "." works the way that I thought with LC_ALL=C; my guess (without investigating this) is that this is because whoever wrote the regex code assumed the BSDish behavior. Arguably this is a glitch in the GNU regex code, in that for consistency "." should not match encoding errors in unibyte locales.

Here's a pair of test cases to illustrate the glitch:

$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
      0       0       0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
      1       0       2
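
To make the input in these test cases concrete, here is a small sketch that dumps it as hex bytes. '\200' is octal for the single byte 0x80, which in UTF-8 is a continuation byte and cannot start a character, so on its own it is an encoding error. The helper name dump_bytes is made up for illustration.

```shell
# Hypothetical helper: show stdin as a run of hex byte values.
dump_bytes() {
    # od prints space-separated hex bytes; tr strips the spaces and newline.
    od -An -tx1 | tr -d ' \n'
}

# The stray byte from the test cases, followed by the newline.
printf '\200\n' | dump_bytes
echo
```

The dump should show two bytes, 80 and 0a: the stray byte and the line terminator.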

> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.

And it still works, if by "." one means "match one character".

Unfortunately there is no POSIX regular expression that does what you're looking for (match either one character, or a single byte that is an encoding error). This is because POSIX says the behavior is undefined on encoding errors. The GNU syntax for regular expressions extends POSIX and does not dump core, but it still provides no way to write the pattern you're asking for, and the behavior is unspecified on encoding errors. Perhaps this should be improved by fixing the above-mentioned glitch and by providing a syntax extension for matching encoding errors, though we'd need a volunteer to do that.
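
If the goal of "grep ." is only to select non-empty lines, one possible workaround can be sketched as follows, assuming byte-level matching in the C locale is acceptable. It inverts a match for the empty line, so no part of the pattern has to match the undecodable byte itself. This is not something proposed in the thread, the helper name nonempty_lines is made up for illustration, and it has not been checked against grep implementations other than a recent GNU grep.

```shell
# Hypothetical helper: print every line that contains at least one byte.
# '^$' only ever has to match an empty line, so the pattern is never
# asked to match a byte that might be an encoding error.
nonempty_lines() {
    LC_ALL=C grep -v '^$'
}

# Three input lines: a stray 0x80 byte, an empty line, and "abc".
printf '\200\n\nabc\n' | nonempty_lines | wc -l
```

This does not provide the missing pattern for "one character or one encoding-error byte"; it only covers the non-empty-line idiom.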

The situation with libpcre is weirder: there's a pattern '\C' for matching a single byte even if it's an encoding error, but as far as I can tell there's no way to use regular expressions safely on arbitrary data containing encoding errors unless you're in unibyte mode (in which case '\C' provides no extra power). I.e., \C appears to be useless in any program for which undefined behavior is unacceptable.




