bug#18266: handling bytes not part of the charset, and other garbage

bug-grep

From:	Paul Eggert
Subject:	bug#18266: handling bytes not part of the charset, and other garbage
Date:	Fri, 12 Sep 2014 09:16:45 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:

> Glibc regards it as ASCII:

You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work the way that I thought, though. Plus, in GNU regular expressions the pattern "." works the way that I thought with LC_ALL=C; my guess (without investigating this) is that this is because whoever wrote the regex code assumed the BSDish behavior. Arguably this is a glitch in the GNU regex code, in that for consistency "." should not match encoding errors in unibyte locales.


Here's a pair of test cases to illustrate the glitch:

$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
      0       0       0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
      1       0       2

> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.


And it still works, if by "." one means "match one character".

Unfortunately there is no POSIX regular expression that does what you're looking for (match either one character, or a single byte that is an encoding error). This is because POSIX says the behavior is undefined on encoding errors. The GNU syntax for regular expressions extends POSIX and does not dump core, but it still provides no way to write the pattern you're asking for, and the behavior is unspecified on encoding errors. Perhaps this should be improved by fixing the above-mentioned glitch and by providing a syntax extension for matching encoding errors, though we'd need a volunteer to do that.
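
For the common "grep ." use (dropping empty lines), one workaround follows from the second test case above: run the match in the C locale, where glibc treats every byte as a character. This is a minimal sketch, assuming GNU grep on a glibc system; nonempty_lines is a hypothetical helper name, not part of grep, and as noted earlier FreeBSD, Solaris, and AIX may behave differently:

```shell
# Hypothetical helper: print the non-empty lines of the input, byte-wise.
# Assumes GNU grep with glibc, where "." matches any byte under LC_ALL=C.
nonempty_lines() {
    # -a treats the input as text, so grep prints the matching lines
    # rather than a "Binary file ... matches" notice.
    LC_ALL=C grep -a '.' "$@"
}

# Example: three lines, one of them empty and one a lone \200 byte.
# Under the assumptions above, this should count 2 lines.
printf '\200\n\nabc\n' | nonempty_lines | wc -l
```

This matches bytes, not characters, so it is not a substitute for the missing syntax extension; it only restores the old "any non-empty line" behavior.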

The situation with libpcre is weirder: there's a pattern '\C' for matching a single byte even if it's an encoding error, but as far as I can tell there's no way to use regular expressions safely on arbitrary data containing encoding errors unless you're in unibyte mode (in which case '\C' provides no extra power). I.e., \C appears to be useless in any program for which undefined behavior is unacceptable.
