From: Paul Eggert
Subject: bug#18266: handling bytes not part of the charset, and other garbage
Date: Thu, 11 Sep 2014 18:16:29 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:
> the C locale corresponds to ANSI_X3.4-1968,
No it doesn't, at least not on any current platform I'm aware of. And POSIX does not require that. POSIX even allows the C locale to be multibyte, e.g., UTF-8.
> I would say that this should be the same for invalid byte sequences in a UTF-8 locale.
One *could* design an encoding with that property, but it wouldn't be UTF-8; it would be something else. I don't know of any C library that does that to UTF-8. There are good arguments against doing it, e.g., one loses the property that one can concatenate character strings by concatenating their byte representations.
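A minimal sketch of the concatenation problem, assuming Python's "surrogateescape" error handler as a stand-in for the hypothetical encoding in which each invalid byte counts as a character of its own. The name decode_lenient is illustrative only and does not come from grep or any C library.

```python
def decode_lenient(data: bytes) -> str:
    # Assumption: "surrogateescape" approximates the proposed scheme by
    # mapping every undecodable byte to one stand-in character.
    return data.decode("utf-8", errors="surrogateescape")


left = b"\xc3"   # lead byte of a two-byte sequence, invalid on its own
right = b"\xa9"  # continuation byte, invalid on its own

# Decode each piece, then concatenate the character strings.
separate = decode_lenient(left) + decode_lenient(right)

# Concatenate the bytes first, then decode: the pair is now valid UTF-8.
joined = decode_lenient(left + right)

print(len(separate))       # 2 stand-in characters
print(len(joined))         # 1 character (U+00E9)
print(separate == joined)  # False

# With valid UTF-8 on both sides, the two routes agree.
a = "na".encode("utf-8")
b = "\u00efve".encode("utf-8")
print(a.decode("utf-8") + b.decode("utf-8") == (a + b).decode("utf-8"))  # True
```

Under the lenient scheme, decoding the pieces and then joining gives a different character string from joining the bytes and then decoding, which is the property that would be lost.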
Anyway I'm afraid we may be going off the deep end here. After all, grep can't impose its coding system design onto the operating system; it's more the other way around.