
bug#18266: handling bytes not part of the charset, and other garbage


From: Paul Eggert
Subject: bug#18266: handling bytes not part of the charset, and other garbage
Date: Thu, 11 Sep 2014 18:16:29 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:
> the C locale corresponds to ANSI_X3.4-1968,

No, it doesn't, at least not on any current platform I'm aware of. And POSIX does not require that. POSIX even allows the C locale to be multibyte, e.g., UTF-8.
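
For readers who want to see what their own platform does, here is a minimal sketch that probes the C locale. It uses only the standard calls setlocale, nl_langinfo, mbrtowc and MB_CUR_MAX; the helper names probe_byte and describe_c_locale are made up for this illustration. No expected output is shown, because the reported codeset name and the treatment of bytes above 0x7F differ between C libraries, and the codeset name alone does not settle how such bytes are decoded.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try to decode the single byte b in the current
   locale.  Returns 1 if it decodes as one character, 0 if it is the
   null byte, -1 if the library rejects it as an encoding error, and
   -2 if the library treats it as an incomplete multibyte sequence. */
static int probe_byte(unsigned char b)
{
    mbstate_t st;
    wchar_t wc;
    char buf[1];
    size_t r;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    buf[0] = (char) b;
    r = mbrtowc(&wc, buf, 1, &st);
    if (r == (size_t) -1)
        return -1;
    if (r == (size_t) -2)
        return -2;
    return (int) r;
}

/* Hypothetical helper: print how the C locale looks on this platform. */
static void describe_c_locale(void)
{
    setlocale(LC_ALL, "C");
    printf("codeset:    %s\n", nl_langinfo(CODESET));
    printf("MB_CUR_MAX: %zu\n", (size_t) MB_CUR_MAX);
    printf("byte 0x41:  %d\n", probe_byte(0x41));
    printf("byte 0x80:  %d\n", probe_byte(0x80));
}
```

Calling describe_c_locale() from main prints the four values for the platform at hand.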

> I would say that this should be the same for invalid
> byte sequences in a UTF-8 locale.

One *could* design an encoding with that property, but it wouldn't be UTF-8; it would be something else. I don't know of any C library that does that to UTF-8. There are good arguments against doing it, e.g., one loses the property that one can concatenate character strings by concatenating their byte representations.
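
To make the concatenation argument concrete, here is a small sketch of one such hypothetical encoding: well-formed UTF-8 sequences count as one character each, and any byte that is not part of a well-formed sequence also counts as one character. The functions seq_len and count_chars are invented for this illustration, and the well-formedness check is deliberately simplified (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Length of the UTF-8 sequence starting at s (n bytes available),
   or 0 if the lead byte is invalid, the sequence is truncated, or a
   continuation byte is missing.  Simplified: overlong forms and
   surrogates are not rejected. */
static size_t seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xE0) == 0xC0)
        len = 2;
    else if ((s[0] & 0xF0) == 0xE0)
        len = 3;
    else if ((s[0] & 0xF8) == 0xF0)
        len = 4;
    else
        return 0;                   /* stray continuation or invalid lead byte */
    if (len > n)
        return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
    return len;
}

/* Count characters under the hypothetical rule: every well-formed
   sequence is one character, and every other byte is one character
   on its own. */
static size_t count_chars(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;

    while (i < n) {
        size_t len = seq_len(s + i, n - i);
        i += len ? len : 1;
        count++;
    }
    return count;
}
```

Under this rule the one-byte string 0xC3 is one character and the one-byte string 0xA9 is one character, but their byte concatenation 0xC3 0xA9 is the well-formed encoding of U+00E9 and therefore also one character, not two. Concatenating bytes no longer corresponds to concatenating character strings.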

Anyway I'm afraid we may be going off the deep end here. After all, grep can't impose its coding system design onto the operating system; it's more the other way around.




