Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug

Wolfgang,

I fully agree that implementing something like \x{…} (i.e., supporting a variable number of hex. digits) so you can express the full range of Unicode codepoints is the right thing to do.

It would avoid all the problems you state.

As long as we don't have that, you're stuck with the workaround, which is - as workarounds often are - suboptimal, for the reasons you state.

However, as is inherent to workarounds, they offer the only way to get the job done in the absence of proper support.

What we can hopefully agree on is this:

- Gawk shouldn't crash with something like gawk '{ gsub(/[\x80-\xFF]/, ""); print }' <<<'hätă' (it sounds like this may have been fixed in the current codebase already).
- In a UTF-8 locale, \xhh escapes should be matched against Unicode codepoints, not byte values.
- The limitation that \xhh escapes can only match characters in the range 0x0 - 0xff should be documented.
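
To make the codepoint-versus-byte distinction concrete, here is a minimal shell sketch. It assumes the script itself is saved as UTF-8; utf8_hex is a hypothetical helper name, and tr is used only as a stand-in for byte-wise matching, not as a description of what gawk does internally.

```shell
# Print the UTF-8 bytes of a string as space-separated hex pairs.
# (utf8_hex is a hypothetical helper, not part of gawk or any standard tool.)
utf8_hex() {
  hex=$(printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ')
  hex=${hex# }   # drop the leading space left by od
  hex=${hex% }   # drop the trailing space left by tr
  printf '%s\n' "$hex"
}

utf8_hex 'ä'   # codepoint U+00E4 (inside 0x80-0xFF)  -> bytes: c3 a4
utf8_hex 'ă'   # codepoint U+0103 (outside 0x80-0xFF) -> bytes: c4 83

# Byte-wise deletion of everything in 0x80-0xFF removes both characters,
# because all four bytes fall in that range:
printf 'hätă\n' | LC_ALL=C tr -d '\200-\377'   # -> ht
```

Matching against codepoints instead would remove only ä (U+00E4) and leave ă (U+0103) in place, which is the behavior proposed above for \xhh in a UTF-8 locale.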

Personally, I would also want the documentation to at least tell me that using actual Unicode characters is a potential workaround, if one's environment supports it.

Best,

Michael

On Feb 7, 2016, at 11:14 AM, Wolfgang Laun <address@hidden> wrote:

On 7 February 2016 at 16:54, Michael Klement <address@hidden> wrote:

Generally, it sounds like the right thing to do is:

- in a UTF-8 locale, *always* deal with *characters* (Unicode codepoints), not bytes
- specifically, when encountering \xhh, compare it to the *Unicode codepoint* of the character at hand

Always dealing with characters makes sense to me, especially given that you can mix Unicode characters and \xhh escapes in a single bracket expression.

Thus, given that \xff is the max. codepoint value that can currently be expressed, which doesn't allow matching the full range of Unicode characters, I suggest the following:
At https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions:
- document this limitation
- recommend the workaround of using actual characters rather than codepoint escapes as the range endpoints.
So if an awk program file requires such (UTF-8 encoded) characters: is your editor capable of handling the characters you need, and are your keyboard and your skill with it sufficient for typing them? Also consider what happens if a not-so-capable editor handles such a file. The various escapes for typing a character may be the only way to guarantee portability without being stymied by the ubiquitous � - haven't we all seen it?
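
As a rough illustration of keeping a script ASCII-only, the sketch below builds the character from octal byte escapes with printf and hands it to awk through -v. The helper name strip_char is hypothetical, and the approach assumes UTF-8 input; it covers single literal characters only and does not help with range endpoints in a bracket expression.

```shell
# Build "a with diaeresis" (U+00E4) from its UTF-8 bytes 0xC3 0xA4,
# written as octal escapes so this file contains no non-ASCII characters.
a_umlaut=$(printf '\303\244')

# strip_char CHAR TEXT: remove every occurrence of CHAR from TEXT.
# (Hypothetical helper; CHAR is used as a dynamic regex, so it should
# not contain regex metacharacters.)
strip_char() {
  printf '%s\n' "$2" | awk -v ch="$1" '{ gsub(ch, ""); print }'
}

# The sample input is also spelled with escapes: h, U+00E4, t, U+0103.
strip_char "$a_umlaut" "$(printf 'h\303\244t\304\203')"
```

On a UTF-8 terminal the last command should print htă: only the ä is removed, and no editor ever has to display or preserve a non-ASCII byte in the script source.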

Best
Wolfgang

From:	Michael Klement
Subject:	Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date:	Sun, 7 Feb 2016 11:31:53 -0500