Re: [bug-gawk] Problem with substr() after match() with non-ASCII charac
From: Aharon Robbins
Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date: Mon, 24 Aug 2015 21:47:36 +0300
User-agent: Heirloom mailx 12.5 6/20/10
> To: address@hidden
> From: Stephane Chazelas <address@hidden>
> Date: Sun, 23 Aug 2015 22:32:12 +0100
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII
> characters
>
> Note that in a UTF-8 locale, that testdata is not valid text.
> Those bytes don't form valid characters.
>
> While the behaviour would be unspecified by POSIX, here I'd
> agree gawk has some inconsistency in that those invalid byte
> sequences are considered of length 0 for length, index and
> substr but of length 1 for match.
I think it's the other way around: they have length 0 for match() and
length 1 for the others.
I think this patch, which is a bit of a hack, improves things.
It at least gets the "right" results for Janis's data and doesn't
break the test suite.
I will likely push this, or something like it with more comments.
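To illustrate the idea behind the patch (this is only a sketch, not gawk
code): when a byte does not start a valid UTF-8 sequence, treat that single
byte as one character and store U+FFFD for it, so that every routine counting
characters sees the same length. The names decode_one() and decode_lenient()
are hypothetical, and the decoder is hand-written here so the example does
not depend on the locale; the real str2wstr() uses the C library's multibyte
conversion instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu	/* U+FFFD, the value used in the patch */

/*
 * Hypothetical helper: decode one UTF-8 sequence starting at s, with len
 * bytes available. Returns the number of bytes consumed (1 to 4) and stores
 * the code point in *out, or returns 0 if the bytes are not a valid sequence.
 */
static size_t
decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
	size_t need, i;
	uint32_t cp, min;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		*out = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		need = 2; cp = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		need = 3; cp = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		need = 4; cp = s[0] & 0x07; min = 0x10000;
	} else
		return 0;	/* stray continuation byte or invalid lead byte */

	if (len < need)
		return 0;	/* truncated sequence */
	for (i = 1; i < need; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	/* reject overlong forms, surrogates, and values past U+10FFFF */
	if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	*out = cp;
	return need;
}

/*
 * Decode len bytes from src into dst, which must have room for len entries.
 * Each invalid byte becomes exactly one U+FFFD, mirroring the count = 1
 * choice in the patch. Returns the number of characters stored.
 */
static size_t
decode_lenient(const unsigned char *src, size_t len, uint32_t *dst)
{
	size_t n = 0;

	assert(src != NULL && dst != NULL);
	while (len > 0) {
		uint32_t cp;
		size_t used = decode_one(src, len, &cp);

		if (used == 0) {	/* invalid: one byte, one replacement char */
			cp = REPLACEMENT_CHAR;
			used = 1;
		}
		dst[n++] = cp;
		src += used;
		len -= used;
	}
	return n;
}
```

With this approach the bytes 'a', 0xC3, 0xA9, 0xFF, 'b' decode to four
characters, with the lone 0xFF counted as a single U+FFFD, rather than as
zero characters in one code path and one in another.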
Arnold
------------------------------------------------
diff --git a/node.c b/node.c
index 1741a13..b33a4f6 100644
--- a/node.c
+++ b/node.c
@@ -734,14 +734,20 @@ str2wstr(NODE *n, size_t **ptr)
warned = true;
lintwarn(_("Invalid multibyte data detected. There may be a mismatch between your data and your locale."));
}
+ if (using_utf8()) {
+ count = 1;
+ wc = 0xFFFD; /* unicode replacement character */
+ goto got_wc;
+ }
break;
case 0:
count = 1;
/* fall through */
default:
- *wsp++ = wc;
src_count -= count;
+ got_wc:
+ *wsp++ = wc;
while (count--) {
if (ptr != NULL)
(*ptr)[sp - n->stptr] = i;