Re: [bug-gawk] Problem with substr() after match() with non-ASCII charac

From:	Aharon Robbins
Subject:	Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date:	Mon, 24 Aug 2015 21:47:36 +0300
User-agent:	Heirloom mailx 12.5 6/20/10

> To: address@hidden
> From: Stephane Chazelas <address@hidden>
> Date: Sun, 23 Aug 2015 22:32:12 +0100
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII
>       characters
>
> Note that in a UTF-8 locale, that testdata is not valid text.
> Those bytes don't form valid characters.
>
> While the behaviour would be unspecified by POSIX, here I'd
> agree gawk has some inconsistency in that those invalid byte
> sequences are considered of length 0 for length, index and
> substr but of length 1 for match.

I think it's the other way around: they're 0 for match and 1 for
the others.

I think this patch, which is a bit of a hack, improves things.
It at least gets the "right" results for Janis's data and doesn't
break the test suite.

I will likely push this, or something like it with more comments.

Arnold
------------------------------------------------
diff --git a/node.c b/node.c
index 1741a13..b33a4f6 100644
--- a/node.c
+++ b/node.c
@@ -734,14 +734,20 @@ str2wstr(NODE *n, size_t **ptr)
                                warned = true;
                                lintwarn(_("Invalid multibyte data detected. There may be a mismatch between your data and your locale."));
                        }
+                       if (using_utf8()) {
+                               count = 1;
+                               wc = 0xFFFD;    /* unicode replacement character */
+                               goto got_wc;
+                       }
                        break;
 
                case 0:
                        count = 1;
                        /* fall through */
                default:
-                       *wsp++ = wc;
                        src_count -= count;
+               got_wc:
+                       *wsp++ = wc;
                        while (count--)  {
                                if (ptr != NULL)
                                        (*ptr)[sp - n->stptr] = i;
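
To illustrate the rule the patch introduces (in a UTF-8 locale, an invalid
byte is consumed alone and stored as U+FFFD), here is a minimal standalone
sketch. It is not gawk code: decode_one() and count_chars() are hypothetical
names, and the sketch decodes UTF-8 by hand instead of going through the
locale, so it runs the same way regardless of the environment.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu   /* same value the patch assigns to wc */

/*
 * Hypothetical helper, not part of gawk: decode one character from the
 * len bytes at s (len must be > 0). Returns the number of bytes consumed.
 * A byte that cannot start or complete a valid UTF-8 sequence is consumed
 * alone and reported as U+FFFD, mirroring "count = 1" in the patch.
 */
size_t decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t c, min;

    if (s[0] < 0x80) {              /* plain ASCII byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { need = 1; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 2; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 3; c = s[0] & 0x07; min = 0x10000; }
    else goto invalid;              /* stray continuation or illegal lead byte */

    if (len < need + 1)
        goto invalid;               /* sequence truncated by end of string */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto invalid;           /* expected a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, out-of-range values, and surrogates */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        goto invalid;
    *cp = c;
    return need + 1;

invalid:
    *cp = REPLACEMENT_CHAR;
    return 1;
}

/* Count characters in str[0..len-1]; each invalid byte counts as one. */
size_t count_chars(const char *str, size_t len)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t n = 0;
    uint32_t cp;

    while (len > 0) {
        size_t used = decode_one(s, len, &cp);
        s += used;
        len -= used;
        n++;
    }
    return n;
}
```

With this rule, the three bytes 'a', 0xE9, 'b' (a Latin-1 "é" seen in a
UTF-8 locale) count as three characters, the same count as the four-byte
valid UTF-8 encoding of "aéb". Every function that works on the converted
string then sees the same length and the same offsets.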

Current Thread

[bug-gawk] Problem with substr() after match() with non-ASCII characters, Janis Papanagnou, 2015/08/22
- Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters, Stephane Chazelas, 2015/08/24
  - Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters, Aharon Robbins <=
    - Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters, Hermann Peifer, 2015/08/24
    - Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters, Aharon Robbins, 2015/08/31
- Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters, Aharon Robbins, 2015/08/24
- Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters, Aharon Robbins, 2015/08/31
