Re: [bug-gawk] Small printf issue
From: arnold
Subject: Re: [bug-gawk] Small printf issue
Date: Fri, 16 Jun 2017 03:57:37 -0600
User-agent: Heirloom mailx 12.4 7/29/08
Hi.
Thanks for the report and test case. The patch is below. I will
push it shortly, once I've added the test case into the test suite.
Thanks,
Arnold
Hermann Peifer <address@hidden> wrote:
> Hi,
>
>
> I noticed a small printf issue in a case where $1 was a Unicode bullet
> character (U+2022), UTF-8-encoded as a 3-byte sequence: 0xE2 0x80 0xA2.
> The first character of $2 somehow gets repeated in the output. Below is
> a small example. I'm using gawk/master on macOS, my locale is
> en_US.UTF-8. The issue disappears when using: LC_ALL=C gawk '...'
>
>
> Hermann
>
>
> # Umlauts (2 bytes only) seem to be OK
>
> $ printf "\xC3\x96 ABC\n" | gawk '{printf "%-5s%s\n", $1, $2}'
> Ö ABC
>
>
> # 1 repeated character where $1 is a 3-byte bullet character (U+2022)
>
> $ printf "\xE2\x80\xA2 ABC\n" | gawk '{printf "%-5s%s\n", $1, $2}'
> • A ABC
>
>
> # 2 repeated characters where $1 is a 4-byte air symbol (U+1F701)
>
> $ printf "\xF0\x9F\x9C\x81 ABC\n" | gawk '{printf "%-5s%s\n", $1, $2}'
> 🜁 AB ABC
>
>
>
-----------------------------------------------
diff --git a/builtin.c b/builtin.c
index 87d9dcb..724be05 100644
--- a/builtin.c
+++ b/builtin.c
@@ -4152,12 +4152,13 @@ mbc_char_count(const char *ptr, size_t numbytes)
 	if (mb_len <= 0)
 		return numbytes;	/* no valid m.b. char */
 
-	for (; numbytes > 0; numbytes--) {
+	while (numbytes > 0) {
 		mb_len = mbrlen(ptr, numbytes, &cur_state);
 		if (mb_len <= 0)
 			break;
 		sum++;
 		ptr += mb_len;
+		numbytes -= mb_len;
 	}
 
 	return sum;
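
For readers who want to see the effect of the patch outside of gawk, here
is a minimal stand-alone sketch of the two loop variants. The function
names count_chars_old, count_chars_fixed and select_utf8_locale are
hypothetical and not part of gawk, the locale names tried are an
assumption about what the system provides, and the sketch leaves out the
initial mbrlen() probe and the single-byte-locale shortcut that the real
mbc_char_count() has.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Nonzero if mbrlen() reported an error, an incomplete sequence, or NUL. */
static int
mb_stop(size_t mb_len)
{
	return mb_len == (size_t) -1 || mb_len == (size_t) -2 || mb_len == 0;
}

/* Pre-patch loop: the byte budget shrinks by 1 per character found. */
static size_t
count_chars_old(const char *ptr, size_t numbytes)
{
	mbstate_t cur_state;
	size_t sum = 0;

	memset(&cur_state, 0, sizeof(cur_state));
	for (; numbytes > 0; numbytes--) {
		size_t mb_len = mbrlen(ptr, numbytes, &cur_state);

		if (mb_stop(mb_len))
			break;
		assert(mb_len <= numbytes);	/* mbrlen never exceeds its limit */
		sum++;
		ptr += mb_len;	/* may step past the bytes we were asked about */
	}
	return sum;
}

/* Patched loop: the byte budget shrinks by the character's full length. */
static size_t
count_chars_fixed(const char *ptr, size_t numbytes)
{
	mbstate_t cur_state;
	size_t sum = 0;

	memset(&cur_state, 0, sizeof(cur_state));
	while (numbytes > 0) {
		size_t mb_len = mbrlen(ptr, numbytes, &cur_state);

		if (mb_stop(mb_len))
			break;
		assert(mb_len <= numbytes);
		sum++;
		ptr += mb_len;
		numbytes -= mb_len;
	}
	return sum;
}

/* Try a few common UTF-8 locale names; returns 1 if one was accepted. */
static int
select_utf8_locale(void)
{
	static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (setlocale(LC_ALL, names[i]) != NULL)
			return 1;
	return 0;
}
```

In a UTF-8 locale, calling both functions on the record "\xE2\x80\xA2 ABC"
with numbytes set to 3 (the length of $1) should show the difference: the
old loop keeps going after the bullet and also counts the following space
and "A", while the fixed loop stops once the three bytes are consumed.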