Re: inconsistency with counting characters vs bytes for multi-byte characters
From: arnold
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Thu, 31 Aug 2023 21:25:53 -0600
User-agent: Heirloom mailx 12.5 7/5/10

That's quite interesting. I get the same results on Linux.
I will investigate.
Arnold
Ed Morton <mortoneccc@comcast.net> wrote:
> Configuration Information [Automatically generated, do not change]:
> Machine: x86_64
> OS: cygwin
> Compiler: gcc
> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
> --param=ssp-buffer-size=4
> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
>
> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
>
> -DNDEBUG
> uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64
> 2023-08-17 17:02 UTC x86_64 Cygwin
> Machine Type: x86_64-pc-cygwin
>
> Gawk Version: 5.2.2
>
> Attestation 1:
> I have read
> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> Yes
>
> Attestation 2:
> I have not modified the sources before building gawk.
> True
>
> Description:
> Different string handling functions produce different results
> for multi-byte characters.
>
> Repeat-By:
> Without "-b":
>
> $ awk 'BEGIN{str="\342\200\257"; print length(str);
> match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
> 1
> 1
> 4
>
> Note that length() thinks that string is 1 character, the first
> call to match() agrees, but then the 2nd call to match() thinks it's 3
> characters (since RSTART tells us the "end of string" is at position 4).
>
> Now with "-b" ("Cause gawk to treat all input data as
> single-byte characters" per
> https://www.gnu.org/software/gawk/manual/gawk.html#Options):
>
> $ awk -b 'BEGIN{str="\342\200\257"; print length(str);
> match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
> 3
> 3
> 4
>
> Note that length() now thinks that string is 3 characters, the
> first call to match() agrees again, and then the 2nd call to match() now
> also agrees.
>
> Per the manual "in gawk, length(), substr(), split(), match()
> and the other string functions ... all work in terms of characters in
> the local character set, and not in terms of bytes." (from
> https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html)
>
> so I was expecting more consistent results between those 3 function
> calls and that they'd basically all always agree with length()'s results.
> It may just be "match()" that has an issue; I haven't noticed a problem
> with any other function, but I haven't been looking for one.
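
For readers following the thread, the byte/character distinction in the report can be reproduced outside gawk. The sketch below is Python, not gawk, and says nothing about gawk's internals; the helper name `counts` is hypothetical. It assumes the three octal escapes are meant as UTF-8, in which case they encode the single code point U+202F, and it computes analogues of length(), RLENGTH for /.+/, and RSTART for /$/ in both byte and character terms.

```python
import re

# The three bytes from the report, written with the same octal escapes.
raw = b"\342\200\257"

# Assumption: the bytes are UTF-8. Decoded, they form one code point (U+202F).
text = raw.decode("utf-8")


def counts(s):
    """Return (length, RLENGTH-like, RSTART-like) for a bytes or str value.

    RLENGTH-like is the length of the /.+/ match; RSTART-like is the
    1-based position at which /$/ matches, mirroring awk's indexing.
    """
    is_bytes = isinstance(s, bytes)
    dot_plus = re.compile(b".+" if is_bytes else ".+")
    end = re.compile(b"$" if is_bytes else "$")
    m1 = dot_plus.search(s)
    m2 = end.search(s)
    return (len(s), m1.end() - m1.start(), m2.start() + 1)


print("bytes:     ", counts(raw))
print("characters:", counts(text))
```

In byte terms this gives 3, 3, 4, matching the `-b` run quoted above. In character terms it gives 1, 1, 2, so a match() that counted characters throughout would presumably report RSTART as 2 rather than 4 in the run without `-b`.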