bug-gawk

From: Stephane Chazelas
Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date: Sun, 23 Aug 2015 22:32:12 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

2015-08-22 22:33:52 +0200, Janis Papanagnou:
> The issue was observed using GNU awk 4.1.2 and confirmed to show the
> same behaviour in GNU awk 4.1.3.
> 
> With the attached program 'testprog' applied on the attached data 'testdata'
> I do *not* get the expected result of four lines containing "2007" each, but
> instead I get:
> 
>   2007
>   0703
>   2007
>   0071
> 
> The problem is caused/triggered by non-ASCII characters in 'testdata'.
> 
> Note: I can run 'testprog' with LC_ALL=C and the output is as expected.
> 
> My understanding is, though, that the implicit results from the match()
> function, RSTART and RLENGTH, should be consistently usable in substr(),
> independent of the locale setting.
[...]

Note that in a UTF-8 locale, that testdata is not valid text.
Those bytes don't form valid characters.

While the behaviour would be unspecified by POSIX, here I'd
agree gawk has some inconsistency in that those invalid byte
sequences are considered of length 0 for length(), index() and
substr() but of length 1 for match().
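
A rough way to look at this yourself is sketched below. The helper
name "probe" and the sample line are made up for illustration (they
are not the testprog/testdata from the report); \351 is a lone byte
that is not a valid UTF-8 character on its own.

```shell
# probe LOCALE: for each input line, print length($0) followed by the
# RSTART and RLENGTH that match() sets for the first run of digits.
# "probe" is a hypothetical helper, not part of gawk.
probe() {
    LC_ALL=$1 awk '{ match($0, /[0-9]+/); print length($0), RSTART, RLENGTH }'
}

# In the C locale every byte counts as one character, so the 9-byte
# line below gives: 9 6 4
printf 'caf\351 2007\n' | probe C
```

Running the same line through probe with the name of a UTF-8 locale
installed on your system (which name that is, is an assumption you'd
have to check with "locale -a") lets you compare. Going by the
description above, an affected gawk would report a shorter length
while RSTART stays where it is.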

To me, the best approach would be that they be of length 1 all
the time, and that they also match /./. (They don't in GNU tools
in general; they don't even match ? in GNU fnmatch(), though they
do in the GNU shell's ?.)

Here though, you should use a locale where that data is valid
text. If you don't know the encoding but don't care and know it's
single-byte, the C locale is a good option.
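
As a minimal sketch of that workaround (the function name and the
sample input are hypothetical, since the original testprog and
testdata are not reproduced here), forcing the C locale for the awk
call makes RSTART and RLENGTH line up with substr() again:

```shell
# first_digits: print the first run of digits found on each input line.
# LC_ALL=C makes awk treat every byte as one character, so the offsets
# from match() and the offsets used by substr() count the same units.
# "first_digits" is a hypothetical helper name.
first_digits() {
    LC_ALL=C awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }'
}

# \351 is a byte that is not valid UTF-8 on its own; prints: 2007
printf 'caf\351 2007\n' | first_digits
```

Setting LC_ALL only on the awk command keeps the rest of the script
in whatever locale the user has configured.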

-- 
Stephane



