Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters

From:	Aharon Robbins
Subject:	Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date:	Mon, 24 Aug 2015 18:30:48 +0300
User-agent:	Heirloom mailx 12.5 6/20/10

Hi.

> From: Janis Papanagnou <address@hidden>
> To: "address@hidden" <address@hidden>
> Date: Sat, 22 Aug 2015 22:33:52 +0200
> Subject: [bug-gawk] Problem with substr() after match() with non-ASCII
>       characters
>
> The issue was observed using GNU awk 4.1.2 and confirmed to show the
> same behaviour in GNU awk 4.1.3.
>
> With the attached program 'testprog' applied on the attached data 'testdata'
> I do *not* get the expected result of four lines containing "2007" each, but
> instead I get:
>
>   2007
>   0703
>   2007
>   0071
>
> The problem is caused/triggered by non-ASCII characters in 'testdata'.
>
> Note: I can run 'testprog' with LC_ALL=C and the output is as expected.

The problem is that you're feeding gawk invalid multibyte data for
the locale you're in. When gawk tries to figure out where, in terms of
characters, the match starts, it gets confused because of this invalid
data.

        $ LC_ALL=en_US.UTF-8 gawk --lint -f testprog testdata 
        2007
        gawk: testprog:2: (FILENAME=testdata FNR=2) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
        0703
        2007
        0071
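
A quick way to confirm that a file is not valid UTF-8, independent of
gawk, is to run it through iconv and look at the exit status. This is
only a sketch: the function name is_valid_utf8 is made up for
illustration, and it assumes an iconv utility that exits non-zero on an
invalid input sequence (the glibc one does).

```shell
# Hypothetical helper: succeeds only if standard input is valid UTF-8.
# Converting UTF-8 to UTF-8 forces iconv to decode every byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \344 is "a-umlaut" in ISO-8859-1; on its own it is not a valid UTF-8 sequence.
if printf 'M\344rz 2007\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```

For a real file, the same check would be "is_valid_utf8 < testdata".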

> My understanding is, though, that the implicit results from the match()
> function, RSTART and RLENGTH, should be consistently usable in substr(),
> independent of the locale setting.

*When the data is valid*, this is correct and things work as expected.
In your case, it's Garbage In, Garbage Out. :-(

If there's a way to set the locale to the Latin-N (ISO 8859) variant
that matches your data, then things will probably work OK. Otherwise,
you should use LC_ALL=C or gawk's -b (--characters-as-bytes) option.
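
As a minimal sketch of the LC_ALL=C route: the original 'testprog' and
'testdata' are not shown in this message, so the function below is a
hypothetical stand-in that extracts the first run of four digits from
each line. It calls plain "awk" for portability; with gawk, "gawk -b"
should have the same effect as LC_ALL=C here.

```shell
# Hypothetical stand-in for 'testprog': print the first four-digit run per line.
extract_year() {
    # In the C locale every byte counts as one character, so the RSTART and
    # RLENGTH set by match() agree with the positions substr() counts.
    LC_ALL=C awk 'match($0, /[0-9][0-9][0-9][0-9]/) {
        print substr($0, RSTART, RLENGTH)
    }'
}

# A Latin-1 encoded line; \344 is not valid UTF-8 by itself.
printf 'M\344rz 2007 end\n' | extract_year    # prints 2007
```

The trade-off is that in the C locale length(), substr() and friends
count bytes, not characters, which matters if the program also has to
handle genuine multibyte text.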

There really is no way around this; the underlying C library routines
depend on the value of the locale variables in order to interpret
the input data.

HTH,

Arnold
