Re: [bug-gawk] characters-as-bytes switch
From: Aharon Robbins
Subject: Re: [bug-gawk] characters-as-bytes switch
Date: Mon, 18 Jun 2012 23:15:48 +0300
User-agent: Heirloom mailx 12.4 7/29/08

Hi. Thanks for the bug report. I have reproduced this here under GNU/Linux.
I will work on a fix.
Thanks,
Arnold
> From: "SP" <address@hidden>
> To: <address@hidden>
> Date: Sun, 17 Jun 2012 00:55:16 +0200
> Subject: [bug-gawk] characters-as-bytes switch
>
> Hello,
>
> Sorry for my approximate English, I'm French ;-)
>
> Well, I've just installed the latest Cygwin binaries under Windows 7 in
> order to have a gawk with the "characters-as-bytes" switch. Unfortunately,
> this switch doesn't seem to act correctly within patterns. Here is a full
> log demonstrating the problem. Note that \xE2\x80\x93 is a valid UTF-8
> character, whereas \xE2\x80\x42 is not, and note the period in the gensub
> pattern.
>
> ==========
>
> C:\>ver
> Microsoft Windows [Version 6.1.7601]
>
> C:\>gawk.exe --version
> GNU Awk 4.0.1
> ...
> blah blah
>
> C:\>gawk.exe 'BEGIN { print "\xE2\x80\x93"; exit }' | gawk.exe --characters-as-bytes "{ print gensub(/\xE2\x80./,""ZZZ"",""g"",$0)}" | od -c -t x1
>
> 0000000 342 200 223 \n
> e2 80 93 0a
> 0000004
>
> C:\>gawk.exe 'BEGIN { print "\xE2\x80\x42"; exit }' | gawk.exe --characters-as-bytes "{ print gensub(/\xE2\x80./,""ZZZ"",""g"",$0)}" | od -c -t x1
>
> 0000000 Z Z Z \n
> 5a 5a 5a 0a
> 0000004
>
> ==========
>
> If I inject a real UTF-8 character, /\xE2\x80./ doesn't match despite
> --characters-as-bytes. And if I inject an invalid UTF-8 sequence,
> /\xE2\x80./ matches.
>
> Thanks in advance for your help in working around and/or fixing this
> problem!
>
> Stéphane
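
The byte-wise matching that the report expects can be sketched in a POSIX shell rather than in the cmd.exe session quoted above. This is only a sketch under stated assumptions: it assumes that forcing the C locale with LC_ALL=C makes awk match byte by byte, which this thread does not confirm as a workaround for the Cygwin build. The function name replace_e2_80 is hypothetical, and the sketch uses POSIX gsub with octal escapes in place of gawk's gensub and \x escapes so that any awk can run it.

```shell
#!/bin/sh
# Hypothetical helper: replace each 3-byte sequence E2 80 <any byte> with ZZZ.
# Assumption: LC_ALL=C disables multibyte handling, so "." matches one byte.
replace_e2_80() {
    # \342 \200 are the octal spellings of the bytes 0xE2 0x80.
    LC_ALL=C awk '{ gsub(/\342\200./, "ZZZ"); print }'
}

# Valid UTF-8 input: E2 80 93 (the first case in the report).
printf '\342\200\223\n' | replace_e2_80

# Invalid UTF-8 input: E2 80 42 (the second case in the report).
printf '\342\200\102\n' | replace_e2_80
```

With byte semantics both inputs should come out as ZZZ, because the period matches \x93 and \x42 alike. Piping either command through od -c -t x1, as in the report, shows the resulting bytes.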