Re: [bug-gawk] Regex treatment of NUL characters within fields
From: Andrew J. Schorr
Subject: Re: [bug-gawk] Regex treatment of NUL characters within fields
Date: Mon, 30 Mar 2015 10:09:30 -0400
User-agent: Mutt/1.5.23 (2014-03-12)
On Sun, Mar 29, 2015 at 08:51:30PM +0100, Matt Wenham wrote:
> I have found a use case which has made me unsure as to how gawk 4.1.1
> treats NUL characters within fields and how they are parsed by the
> regex engine.
>
> I have a series of files which I am trying to process and validate
> using gawk. A small number of the files are corrupt and contain runs
> of NUL characters which I would like to reject as invalid.
>
> I tried the following code:
>
> BEGIN {
> FS="[#/]" #Split at hash or slash
> OFS = ":"
> }
>
> $10 ~ "^7$" {
> print NR, $10
> }
>
> This successfully matches the digit '7' followed by a run of NULs in
> the tenth field. However, using
>
> $10 ~ "^7\0+$"
>
> fails to match the same tenth field despite the explicitly specified
> NUL character. From everything I've read, this is unexpected
> behaviour.
>
> I am using GnuWin32 in this case. I asked about the issue on
> Stackoverflow, and another user has found that this behaviour does not
> occur with gawk 3.1.5 on CentOS 5, but does occur with gawk 4.1.1 on
> debian unstable.
>
> Is this expected behaviour? If so how? Is it possible to successfully
> parse NUL characters in 4.1.1?
I built a test input file as follows (attached):
echo -n 1/2/3/4/5/6/7/8/9/7 > /tmp/nuls
dd if=/dev/zero bs=10b count=1 >> /tmp/nuls
echo "" >> /tmp/nuls
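For anyone whose shell lacks "echo -n" or whose dd does not accept the "b" suffix, the same file can be built roughly as follows. This is only a sketch: the mktemp scratch path and the variable names are my own, not part of the original test. It assumes 10b means 10 blocks of 512 bytes, which agrees with the od offsets in the output further down.

```shell
# Sketch: rebuild the test input using only printf and dd with plain byte counts.
f=$(mktemp)                                         # hypothetical scratch file
printf '1/2/3/4/5/6/7/8/9/7' > "$f"                 # 19 bytes; field 10 starts with 7
dd if=/dev/zero bs=512 count=10 2>/dev/null >> "$f" # 5120 NUL bytes (same as bs=10b count=1)
printf '\n' >> "$f"                                 # record terminator
size=$(wc -c < "$f" | tr -d ' ')                    # strip padding some wc versions add
echo "size=$size"                                   # 19 + 5120 + 1 = 5140
rm -f "$f"
```

The size check is just a quick way to confirm that the NULs actually made it into the file before pointing gawk at it.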
It works fine for me in the gawk master and gawk-4.1-stable branches:
bash-4.2$ ./gawk 'BEGIN {FS="[#/]"; OFS = ":"} $10 ~ "^7$" {print NR, $10}' /tmp/nuls
bash-4.2$ ./gawk 'BEGIN {FS="[#/]"; OFS = ":"} $10 ~ "^7\0+$" {print NR, $10}' /tmp/nuls
1:7
bash-4.2$ ./gawk 'BEGIN {FS="[#/]"; OFS = ":"} $10 ~ "^7\0+$" {print NR, $10}' /tmp/nuls | od -c
0000000 1 : 7 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000020 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
0012000 \0 \0 \0 \n
0012004
I also downloaded and built gawk-4.1.1, and I can reproduce the problem in
that version. So it appears that the bug has already been fixed in the
development branches, but the fix is not yet in a released version.
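If the goal in the meantime is only to reject files that contain NULs, a check outside gawk avoids the regex engine entirely. The following is a rough sketch, not something I ran as part of the test above; has_nul is a hypothetical helper name, and the two scratch files exist only for the demonstration.

```shell
# has_nul FILE: exit 0 if FILE contains at least one NUL byte, non-zero otherwise.
# Hypothetical helper; it counts NULs with tr and wc, no regex matching involved.
has_nul() {
    n=$(tr -cd '\000' < "$1" | wc -c | tr -d ' ')   # delete everything except NUL, count the rest
    [ "$n" -gt 0 ]
}

# Demonstration on two small scratch files.
bad=$(mktemp); good=$(mktemp)
printf '7\000\000\n' > "$bad"    # a 7 followed by two NULs
printf '7\n' > "$good"           # clean record

for f in "$bad" "$good"; do
    if has_nul "$f"; then
        echo "reject"
    else
        echo "accept"
    fi
done
```

This flags a whole file rather than a single field, so it is coarser than the $10 test in the original program, but it does not depend on which gawk version is installed.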
Regards,
Andy
Attachment: nuls
Description: Binary data