bug-gawk

Re: [bug-gawk] Regex treatment of NUL characters within fields


From: Andrew J. Schorr
Subject: Re: [bug-gawk] Regex treatment of NUL characters within fields
Date: Mon, 30 Mar 2015 10:09:30 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

On Sun, Mar 29, 2015 at 08:51:30PM +0100, Matt Wenham wrote:
> I have found a use case which has made me unsure as to how gawk 4.1.1
> treats NUL characters within fields and how they are parsed by the
> regex engine.
> 
> I have a series of files which I am trying to process and validate
> using gawk. A small number of the files are corrupt and contain runs
> of NUL characters which I would like to reject as invalid.
> 
> I tried the following code:
> 
> BEGIN {
>     FS="[#/]"   #Split at hash or slash
>     OFS = ":"
> }
> 
> $10 ~ "^7$" {
>     print NR, $10
> }
> 
> This successfully matches the digit '7' followed by a run of NULs in
> the tenth field. However, using
> 
> $10 ~ "^7\0+$"
> 
> fails to match the same tenth field despite the explicitly specified
> NUL character. From everything I've read, this is unexpected
> behaviour.
> 
> I am using GnuWin32 in this case. I asked about the issue on
> Stackoverflow, and another user has found that this behaviour does not
> occur with gawk 3.1.5 on CentOS 5, but does occur with gawk 4.1.1 on
> debian unstable.
> 
> Is this expected behaviour? If so how? Is it possible to successfully
> parse NUL characters in 4.1.1?

I built a test input file as follows (attached):

   echo -n 1/2/3/4/5/6/7/8/9/7 > /tmp/nuls
   dd if=/dev/zero bs=10b count=1 >> /tmp/nuls
   echo "" >> /tmp/nuls
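
A slightly more portable way to build an equivalent file, as a sketch only: it assumes `printf`, `head -c`, and `mktemp` are available, and the variable name `nulfile` is a hypothetical stand-in for /tmp/nuls. Since `bs=10b` means ten 512-byte blocks, the file should hold 5120 NULs and be 19 + 5120 + 1 = 5140 bytes long.

```shell
# Build the test file: ten '/'-separated fields, the tenth being "7"
# followed by 5120 NUL bytes, then a trailing newline.
nulfile=$(mktemp)                      # hypothetical path, stands in for /tmp/nuls
printf '1/2/3/4/5/6/7/8/9/7' > "$nulfile"
head -c 5120 /dev/zero >> "$nulfile"   # same byte count as dd bs=10b count=1
printf '\n' >> "$nulfile"

# Sanity checks: total size, and the number of NUL bytes in the file.
# tr -cd '\000' deletes every byte that is not NUL, leaving only the NULs.
total=$(wc -c < "$nulfile" | tr -d ' ')
nuls=$(tr -cd '\000' < "$nulfile" | wc -c | tr -d ' ')
echo "total=$total nuls=$nuls"

rm -f "$nulfile"
```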

It works fine for me on the gawk master and gawk-4.1-stable branches:

bash-4.2$ ./gawk 'BEGIN {FS="[#/]"; OFS = ":"} $10 ~ "^7$" {print NR, $10}' /tmp/nuls
bash-4.2$ ./gawk 'BEGIN {FS="[#/]"; OFS = ":"} $10 ~ "^7\0+$" {print NR, $10}' /tmp/nuls
1:7
bash-4.2$ ./gawk 'BEGIN {FS="[#/]"; OFS = ":"} $10 ~ "^7\0+$" {print NR, $10}' /tmp/nuls | od -c
0000000   1   :   7  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0012000  \0  \0  \0  \n
0012004

I downloaded and built gawk-4.1.1, and it is broken in that version.  So it
appears that the bug has been fixed, but not yet in a released version.
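
Until a fixed release is available, one possible way to reject such files is to sidestep the regex engine and count NUL bytes with `tr`. This is only a sketch: the function name `has_nul` and the two sample files are hypothetical, and it checks whole files rather than the tenth field.

```shell
# has_nul FILE: succeed (exit status 0) if FILE contains at least one NUL byte.
has_nul() {
    # Keep only the NUL bytes and count them; no regex matching is involved.
    [ "$(tr -cd '\000' < "$1" | wc -c | tr -d ' ')" -gt 0 ]
}

# Two small sample inputs (hypothetical): one clean, one with a run of NULs.
clean=$(mktemp)
corrupt=$(mktemp)
printf 'a/b/c\n' > "$clean"
printf 'a/b/7\000\000\000\n' > "$corrupt"

for f in "$clean" "$corrupt"; do
    if has_nul "$f"; then
        echo "reject: $f"
    else
        echo "accept: $f"
    fi
done
```

Files flagged this way could be skipped before they are handed to gawk at all.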

Regards,
Andy

Attachment: nuls
Description: Binary data

