Re: [bug-gawk] Regex treatment of NUL characters within fields


From: arnold
Subject: Re: [bug-gawk] Regex treatment of NUL characters within fields
Date: Mon, 30 Mar 2015 08:03:46 -0600
User-agent: Heirloom mailx 12.4 7/29/08

Hi.

This is a fascinating bug report.  It looks like a bug in the regex
matching, since the $ should not match while NUL characters still
follow the 7.  Debugging this will be a challenge, but I'll take a look.

In the meantime, something like tr -d '\0' can be used to simply
remove NUL bytes from an input file.
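
A minimal sketch of that workaround, assuming a POSIX shell and any awk
(gawk included). The file names sample.dat and clean.dat, and the
two-field sample record, are made up for illustration and are not taken
from the original report:

```shell
# sample.dat and clean.dat are hypothetical file names used only here.
# Write one record whose second field is "7" followed by three NUL bytes.
printf 'a#7\000\000\000\n' > sample.dat

# Delete every NUL byte before awk ever sees the data.
tr -d '\0' < sample.dat > clean.dat

# With the NULs gone, the anchored pattern matches only a bare "7".
awk -F'[#/]' '$2 ~ "^7$" { print NR, $2 }' clean.dat
```

Note that this strips the NULs rather than flagging the corrupt files,
so it sidesteps the regex question instead of answering it.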

I will work on this.

Thanks,

Arnold

Matt Wenham <address@hidden> wrote:

> I have found a use case which has made me unsure as to how gawk 4.1.1
> treats NUL characters within fields and how they are parsed by the
> regex engine.
>
> I have a series of files which I am trying to process and validate
> using gawk. A small number of the files are corrupt and contain runs
> of NUL characters which I would like to reject as invalid.
>
> I tried the following code:
>
> BEGIN {
>     FS = "[#/]"   # Split at hash or slash
>     OFS = ":"
> }
>
> $10 ~ "^7$" {
>     print NR, $10
> }
>
> This successfully matches the digit '7' followed by a run of NULs in
> the tenth field. However, using
>
> $10 ~ "^7\0+$"
>
> fails to match the same tenth field despite the explicitly specified
> NUL character. From everything I've read, this is unexpected
> behaviour.
>
> I am using GnuWin32 in this case. I asked about the issue on
> Stack Overflow, and another user has found that this behaviour does
> not occur with gawk 3.1.5 on CentOS 5, but does occur with gawk 4.1.1
> on Debian unstable.
>
> Is this expected behaviour? If so, how? Is it possible to successfully
> parse NUL characters in 4.1.1?
>
> Many thanks,
>
> Dr. Matt Wenham.


