Re: [bug-gawk] Regex treatment of NUL characters within fields
From: arnold
Subject: Re: [bug-gawk] Regex treatment of NUL characters within fields
Date: Mon, 30 Mar 2015 08:03:46 -0600
User-agent: Heirloom mailx 12.4 7/29/08
Hi.
This is a fascinating bug report. It looks like a bug in the regex
matching, since the $ should not match. Debugging this will be a
challenge, but I'll take a look.
In the meantime, something like tr -d '\0' can be used to simply
remove NUL bytes from an input file.
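A minimal sketch of that workaround, assuming a POSIX shell; the file
names (corrupt.txt, clean.txt) and the sample line are hypothetical.
Comparing byte counts before and after tr is one way to flag a file as
corrupt instead of silently cleaning it:

```shell
# Hypothetical sample: the third field is "7" followed by a run of three NULs.
tmpdir=$(mktemp -d)
printf 'a#b/7\0\0\0#c\n' > "$tmpdir/corrupt.txt"

# Detect: if deleting NULs changes the byte count, the file contains NULs.
raw_bytes=$(wc -c < "$tmpdir/corrupt.txt")
clean_bytes=$(tr -d '\0' < "$tmpdir/corrupt.txt" | wc -c)
if [ "$raw_bytes" -ne "$clean_bytes" ]; then
    status=corrupt
else
    status=ok
fi
echo "$status"

# Strip: remove every NUL byte before handing the data to awk.
tr -d '\0' < "$tmpdir/corrupt.txt" > "$tmpdir/clean.txt"

# With the NULs gone, the field is exactly "7" and the anchored match is unambiguous.
result=$(awk -F'[#/]' '$3 ~ /^7$/ { print NR ":" $3 }' "$tmpdir/clean.txt")
echo "$result"

rm -rf "$tmpdir"
```

Neither step relies on how the regex matcher treats NUL.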
I will work on this.
Thanks,
Arnold
Matt Wenham <address@hidden> wrote:
> I have found a use case which has made me unsure as to how gawk 4.1.1
> treats NUL characters within fields and how they are parsed by the
> regex engine.
>
> I have a series of files which I am trying to process and validate
> using gawk. A small number of the files are corrupt and contain runs
> of NUL characters which I would like to reject as invalid.
>
> I tried the following code:
>
> BEGIN {
>     FS = "[#/]"    # Split at hash or slash
>     OFS = ":"
> }
>
> $10 ~ "^7$" {
>     print NR, $10
> }
>
> This successfully matches the digit '7' followed by a run of NULs in
> the tenth field. However, using
>
> $10 ~ "^7\0+$"
>
> fails to match the same tenth field despite the explicitly specified
> NUL character. From everything I've read, this is unexpected
> behaviour.
>
> I am using GnuWin32 in this case. I asked about the issue on
> Stack Overflow, and another user has found that this behaviour does not
> occur with gawk 3.1.5 on CentOS 5, but does occur with gawk 4.1.1 on
> Debian unstable.
>
> Is this expected behaviour? If so, how? Is it possible to successfully
> parse NUL characters in 4.1.1?
>
> Many thanks,
>
> Dr. Matt Wenham.