bug-gawk

Re: Use of '()' in a regexp


From: Ed Morton
Subject: Re: Use of '()' in a regexp
Date: Thu, 7 Jan 2021 11:02:06 -0600
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0

Sounds good. From testing how `split()` and setting `FS` behave, it looks like that rule applies to field separators as well as record separators, which would explain these differences:

$ printf 'foo\n' | awk '{print gsub(/()/,"X")}1'
4
XfXoXoX

$ printf 'foo\n' | awk '{print split($0,a,/()/); for (i=1; i in a; i++) print a[i]}'
1
foo
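
The `FS` case mentioned above can be checked the same way. What follows is only a minimal sketch: it assumes gawk is installed as `gawk`, and the variable names `subs` and `nf` are illustrative. The field count is printed rather than stated, since it may depend on the gawk version in use.

```shell
# Sketch: compare how gawk treats the null-matching regexp () in gsub()
# versus as a field separator. Assumes gawk is on PATH as "gawk".
if command -v gawk >/dev/null 2>&1; then
    # gsub() matches the null string around every character of "foo".
    subs=$(printf 'foo\n' | gawk '{ print gsub(/()/, "X") }')
    echo "gsub() substitutions: $subs"

    # Same regexp assigned to FS; the field count is printed rather than
    # assumed, since it may differ between gawk versions.
    nf=$(printf 'foo\n' | gawk -v FS='()' '{ print NF }')
    echo "NF with FS='()': $nf"
fi
```

If `NF` comes out as 1, then `FS` is treating `()` the same way `split()` does in the example above.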

I don't think that's documented anywhere currently. It may be worth a brief statement in the manual, something like "if RS is a multi-char regexp that can match the null string (e.g. `RS='()'`) then ....", plus an almost identical statement where field separator values are described.

Whatever you decide... thanks for quickly looking into this and providing the fix and the explanation!

    Ed.

On 1/7/2021 8:07 AM, arnold@skeeve.com wrote:
The answer is "no".  Record separators must be non-null; the only exception
where RT will be "" is at the end of a file.
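
A small sketch of what that means for `RT` in practice. It assumes gawk is installed as `gawk` (`RT` is a gawk extension), and the variable name `rt_out` is illustrative. The input deliberately has no trailing separator, so the last record ends at end of file.

```shell
# Sketch: print each record together with the RT that terminated it.
# RT is a gawk extension, so this assumes gawk is on PATH as "gawk".
if command -v gawk >/dev/null 2>&1; then
    # "a" is terminated by ","; "b" is terminated by end of file,
    # which is the one case where RT is the null string.
    rt_out=$(printf 'a,b' | gawk -v RS=',' '{ printf "[%s|%s]", $0, RT }')
    echo "$rt_out"    # [a|,][b|]
fi
```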

This is also how Brian Kernighan's awk handles RS as a regexp.

Thanks,

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

In case that's not an adequate example, what I mean is, will this:

$ printf 'foo\nbar\n' | awk -v RS='()' -v ORS='X' '1'

then produce the same output as this:

$ printf 'foo\nbar\n' | awk -v RS='^$' '{gsub(/()/,"X")}1'
XfXoXoX
XbXaXrX
X

or not and, if not, why is it different?

I just noticed that the 4-argument `split()` seems to handle `/()/`
differently again from both of the cases above:

$ printf 'foo\nbar\n' | awk '{nf=split($0,flds,/()/,seps); print nf; for (i=0; i<=nf; i++) printf "%s%s", flds[i], "<"seps[i]">" ; print ""}'
1
<>foo<>
1
<>bar<>

Regards,

      Ed.

On 1/6/2021 2:54 PM, Ed Morton wrote:
Great! Will that treat `()` when used in an RS:

     awk -v RS='()' -v ORS='x' '1'

the same as it's treated in a regexp in other contexts such as with
gsub():

     awk -v ORS= '{gsub(/()/,"x")} 1'

or does it mean something different when used in an RS?

     Ed.

On 1/6/2021 1:33 PM, arnold@skeeve.com wrote:
Hi. Re this:

Ed Morton <mortoneccc@comcast.net> wrote:

Someone just pointed this out to me (gawk 5.1.0):

$ printf 'foo\n' | awk '{gsub(/()/,"x")} 1'
xfxoxox

$ printf 'foo\n' | awk -v RS='()' -v ORS='x\n' '1'
foox

Obviously that's a pretty ridiculous regexp but it still has me
wondering - why does `gsub()` treat the regexp `()` as matching a null
string around every character while `RS` treats it as if I'd asked it to
match the `\n` at the end of the input:

$ printf 'foo\n' | awk -v RS='\n$' -v ORS='x\n' '1'
foox

I could just file this under "don't write stupid regexps" but I was
wondering if there's a more concrete, satisfying explanation of the
behavior.

       Ed.

It's a bug. This appears to be the fix. It doesn't break the
test suite, either.

Thanks for the report!

Arnold
-----------------------------------------
diff --git a/io.c b/io.c
index 2714398e..0af8ab1e 100644
--- a/io.c
+++ b/io.c
@@ -3702,7 +3702,7 @@ again:
                 * If still room in buffer, skip over null match
                 * and restart search. Otherwise, return.
                 */
-               if (bp + iop->scanoff < iop->dataend) {
+               if (bp + iop->scanoff <= iop->dataend) {
                        bp += iop->scanoff;
                        goto again;
                }


