Re: [bug-gawk] Computed regex and getline bug / issue
From: Andrew J. Schorr
Subject: Re: [bug-gawk] Computed regex and getline bug / issue
Date: Tue, 6 May 2014 10:18:21 -0400
User-agent: Mutt/1.5.21 (2010-09-15)
Hi,
On Mon, May 05, 2014 at 09:04:23AM +0300, Aharon Robbins wrote:
> It is a heuristic. Consider an RS like what we have: RS = ",+". Here,
> we want as many commas as we can possibly slurp up. Now consider a file
> like so, where the | indicates a file block boundary:
>
> .... ,,, | ,, ...
>
> rsrescan has seen the first three commas, but it doesn't know if the
> next block starts with a comma, or with something else. So it tells
> get_a_record, "read some more data and retry", in case there's more
> stuff that could be matched.
>
> This was done to solve a real problem I encountered, where something
> like foo(bar)* was the RS and the "foo" fell exactly on the end of
> the block boundary; even though there was a "bar" at the beginning of
> the next block, gawk wasn't picking it up.
That makes some sense conceptually. I think it would be wise to have
test cases for both of these problems.
I tried to make a test case for the problem you describe, and I am
not having any luck. Can you see what I'm doing wrong? The input
file blockboundary.in is attached.
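For readers without the attachment: the four 7-byte read(0, ...) calls in the strace output further down suggest the file is 28 bytes long with no trailing newline. A minimal sketch for recreating it, assuming that inference is right (the contents here are reconstructed from the trace, not copied from the attachment itself):

```shell
# Assumed contents, inferred from the read(0, ...) calls in the strace output.
# printf '%s' adds no trailing newline, so the file is exactly 28 bytes.
printf '%s' 'catsfoodogsfoomicefoobarbats' > blockboundary.in

# Show how the data falls into 7-byte blocks when AWKBUFSIZE=7.
fold -w 7 blockboundary.in
echo
```

With this layout the third block ends in "foo" and the fourth begins with "bar", which is the boundary case described in the quoted text.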
With gawk 4.1.1:
bash-4.2$ AWKBUFSIZE=7 /bin/gawk -v "RS=foo(bar)*" 1 < blockboundary.in
cats
dogs
mice
bats
bash-4.2$ AWKBUFSIZE=7 /bin/gawk -v "RS=foo(bar)*" '{print; rc = getline; print rc; print}' < blockboundary.in
cats
1
dogs
mice
0
mice
With my patch to prevent rsrescan from returning TERMNEAREND:
bash-4.2$ AWKBUFSIZE=7 ./gawk -v "RS=foo(bar)*" 1 < blockboundary.in
cats
dogs
mice
bats
bash-4.2$ AWKBUFSIZE=7 ./gawk -v "RS=foo(bar)*" '{print; rc = getline; print rc; print}' < blockboundary.in
cats
1
dogs
mice
1
bats
Using strace, I see that gawk seems to do a certain amount of readahead in any
case:
bash-4.2$ AWKBUFSIZE=7 strace -eread ./gawk -v "RS=foo(bar)*" 1 < blockboundary.in
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0PO\1\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\252\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\300\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\16\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260T\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>address@hidden"..., 832) = 832
read(0, "catsfoo", 7) = 7
read(0, "dogsfoo", 7) = 7
cats
read(0, "micefoo", 7) = 7
dogs
read(0, "barbats", 7) = 7
mice
read(0, "", 7) = 0
bats
+++ exited with 0 +++
Do you have any thoughts on how to construct a test case that will show
the TERMNEAREND problem?
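One way to look for a failing case might be to sweep AWKBUFSIZE so that the block boundary lands at every offset within the terminator, and compare each run against the expected records. A rough sketch only: sweep_bufsize and its arguments are hypothetical names, not part of gawk or its test suite, and it assumes gawk accepts each buffer size it is given.

```shell
# Hypothetical helper, not part of gawk: try each AWKBUFSIZE from 1 to MAX_SIZE
# and report the sizes at which the printed records differ from the expected text.
# usage: sweep_bufsize GAWK_BINARY INPUT_FILE EXPECTED_OUTPUT [MAX_SIZE]
sweep_bufsize() {
    gawk_bin=$1
    input=$2
    expected=$3
    max=${4:-16}
    n=1
    while [ "$n" -le "$max" ]; do
        # Same RS as in the commands above; the program "1" prints every record.
        actual=$(AWKBUFSIZE=$n "$gawk_bin" -v 'RS=foo(bar)*' 1 < "$input")
        if [ "$actual" != "$expected" ]; then
            echo "mismatch at AWKBUFSIZE=$n"
        fi
        n=$((n + 1))
    done
}
```

It could be called as sweep_bufsize ./gawk blockboundary.in "$(printf 'cats\ndogs\nmice\nbats')" and stays silent when every buffer size yields the same four records. It only covers the plain record loop; the getline variant would need its own expected output.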
Regards,
Andy
Attachment: blockboundary.in
Description: Text document