Re: GNU grep back references
From: Julian Foad
Subject: Re: GNU grep back references
Date: Mon, 10 Oct 2005 12:59:46 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511
Jan Schampera wrote:
Hello there,
Grep seems to forget back references to \(\)-expressions when reading a
new line of input.
I'm sure this is useful sometimes, but that's not the behaviour I'd
expect when reading the IEEE Std 1003.1, 2004 Edition (I'm sure it was
there earlier, maybe in the POSIX basic documents too; I just can't find
it right now):
"The back-reference expression '\n' shall match the same (possibly
empty) string of characters as was matched by a subexpression enclosed
between "\(" and "\)" preceding the '\n'." [IEEE1003.1-BRE]
For reference, here is an on-line copy:
<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html>
This definition specifies how a Basic Regular Expression matches (part of) a
string. Your query depends on how Grep applies BREs to strings. I believe
Grep is supposed to attempt to match each line of input individually against
the BRE, not treat the whole input as one long string. Note the following
paragraph from section 9.2:
"[...] Some utilities employing regular expressions limit the processing to
lines; [...]. Those utilities (like grep) that do not allow <newline>s to match
are responsible for eliminating any <newline> from strings before matching
against the RE. [...]"
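
To make the line-by-line model concrete, here is a minimal sketch in Python. It is not how any grep is implemented; it only assumes that each line is stripped of its <newline> and matched on its own. Python's `re` syntax stands in for a BRE (the BRE \([[:digit:]]\{2\}\)\1 is written `([0-9]{2})\1`), and the names `grep_lines`, `PATTERN` and `sample` are made up for the illustration.

```python
import re

# Assumption: Python's `re` stands in for a POSIX BRE engine.
# BRE \([[:digit:]]\{2\}\)\1  is spelled  ([0-9]{2})\1  here.
PATTERN = re.compile(r"([0-9]{2})\1")


def grep_lines(pattern, text):
    """Return the lines of `text` that match `pattern`, one line at a time."""
    selected = []
    for line in text.splitlines():  # splitlines() removes the <newline>s
        # Each search is independent: nothing captured on an earlier
        # line is visible to the back-reference on this line.
        if pattern.search(line):
            selected.append(line)
    return selected


sample = "1212\n3434\n1234\n"
print(grep_lines(PATTERN, sample))  # ['1212', '3434']
```

Under this model "3434" is selected even though the first line captured "12", and an input of "12" followed by "12" on the next line selects nothing, because a back-reference never reaches across a line boundary.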
From the Single UNIX Specification v2 Grep page
<http://www.opengroup.org/onlinepubs/007908799/xcu/grep.html>, Description,
second paragraph:
"[...] since patterns are matched against individual lines of the input [...]"
Note also the following sentence from the same paragraph as your quote, 9.3.6.3:
"When the referenced subexpression matched more than one string, the
back-referenced expression shall refer to the last matched string."
It seems that SVR4's Grep, if it is intentionally treating the whole input as
one long string, is in any case wrongly taking the back-reference to refer to
the _first_ string matched by the subexpression.
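
The "last matched string" rule can be checked with a small sketch. Again this uses Python's `re` as a stand-in, on the assumption that it treats a repeated group the way the quoted sentence describes; `([0-9]){2}\1` is the `re` spelling of the BRE \([[:digit:]]\)\{2\}\1, and the names `LAST_WINS` and `matches` are hypothetical.

```python
import re

# Assumption: Python's `re` stands in for a POSIX BRE engine.
# BRE \([[:digit:]]\)\{2\}\1  is spelled  ([0-9]){2}\1  here.
LAST_WINS = re.compile(r"([0-9]){2}\1")


def matches(line):
    """True if `line` contains a match for LAST_WINS."""
    return LAST_WINS.search(line) is not None


m = LAST_WINS.search("122")
# The group ran twice ("1", then "2"); the back-reference uses the last one.
print(m.group(0), m.group(1))  # 122 2
print(matches("121"))          # False: \1 would have to be "2", not "1"
```

If the back-reference referred to the first string matched by the subexpression, "121" would match and "122" would not, which is the opposite of what the sketch shows.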
I hope this helps and shows that GNU Grep is doing the right thing.
- Julian
The SVR4 grep utility (usually /usr/xpg4/bin/grep) acts as expected: it
"remembers" the first \(\)-expression for its back reference, regardless
of how much input it reads.
You see it for example when grep matches lines like
"\([[:digit:]]\)\{2\}\1"
The GNU grep behaviour is to match every line that looks like (letters
are digits): "ABAB", "CDCD", "EFEF".
The grep from the xpg4 package matches only the lines that contain (using
the back reference) the very first \(\)-expression's literal content.
Summary:
GNU grep "forgets" back references on every new line of input
xpg4 grep "remembers" the very first matched content over all its input
Any comments on this behaviour?
Best regards and thanks for your work,
Jan
[IEEE1003.1-BRE]
The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition
Chapter 9.3.6, Paragraph 3