bug-grep

Re: GNU grep back references


From: Julian Foad
Subject: Re: GNU grep back references
Date: Mon, 10 Oct 2005 12:59:46 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511

Jan Schampera wrote:
Hello there,

Grep seems to forget back references to \(\)-expressions when reading a
new line of input. I'm sure this is useful sometimes, but that's not the
behaviour I'd expect when reading the IEEE Std 1003.1, 2004 Edition (I'm
sure it was there earlier, maybe in the basic POSIX documents too; I just
can't find it right now):

"The back-reference expression '\n' shall match the same (possibly
empty) string of characters as was matched by a subexpression enclosed
between "\(" and "\)" preceding the '\n'." [IEEE1003.1-BRE]

For reference, here is an on-line copy: <http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html>

This definition specifies how a Basic Regular Expression matches (part of) a string. Your query depends on how Grep applies BREs to strings. I believe Grep is supposed to attempt to match each line of input individually against the BRE, not treat the whole input as one long string. Note the following paragraph from section 9.2:

"[...] Some utilities employing regular expressions limit the processing to lines; [...]. Those utilities (like grep) that do not allow <newline>s to match are responsible for eliminating any <newline> from strings before matching against the RE. [...]"

From the Single UNIX Specification v2 Grep page <http://www.opengroup.org/onlinepubs/007908799/xcu/grep.html>, Description, second paragraph:

  "[...] since patterns are matched against individual lines of the input [...]"
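The line-by-line behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not Grep's actual implementation: the helper name grep_lines is made up for this example, and Python's re syntax differs from POSIX BRE (the pattern (\d\d)\1 here stands in for something like \([[:digit:]][[:digit:]]\)\1).

```python
import re


def grep_lines(pattern, text):
    """Return the lines of `text` that contain a match for `pattern`.

    Each line is stripped of its newline and searched on its own, so a
    group captured on one line is never visible to a back-reference
    evaluated on another line.
    """
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


sample = "1212\n3434\n1234\n"

# Every search starts with no remembered groups, so "3434" matches even
# though the first line captured "12".
print(grep_lines(r"(\d\d)\1", sample))  # ['1212', '3434']
```

Under this model the back-reference is resolved afresh for every line, which is the behaviour the original report calls "forgetting".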


Note also the following sentence from the same paragraph as your quote, 9.3.6.3:

"When the referenced subexpression matched more than one string, the back-referenced expression shall refer to the last matched string."

It seems that SVR4's Grep, if it is intentionally treating the whole input as one long string, is in any case wrongly taking the back-reference to refer to the _first_ string matched by the subexpression.
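As a rough illustration of the "last matched string" rule quoted from 9.3.6.3, here is a small Python sketch. It assumes Python's re module, whose syntax is not POSIX BRE; the pattern (\d){2}\1 is meant as an analogue of \([[:digit:]]\)\{2\}\1, and the helper name matches is hypothetical. Python happens to follow the same rule for a repeated group, so it is used here only to show the idea, not to prove what any particular grep does.

```python
import re

# Analogue of the BRE \([[:digit:]]\)\{2\}\1 in Python syntax (an assumption
# made for this illustration; the two regex dialects are not identical).
PATTERN = re.compile(r"(\d){2}\1")


def matches(line):
    """Return True if `line` contains a match for PATTERN."""
    return PATTERN.search(line) is not None


# The group is matched twice; the back-reference refers to the second
# (last) digit, not the first one.
print(matches("122"))  # True:  group matched "1" then "2"; \1 is "2"
print(matches("121"))  # False: \1 would have to be "2", but "1" follows
```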


I hope this helps and shows that GNU Grep is doing the right thing.

- Julian


The SVR4 grep utility (usually /usr/xpg4/bin/grep) acts as expected: it
"remembers" the first \(\)-expression for its back reference, regardless
of how much input it reads.

You see it, for example, when grep matches lines against a pattern like
"\([[:digit:]]\)\{2\}\1"
The GNU grep behaviour is to match every line that looks like (letters
are digits): "ABAB", "CDCD", "EFEF".

The grep from the xpg4 package matches only the lines that contain (using
the back reference) the very first \(\)-expression's literal content.

Summary:
GNU grep "forgets" back references on every new line of input
xpg4 grep "remembers" the very first matched content over all its input

Any comments on this behaviour?

Best regards and thanks for your work,
Jan

[IEEE1003.1-BRE]
The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition
Chapter 9.3.6, Paragraph 3



