bug-gnu-utils

Re: gawk: {} repetition in patterns doesn't work?


From: Aharon Robbins
Subject: Re: gawk: {} repetition in patterns doesn't work?
Date: Thu, 22 Mar 2001 13:54:20 +0200

Greetings. Concerning all this:

> Date: Wed, 21 Mar 2001 10:59:55 -0800 (PST)
> From: Paul Eggert <address@hidden>
> To: address@hidden
> CC: address@hidden
> Subject: Re: gawk: {} repetition in patterns doesn't work?
>
> > From: address@hidden
> > Date: 21 Mar 2001 08:44:13 +0200
> > 
> > echo 'aa' | awk '/a{2}/' 
> > 
> > It prints 'aa' with HP-UX awk and so it should according to
> > my understanding of POSIX.2.
>
> That's correct.
>
> > It doesn't work with 'nawk' in Solaris either, though. A bug or a
> > feature?
>
> Both. :-)

Actually, it works in /usr/xpg4/bin/awk.

> The POSIX requirement is widely ignored, because it causes problems
> with patterns that contain stray '{' characters.  Historically, awk
> did not support the a{2} notation, and many awk scripts contain code
> that treats '{' as literal, e.g.:
>
>       /{.*}/ { print "found matching braces"; }
>
> POSIX says that the behavior of this code is undefined because of the
> stray '{'.  However, scripts like this work as expected with gawk, as
> well as with most other awks.

Paul understates the problem.  Turning interval expressions on breaks
old programs, period.  There are programs in the A, K, & W awk book that
don't work with interval expressions enabled.  I'm not about to
turn them on by default.

> gawk should do what GNU grep does: namely, support the POSIX
> requirement only when it is absolutely required, and otherwise treat
> stray braces as literal braces.  POSIX allows this behavior.  Here is
> a quote from the grep manual that should help explain things better:
>
>       GNU `egrep' attempts to support traditional usage by assuming that
>    `{' is not special if it would be the start of an invalid interval
>    specification.  For example, the shell command `egrep '{1'' searches
>    for the two-character string `{1' instead of reporting a syntax error
>    in the regular expression.  POSIX.2 allows this behavior as an
>    extension, but portable scripts should avoid it.
>
> On my list of things to do is to add support for this to GNU regexp.c.
> That should make it easy to fix gawk to be POSIX-compliant here.

Gawk doesn't work this way, and I disagree that it should.  Right now,
you must use --posix or --re-interval, or set POSIXLY_CORRECT in the
environment, to get interval expressions to work.  I think having
/a{2}/ be an interval expression but /{.*}/ be literal is confusing.

Gawk's current behavior has been in place for many years and generates
little or no complaint.  My experience is that people don't use interval
expressions much with gawk; the --re-interval option was in place,
but DIDN'T WORK for at least a year or two until somebody noticed.

POSIX compliance is a goal to meet when it's reasonable, but it's not
an overriding requirement, and I think the current behavior strikes
a reasonable balance.  If you want full POSIX compliance, put
POSIXLY_CORRECT in your environment, or set up a shell script, function,
or alias that adds --re-interval.
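
A minimal sketch of such wrappers, assuming gawk is installed and on
the PATH.  The function names gawk_ri and gawk_posix are made up for
illustration; only --re-interval and POSIXLY_CORRECT come from the
discussion above.

```shell
# Hypothetical wrapper: run gawk with interval expressions enabled.
# "command" bypasses any shell function or alias that is also named gawk.
gawk_ri() {
    command gawk --re-interval "$@"
}

# Hypothetical wrapper: run gawk in POSIX mode for this one invocation
# by placing POSIXLY_CORRECT in that command's environment only.
gawk_posix() {
    POSIXLY_CORRECT=1 command gawk "$@"
}
```

With either function defined, echo 'aa' | gawk_ri '/a{2}/' should print
the input line, since a{2} is then read as "exactly two a's".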

As Paul points out, interval expressions aren't something you can rely
on for portability to other awks in any case, even if gawk isn't part
of the picture at all.
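
For scripts that have to run under several awks, one way to sidestep
the question is to avoid bare braces in patterns altogether.  The
following is only a sketch of that idea, not something any of the awks
discussed here requires: the repetition is spelled out, and a literal
brace goes inside a bracket expression, where it is never special.

```shell
# Instead of /a{2}/, spell the repetition out.
echo 'aa' | awk '/aa/'

# Instead of /{.*}/, wrap each brace in a bracket expression so that it
# is literal whether or not interval expressions are enabled.
echo '{x}' | awk '/[{].*[}]/ { print "found matching braces" }'
```

The first command prints aa and the second prints "found matching
braces", regardless of how the awk in use treats a bare '{'.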

Arnold


