Re: regexp regressions

bug-gnulib

From:	Paul Eggert
Subject:	Re: regexp regressions
Date:	Fri, 26 Aug 2005 11:27:54 -0700
User-agent:	Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

Sam Steingold <address@hidden> writes:

> frankly I don't know and don't care whether the old or new was / is broken.
> All I care about is consistency.

I'm afraid that we won't get complete consistency when using various
regex implementations.  They don't agree about the fine details.  As
it stands, the POSIX regex spec isn't even clear about what variations
in behavior are permitted.  I've tried to have some of this fixed in
POSIX itself, but have mostly failed.  Sorry, but committees often work
that way....

> May I suggest that you add regression testing to the parts of gnulib
> that exhibit non-trivial functionality, like regex?

gnulib regex is intended to be a mirror of glibc regex, and that would
be a better place for regression tests.

> Does glibc come with regression tests?
> Do those tests cover regex?

I assume it does, but I don't know offhand how good the tests are.

> Consistency over time - or at least explicitly documented changes -
> is quite important (IMNSHO).

The recent near-complete rewrite of glibc regex has caused quite a few
problems.  However, the rewrite was judged to be important, as the old
code did not support multibyte expressions.

> Actually, the careful examination of the examples appears to indicate
> that the previous behavior was "more" correct.
> Specifically, the first 3 of the 6 regressions are clearly bugs in the
> current regex implementation while the last 3 are acceptable - but
> undesirable - variations.

Let's go through the examples again and see if we agree.  I'm using
the CVS gnulib regex, along with the C program enclosed below.

FORM: (RE_TEST "(^)*" "-")
CLISP : ("" "")

Here I guess you wanted the "^" to be treated as an ordinary
character, not as the beginning-of-string pattern.  But POSIX requires
"^" to be the beginning-of-string pattern in this context; see
<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04_09>
and see the ERE grammar
<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_05_03>.
So the new output conforms to POSIX (and the old GNU regex behavior
did not).
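
For readers who want to try the "(^)*" case without CLISP, here is a minimal sketch against the POSIX <regex.h> interface.  It is not the C program mentioned above; whole_match is a hypothetical helper name, and it reports only the offsets of the overall match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX ERE, run it on SUBJECT,
   and store the offsets of the overall match in *START and *END.
   Returns 0 on a match, REG_NOMATCH on no match, or the regcomp error
   code if the pattern does not compile. */
int whole_match(const char *pattern, const char *subject,
                int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && subject != NULL);
    assert(start != NULL && end != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;              /* compile failed; nothing to free */

    rc = regexec(&re, subject, 1, m, 0);
    if (rc == 0) {
        *start = (int) m[0].rm_so;
        *end = (int) m[0].rm_eo;
    }
    regfree(&re);
    return rc;
}
```

Calling whole_match("(^)*", "-", &s, &e) from a small main should report an empty match at offset 0, which agrees with the ("" "") output shown above.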

FORM: (RE_TEST "([xyz])(-\2)" "x-y")
REGEXP:REGEXP-COMPILE ("Invalid back reference"): "Invalid back reference"

FORM: (RE_TEST "((.*)\1)+" "xxxxxx")
REGEXP:REGEXP-COMPILE ("Invalid back reference"): "Invalid back reference"

I don't see why this behavior is buggy.  First, we're talking about an
extension to POSIX here, as POSIX says that the meaning of \1 and \2
is undefined in EREs (see
<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04_02>),
so the test is relying on unportable behavior.

Let's assume that \1 and \2 are supposed to be backreferences (the
most reasonable interpretation).  Then POSIX says that for BREs the
backreferences are invalid since fewer than n subexpressions precede
the backreferences (see
<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_06>).
I guess that the above tests are assuming that, if an invalid
backreference is found, then the \1 should be silently treated like
"1" (i.e., the backslash should be ignored).  POSIX allows that
behavior, but I think the new regex behavior is more useful, as the
expression is most likely a typo.
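
The two outcomes discussed above (rejecting the pattern, or quietly accepting it) can be told apart by looking at the regcomp return value.  The following is only a sketch under the assumption that the patterns are compiled as EREs; compile_status is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX ERE and return the
   regcomp status.  On failure, copy the library's error message into
   MSG (at most MSGLEN bytes, NUL-terminated).  On success MSG is left
   as an empty string. */
int compile_status(const char *pattern, char *msg, size_t msglen)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);
    assert(msg != NULL && msglen > 0);

    msg[0] = '\0';
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* regerror accepts the regex_t from the failed regcomp call. */
        regerror(rc, &re, msg, msglen);
        return rc;
    }
    regfree(&re);
    return 0;
}
```

With the regex discussed here, compile_status("([xyz])(-\\2)", msg, sizeof msg) is expected to return REG_ESUBREG with a message along the lines of "Invalid back reference".  An implementation that accepts the pattern would return 0 instead, so a portable test should tolerate both results.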

FORM: (RE_TEST "(.*)*\1" "xx")
CLISP : ("xx" "x")

Here, if we make the unportable assumption that \1 is a BRE-style
backreference, the old behavior ("xx" "") was incorrect, since
POSIX requires that subexpressions match the longest possible string,
left to right
<http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_01_02>,
and "x" is a longer match than the "" that your test expected.

FORM: (RE_TEST "(a*)*" "aaaa")
CLISP : ("aaaa" "aaaa")

FORM: (RE_TEST "(a*)+" "aaaa")
CLISP : ("aaaa" "aaaa")

The leftmost-longest match rule applies here, too, and again I think
the current behavior is correct and the old behavior was buggy.
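
To compare subexpression results across regex implementations, something like the following sketch could print the overall match and each subexpression in the same ("whole" "sub1" ...) style as the CLISP output.  The name format_matches, the MAX_GROUPS limit, and the NIL spelling for unset subexpressions are all assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_GROUPS 10   /* overall match plus up to nine subexpressions */

/* Hypothetical helper: match SUBJECT against the POSIX ERE PATTERN and
   write the overall match and each subexpression into OUT in the style
   ("whole" "sub1" ...).  Unset subexpressions are written as NIL.
   Returns 0 on success, the regcomp/regexec status on failure, or -1
   if OUT is too small. */
int format_matches(const char *pattern, const char *subject,
                   char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    size_t n, i, used;
    int rc, w;

    assert(pattern != NULL && subject != NULL);
    assert(out != NULL && outlen > 0);
    out[0] = '\0';

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, subject, MAX_GROUPS, m, 0);
    n = re.re_nsub + 1;         /* read before the regex_t is freed */
    if (n > MAX_GROUPS)
        n = MAX_GROUPS;
    regfree(&re);
    if (rc != 0)
        return rc;

    for (i = 0; i < n; i++) {
        const char *sep = (i == 0) ? "(" : " ";
        used = strlen(out);
        if (m[i].rm_so < 0)
            w = snprintf(out + used, outlen - used, "%sNIL", sep);
        else
            w = snprintf(out + used, outlen - used, "%s\"%.*s\"", sep,
                         (int) (m[i].rm_eo - m[i].rm_so),
                         subject + m[i].rm_so);
        if (w < 0 || (size_t) w >= outlen - used)
            return -1;
    }
    used = strlen(out);
    w = snprintf(out + used, outlen - used, ")");
    if (w < 0 || (size_t) w >= outlen - used)
        return -1;
    return 0;
}
```

The overall match for "(a*)*" and "(a*)+" on "aaaa" should be "aaaa" everywhere.  What gets reported for the subexpression is the implementation-dependent part, so it is better to print it and compare than to hard-code an expected value.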

Current Thread

regexp regressions, Sam Steingold, 2005/08/05
- Re: regexp regressions, Paul Eggert, 2005/08/21
  - Re: regexp regressions, Sam Steingold, 2005/08/26
    - Re: regexp regressions, Paul Eggert <=
