emacs-pretest-bug
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Stack overflow in regexp matcher


From: Stefan Monnier
Subject: Re: Stack overflow in regexp matcher
Date: Sun, 01 Apr 2007 14:50:47 -0400
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.96 (gnu/linux)

> The regexp in question looks like: ".*some text.*". When used with
> string-match, removing both ".*"s shouldn't affect the matching
> semantically, right?

Differences (off the top of my head):
- match-beginning (as well as the return value of string-match)
- .* only matches chars other than \n
- .* matches greedily, so (string-match "foo" "foofoo") will only match the
  first "foo", whereas (string-match ".*foo" "foofoo") will match the whole
  string (the .* matches the first "foo").

> If so, would it be wise to add a regexp simplification step that, for
> example, could remove leading and trailing ".*"s in regular expressions
> passed to string-match?

In the code that calls string-match, yes, but within string-match it would
be terribly difficult to figure out whether the optimization is harmless.

> I realise now is not the time to talk about feature additions, but it
> would certainly make my life simpler if any regular expression that *can*
> be compiled to a DFA is.

I don't think you'll find much disagreement here.

> (I also wonder where "regular expressions" that are more than regular
> came from.  Just because they are useful?)

No idea.  Originally, only the \N backref was non-regular, but PCRE
introduced many others.  Note that using a DFA to implement unix
regexp-matching is actually pretty difficult because of the precise
semantics of sub-group matching.  So most "early" implementations used
backtracking, which made non-regular extensions "free".


        Stefan




reply via email to

[Prev in Thread] Current Thread [Next in Thread]