help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Frank Heckenbach
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Wed, 19 May 2004 14:49:19 +0200
User-agent: semail 20040101

Laurence Finston wrote:

> On Wed, 19 May 2004, Frank Heckenbach wrote:
> 
> > Laurence Finston wrote:
> 
> >
> > > >
> > > I agree that regexps can be handy. I just don't need them for the scanner 
> > > and
> > > parser that I'm working on now.
> > > As far as Magnus' problem is concerned, I've tried to explain why I think 
> > > his
> > > approach won't work.  In addition, the Flex manual states that the use of
> > > trailing context in the regexps will slow the scanner down considerably.
> >
> > Only variable trailing context, which means that the length of both
> > the regex and the context is variable.
> >
> > > There
> > > is no away around this, it is implicit in the nature of regular expression
> > > matching.
> >
> > Perhaps you got confused by the flex terminology. Trailing context
> > means context given with `/foo', and only variable length context is
> > dangerous. This is in no way implicit, and can usually be avoided
> > (I've always been able to so far).
> 
> I just checked the Flex manual, and at
> the beginning of Chapter 17, "Performance Considerations", on page 45,
> it states that "arbitrary trailing context" and "pattern sets that
> require backing up" as the second and third-most expensive
> "options/actions" that degrade performance.  "beginning of
> line operator" is the eighth-most expensive.  What I was thinking of
> was the problem of matching a regexp like
> ".*^(`a' || 'bb' || 'ccc' ...)" standing for "any string not ending in
> `a', `bb', `ccc', etc.  This seems to be what Magnus wants, where the
> list "a, bb, ccc" is specified by the user and can vary depending on
> context.  This will be expensive no matter what, not just because of
> the backing up, but because of all the table lookups.

Several points:

- Do you mean `/' instead of `^'? (`^' meaning beginning-of-line can
  only occur at the start of the pattern, otherwise it means the `^'
  character literally).

- Supposing you mean `/', this does not stand for "any string not
  ending in `a', `bb', `ccc', etc.", but rather "any string followed
  by `a', `bb', `ccc', etc.". This is quite different, as in
  `fooabbccc', it would match `fooabb' (followed by `ccc'),
  according to the longest-matching rule.

- What Magnus wants is indeed the former, that's why I said there is
  no direct way in flex to do that.

  Magnus suggested to use standard regex functions to search for
  `a|bb|ccc', and return the characters before the match as the
  "plain text" token. This should work, but is not possible in flex,
  since flex always matches the patterns from the current position
  (whereas usual regex functions match them anywhere in the string).

  An alternative is to just `.' as the last pattern in flex. This
  will work as a fallback rule (due to the longest-match and
  first-match-of-equal-length rules in flex), with the disadvantage
  that it only gets single plain-text tokens (though it's possible
  to reassemble them before giving them to the parser).

  (BTW, "specified by the user" doesn't matter if the scanner is
  generated from that specification. As for "vary depending on
  context", I think I've just convinced him not to. ;-)

> > (Flex matches the regex and the context at once, concatenated, and
> > then has to find the start of the context, i.e. the end of the
> > actual pattern. If either length is constant, it can easily find it
> > by counting characters from the beginning or ending. Only if both
> > are variable-length, it has to do more complex matching.)
> 
> No matter what, it will still have to compare pairs of characters and
> if it has to back up, it will compare one or more characters from the
> input more than once.  Depending on how much pattern matching must be
> done, how complex the patterns are, and the length of the input, this
> could lead to a serious degradation of performance, and probably will,
> in my opinion.

Flex performance generally does not depend on the number or
complexity of the patterns (since DFAs are executed in linear time,
once constructed which happens when flex runs, not at program run
time). Only certain problematic features (namely REJECT and
variable-length trailing context) make matching more complex. The
other "performance considerations", AFAICS, only add a constant
overhead per token (such as scanning the buffer for newlines with
`%option yylineno' in some circumstances).

> > > I would also have thought
> > > that ambiguity in the rules would imply that there could be more than one 
> > > path
> > > through the rules for one or more inputs. If the result of parsing can be 
> > > said
> > > to be a "sentence" in the language defined by the syntactic rules and the
> > > semantic actions, then the choice of path through the rules would affect 
> > > both
> > > the syntax and the semantics of the "sentence".
> >
> > Especially the semantics.
> 
> Why "especially"?

That's what one wants to get from parsing. The syntax is used during
parsing.

Frank

-- 
Frank Heckenbach, address@hidden
http://fjf.gnu.de/
GnuPG and PGP keys: http://fjf.gnu.de/plan (7977168E)




reply via email to

[Prev in Thread] Current Thread [Next in Thread]