help-bison

Re: Non-greedy wildcard possible? (Long)


From: Frank Heckenbach
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Wed, 19 May 2004 04:10:57 +0200
User-agent: semail 20040101

Laurence Finston wrote:

> > Laurence Finston wrote:
> > 
> > > I've found that it's worthwhile to use Flex when the actions perform
> > > more complicated processing and when there's no need to perform the
> > > kind of parsing Bison does.  I also found that simply putting together
> > > tokens and returning them doesn't require the power of Flex.
> > 
> > Depends on what tokens are. If they are simply sequences of
> > characters from some class, flex might be overkill. If the length of
> > the sequence depends on the content (say, `=', `>' and `>=') or
> > tokens have a more complicated structure (say, floating point
> > notation), regexes can be handy. OTOH, there are cases (such as
> > Magnus' perhaps) where regexes are not even sufficient ...
>
> I agree that regexps can be handy. I just don't need them for the scanner and
> parser that I'm working on now.
> As far as Magnus' problem is concerned, I've tried to explain why I think his
> approach won't work.  In addition, the Flex manual states that the use of
> trailing context in the regexps will slow the scanner down considerably.

Only variable trailing context, which means that both the pattern
and the context have variable length.

> There
> is no way around this, it is implicit in the nature of regular expression
> matching.

Perhaps you got confused by the flex terminology. Trailing context
means context given with `/foo', and only variable-length trailing
context is dangerous. This is in no way implicit in regular
expression matching, and it can usually be avoided (I've always been
able to avoid it so far).

(Flex matches the pattern and the context at once, as one
concatenated regex, and then has to find the start of the context,
i.e. the end of the actual pattern. If either part has constant
length, it can easily find that point by counting characters from
the beginning or from the end. Only if both parts are
variable-length does it have to do more complex matching.)
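
To make the distinction concrete, here's a throwaway flex sketch
(the patterns are invented for illustration, they're not from
anyone's actual scanner):

%option noyywrap
%%
abc/de      ;  /* fine: the context "de" has fixed length              */
a+bc/de     ;  /* fine: only the context needs a fixed length here     */
abc/d+e     ;  /* fine: here it's the pattern "abc" that is fixed      */
a+bc/d+e    ;  /* both parts variable -- the expensive case from above */
.|\n        ;
%%
int main (void) { return yylex (); }

Only the fourth rule forces flex into the slow variable trailing
context machinery; the other three are handled by simple character
counting.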

> Actually, my scanner handles both multi-character operators and
> floating point notation. I don't take any credit for this, though, since I've
> simply used the method Knuth developed for Metafont. 

I don't doubt that it can be done. I've seen both ways, and I find
the regexes easier to read and write (and more compact) when it gets
more complicated. In addition, if tokens are similar, I can write
them separately in flex, while a manual scanner has to tell them
apart or treat them together partially. (Just like the flex
generated scanner does, but automatically and "invisibly".)
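
For instance (a minimal sketch with made-up token codes, nothing
from your actual scanner):

%option noyywrap
%{
#include <stdio.h>
/* Token codes invented for this example. */
enum { ASSIGN = 256, GT, GE, NUMBER };
%}
DIGIT  [0-9]
%%
">="                    { return GE;      /* longest match beats ">" */ }
">"                     { return GT; }
"="                     { return ASSIGN; }
{DIGIT}+("."{DIGIT}+)?  { return NUMBER;  /* integer or simple float */ }
[ \t\n]+                { /* skip whitespace */ }
.                       { return yytext[0];  /* anything else as-is */ }
%%
int main (void)
{
  int t;
  while ((t = yylex ()) != 0)
    printf ("token %d: \"%s\"\n", t, yytext);
  return 0;
}

The `>=' and `>' rules can simply be written next to each other;
flex's longest-match rule sorts them out, which is exactly the part
a hand-written scanner has to spell out itself.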

> If my input was `point p;' and my rule was
> `<type> <variable> <semi-colon>', the scanner returned
> `<type>' and `<variable>', but the semi-colon was lost.  If my input was
> `point p ;', then it wasn't.  The first input also worked if the rule
> was changed to `<type> <variable_with_semi-colon>'.  I fiddled quite a bit
> with the regexps and I couldn't get it to work.

Well, I don't know what's going on there, but normally this should
work with any reasonable version of (f)lex.

> > The grammar, and also the language, are still well-defined, but the
> > set of possible parsings for some inputs may be ambiguous -- which
> > matters, of course, for semantic values. In GLR, such cases can be
> > resolved using `dprec'. (But that's not always necessary when using
> > GLR. Sometimes GLR in effect just provides for larger look-ahead, so
> > grammars that are unambiguous, but not LALR(k) can be parsed.)
>
> Thank you for your explanation.  Would you please tell me what the 
> abbreviations "LR", "GLR", and "LALR" stand for?  I would have thought that it
> wouldn't make sense to speak of a "grammar" apart from a "set of parsing
> rules", i.e., that the terms would be equivalent.

For details, there's a lot of literature out there. Yes, a grammar
is a set of parsing rules, and context-free grammars (CFGs) have
certain restrictions on the form of the rules (basically all
"reasonable" grammars for computer languages are of this form). LR
("left-to-right, rightmost derivation"), LALR ("look-ahead LR") and
GLR ("generalized LR") are parsing algorithms. LR and LALR have
linear runtime, but can't parse all CFGs (in short, they have to
commit to their parsing decisions while looking ahead at most k
tokens -- LALR(k); bison implements LALR(1)). GLR
can parse all non-ambiguous CFGs (and provides some means to deal
with ambiguous grammars, see below), but at the cost of possibly
exponential time and space -- though in practical cases this
generally doesn't arise, and it's possible to tell this statically
from the grammar. (Basically, if the critical sections -- those
which can't be handled with k-lookahead -- don't nest, it's still
linear. If there aren't too many of them, and they don't overlap
heavily, it's linear with a reasonable constant.)
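
An artificial example of that "larger look-ahead" effect (the
grammar is invented purely for illustration):

%{
#include <stdio.h>
int yylex (void);
void yyerror (const char *);
%}

%glr-parser
%expect-rr 1

%%

/* Unambiguous, but not LALR(1): after reading 'c' the parser must
   decide whether to reduce it to a or to b, yet the token that
   settles the question ('p' or 'q') is still two tokens away.  The
   GLR parser simply keeps both stacks alive until one of them dies. */

s : a 'x' 'p'   { puts ("took the a branch"); }
  | b 'x' 'q'   { puts ("took the b branch"); }
  ;

a : 'c' ;
b : 'c' ;

%%

/* Feed the parser the fixed input  c x q , one character per token. */
int yylex (void)
{
  static const char *input = "cxq";
  return *input ? *input++ : 0;
}

void yyerror (const char *msg) { fprintf (stderr, "%s\n", msg); }

int main (void) { return yyparse (); }

The grammar has one reduce/reduce conflict (hence the
`%expect-rr 1'), but since it's unambiguous, exactly one of the two
stacks survives and no disambiguation is needed.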

> I would also have thought
> that ambiguity in the rules would imply that there could be more than one path
> through the rules for one or more inputs. If the result of parsing can be said
> to be a "sentence" in the language defined by the syntactic rules and the
> semantic actions, then the choice of path through the rules would affect both
> the syntax and the semantics of the "sentence".

Especially the semantics. In computer languages, ambiguity is bad,
of course. If nothing else helps, a "brute force" disambiguation
rule can be used, which GLR supports via `%dprec' (see the C++
example in the bison manual under GLR).
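
In miniature (a toy grammar I made up, far smaller than the manual's
C++ example):

%{
#include <stdio.h>
int yylex (void);
void yyerror (const char *);
%}

%glr-parser
%expect-rr 1
%token ID

%%

/* Deliberately ambiguous: a lone ID can be parsed either as a
   declaration or as an expression.  The GLR parser pursues both
   readings; when both survive to the end, `%dprec' makes the
   declaration reading win, and only its action is executed. */

stmt : decl   %dprec 2
     | expr   %dprec 1
     ;

decl : ID   { puts ("read it as a declaration"); } ;
expr : ID   { puts ("read it as an expression"); } ;

%%

/* Hand the parser a single ID token, then end of input. */
int yylex (void)
{
  static int done = 0;
  return done++ ? 0 : ID;
}

void yyerror (const char *msg) { fprintf (stderr, "%s\n", msg); }

int main (void) { return yyparse (); }

Without the `%dprec' declarations, the generated parser would report
the input as ambiguous at run time; with them, the declaration
reading is kept.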

OTOH, Earley parsers find all possible parses (AFAIK, this isn't
very useful for computer languages, but more for parsing of natural
languages which just happen to be ambiguous).

> However, I'm not familiar
> with the literature on the subject, so it's quite possible that this doesn't
> fit in with "informed opinion".  I haven't _had_ to read up on it, although
> I'm sure it would be interesting and worthwhile, because I find that Bison
> works very well (and has a good manual, too).

This book is quite nice: Dick Grune et al.: "Parsing Techniques, A
Practical Guide" <http://www.cs.vu.nl/pub/dick/PTAPG/>.

Frank

-- 
Frank Heckenbach, address@hidden
http://fjf.gnu.de/
GnuPG and PGP keys: http://fjf.gnu.de/plan (7977168E)



