help-bison

Re: Non-greedy wildcard possible? (Long)


From: Frank Heckenbach
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Wed, 19 May 2004 04:35:09 +0200
User-agent: semail 20040101

Magnus Lie Hetland wrote:

> Frank Heckenbach <address@hidden>:
> >
> [snip]
> > > This has to be inferred from the fact that the only legal token at
> > > that point would be a closing quote -- everything else should be
> > > shifted into the verbatim/plain text production.
> > 
> > Sure. You'll have to do it some way. Set-of-all-tokens -
> > set-of-terminating-tokens-for-this-rule -
> > set-of-tokens-starting-markup-allowed-here =
> > set-of-tokens-to-be-treated-as-plain-text-here. Should be rather
> > automatic.
> 
> Hm. I suppose so. I was worried that some rules might give me trouble;
> for example, if a rule ends with a wildcard, finding out which
> terminating token is in effect would require knowledge about the parse
> state which cannot be inferred statically (at least as far as I can
> see), because the terminating token belongs to another production.
> 
> But as long as I stay away from rules that end with wildcards, I guess
> the analysis is quite straightforward. (Are there other pitfalls?) And
> this was, in fact, the approach I considered earlier.
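The set arithmetic quoted above can be sketched directly with Python sets. The token names here are purely illustrative, not taken from Magnus's actual grammar:

```python
# Hypothetical token sets at one parse position; names are illustrative.
all_tokens = {"WORD", "DASH", "STAR", "QUOTE", "NEWLINE"}
terminators_for_rule = {"QUOTE"}         # tokens that end the current rule
markup_starters_here = {"STAR", "DASH"}  # tokens that may open markup here

# Everything else is treated as plain text at this point.
plain_text_tokens = all_tokens - terminators_for_rule - markup_starters_here
print(sorted(plain_text_tokens))  # -> ['NEWLINE', 'WORD']
```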

Of course, in a CFG, you could even define rule endings in CFG
terms (say, markup is ended by a correctly formed arithmetic
expression, while a partial expression is considered as plain text
-- which is probably completely crazy ;-). In this case I suppose
dprec is your only hope (parse an expression and a sequence of
tokens in parallel, give the expression dynamic precedence, so the
sequence of tokens applies only if the expression fails to parse),
but if it can be avoided, I'd rather stay away from it. (I don't
think such languages would be very intuitive to use either ...)
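The parallel-parse idea could look roughly like this as a bison fragment (rule and token names are made up; %dprec requires %glr-parser):

```yacc
%glr-parser

/* Hypothetical rules: under GLR both alternatives are parsed in
   parallel.  When both survive, the higher %dprec wins, so a
   correctly formed expression beats the plain-token fallback. */
markup:
    MARKUP_START expression MARKUP_END   %dprec 2
  | MARKUP_START token_sequence          %dprec 1
  ;
```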

For wildcards (i.e., regexps AIUI), you could at least in theory
construct the negated wildcard (since regexps are closed under
complement, unlike CFGs) for the non-terminator, in order to parse
unambiguously (i.e., without dprec), but I'm not sure you'd want to
do that -- the regex -> NFA -> DFA -> complement -> regex round
trip can blow up exponentially in the determinization step --
unless they're simple enough ...
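The complement construction itself is standard: determinize, then flip the accepting states of a complete DFA. A minimal sketch over a toy two-letter alphabet (this DFA is an invented example, not anything from Magnus's grammar):

```python
# Complement a complete DFA by flipping its accepting states.
# Toy DFA over {'a', 'b'} accepting strings that end in 'b'.
states = {0, 1}
accepting = {1}
delta = {(0, 'a'): 0, (0, 'b'): 1,
         (1, 'a'): 0, (1, 'b'): 1}  # total, so complementation is safe

def accepts(accepting_states, s, start=0):
    state = start
    for ch in s:
        state = delta[(state, ch)]
    return state in accepting_states

complement_accepting = states - accepting

# Every string is accepted by exactly one of the two machines.
for word in ("", "ab", "ba", "abb"):
    assert accepts(accepting, word) != accepts(complement_accepting, word)
```

Note that the transition function must be total (add a sink state if it isn't) before flipping accepting states, otherwise the complement is wrong.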

> > > > This example was especially difficult because context-dependent
> > > > tokens could overlap. If you have only single characters or so, it
> > > > will be easier.
> > > 
> > > I don't, sadly. I have things like lines of dashes underlining
> > > headers, as well as emdashes written "--", in the samples. In general,
> > > the user may specify anything.
> > 
> > Also in the same grammar? So it could be either a header or a word
> > and an emdash?
> 
> That would be up to the user -- who would quite likely also have to
> specify some way of breaking the ambiguity. My current parser is an
> LL(1)-like parser that tries things top-to-bottom, making things
> defined earlier in the grammar trump those defined later.

With some luck you can translate this into dprec rules, but since
this would be a real ambiguity, this is the best you can expect.

> Just a couple of thoughts at the end:
> 
>   1. Adding some domain-specific assumptions might make the system
>      easier to use, and more efficient. For example, I could use a
>      two-tiered parsing strategy, one working on the level of blocks
>      (paragraphs, headings, list items etc.) separated by "vertical
>      whitespace" (with the kinds of blocks being defined by various
>      properties such as "starts with a numeric label" or "ends with a
>      line of dashes"), and the other working with inlines inside the
>      blocks, with different grammars for each block type.

If this works, fine. If, however, whether a block is a heading (and
not, say, a list item) might depend on the *following* block, and
headings are parsed differently, this decomposition might not work.
(I don't know if the grammars really get *that* strange, but from
your examples so far, they might ;-)
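The first tier of the decomposition from point 1 can be sketched as a split on vertical whitespace followed by surface classification. The block properties and the sample text are invented for illustration:

```python
import re

# Tier 1 (sketch): split the document into blocks on "vertical
# whitespace", then classify each block by surface properties
# before handing it to a per-block-type inline grammar.
text = "Heading\n-------\n\n1. first item\n\nplain paragraph"

blocks = re.split(r"\n\s*\n", text)

def classify(block):
    lines = block.splitlines()
    if len(lines) > 1 and set(lines[-1]) == {"-"}:
        return "heading"      # ends with a line of dashes
    if re.match(r"\d+\.\s", lines[0]):
        return "list_item"    # starts with a numeric label
    return "paragraph"

print([classify(b) for b in blocks])  # -> ['heading', 'list_item', 'paragraph']
```

As Frank notes below the quote, this only works if a block's type never depends on what follows it.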

>   2. I just discovered the GLR* algorithm by Lavie and Tomita -- a
>      generalization of GLR that allows the parser to skip unparsable
>      parts of the input, and return the parse with the least number of
>      skipped tokens. This is *dangerously* close to mind-reading, and
>      might work in my case -- at least, it's very cool ;)

I don't know about this one. Your description looks *a bit* like
bison's error recovery. If you can get at the semantic values of the
skipped tokens, it may help, but I don't really know ...
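For comparison, bison's error recovery discards tokens after the special `error` token until the parser can resynchronize on a following token; the skipped tokens' semantic values are lost, which may be the limitation alluded to above. A generic textbook-style fragment:

```yacc
/* On a syntax error inside a statement, discard tokens until the
   next ';' and resume; yyerrok re-enables error reporting. */
stmt:
    expr ';'
  | error ';'   { yyerrok; }
  ;
```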

Frank

-- 
Frank Heckenbach, address@hidden
http://fjf.gnu.de/
GnuPG and PGP keys: http://fjf.gnu.de/plan (7977168E)



