Re: Non-greedy wildcard possible? (Long)


From: Frank Heckenbach
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Tue, 18 May 2004 01:20:07 +0200
User-agent: semail 20040101

Magnus Lie Hetland wrote:

> Just an example to show why a fixed set of markup tokens (that will
> end the plain text) won't do:
> 
>   Foo *bar 'baz *fee* fie' foe*.
> 
> Let's say the single quote represents verbatim text (code). Then the
> plain text between the two single quotes will contain two asterisks
> that are *not* to be interpreted as markup-tokens, while the asterisks
> outside should be interpreted as markup for emphasis.
> 
> This would be deducible from the grammar, because the code production
> rule wouldn't allow emphasis inside it. The lexer couldn't possibly
> know about this -- it would have to be handled as part of the parsing.

In this case, I'd just make the asterisk a separate,
context-independent token which is parsed as normal text inside
the verbatim rule and as markup otherwise. (This is still a CFG,
of course.)
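
A minimal sketch of that idea as a bison fragment (all token and
rule names here are made up, not taken from your grammar):

  /* Assume the lexer always returns STAR for `*' and QUOTE
     for `''. */
  text:     /* empty */
          | text WORD
          | text emphasis
          | text verbatim
          ;
  emphasis: STAR words STAR ;    /* here STAR acts as markup */
  verbatim: QUOTE vwords QUOTE ;
  vwords:   /* empty */
          | vwords WORD
          | vwords STAR          /* here STAR is plain text */
          ;
  words:    /* empty */
          | words WORD
          ;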

I'd generally be careful with context-dependent lexing. The main
trap is the lookahead: bison can request a new token before
executing the action concerning the previous tokens, and if that
action influences how the next token should be lexed, you have
a problem.
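
A hypothetical fragment that shows the trap:

  verbatim: QUOTE { verbatim_mode = 1; }   /* flip lexer mode */
            vwords
            QUOTE { verbatim_mode = 0; }
            ;

If bison has already fetched the token after the first QUOTE as
lookahead before it runs the mid-rule action, that token was lexed
while `verbatim_mode' was still 0.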

It gets worse when you use GLR: as long as the parser is split, it
doesn't execute any actions, but it does consume tokens. From the
POV of the lexer this is like arbitrary lookahead. So unless you
know exactly where your ambiguities are, how long the parser can
stay split, and that no lexer-critical actions can occur during
that time, you cannot use context-dependent lexing at all.

BTW, been there, done that. In my case, I finally had to create a
few "mixed" tokens -- say `ab' or `bc' could be tokens, depending
on context; now my lexer returns the three tokens `a', `b', `c'
when it sees `abc', and the parser has rules like `ab: TOKEN_A
TOKEN_B'. This is a bit simplified, since I also have to make sure
that there's no whitespace between `a' and `b'. But on the whole
it works out nicely: LALR can handle the new rules (i.e., they
don't introduce new ambiguities for GLR to care about), and there
are fewer headaches in the lexer/parser interaction (which was
quite hairy before, when using LALR and context-dependent tokens).
The only disadvantage is that the mixed tokens look a bit strange
and have no clearly defined meaning until they are parsed, but
that's just an aesthetic issue. My case was especially difficult
because the context-dependent tokens could overlap; if you have
only single characters or so, it will be easier.
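
In grammar form, the trick looks roughly like this (again
simplified and with invented names; in reality the two contexts
are distinguished earlier in the grammar, and the absence of
whitespace between the pieces has to be checked in the actions):

  ab: TOKEN_A TOKEN_B ;        /* `ab' where context wants it */
  bc: TOKEN_B TOKEN_C ;        /* `bc' where context wants it */

  foo: KEY_FOO ab TOKEN_C ;    /* `abc' read as `ab' + `c' */
  bar: KEY_BAR TOKEN_A bc ;    /* `abc' read as `a' + `bc' */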

> Sure. I don't see how this conflicts with my goal? Just take a simple
> example -- let's say I want to parse a string that ends with an
> exclamation point. I might get an input like this:
> 
>   Foo! Bar! Baz!
> 
> I'd like to parse this into a hierarchical structure which would let
> me (at the topmost level) get the "Foo! Bar! Baz" and "!" parts
> separately. This can certainly be described by a context-free grammar
> (although not LL(1)) but not by a regular language (if the string up
> to the exclamation point is arbitrary). So in order to handle this
> sort of thing, I'd need something stronger than a scanner.

As you state it, it can be described by a regex: simply `(.*)(!)' to
get the two parts in the parentheses, or in flex: `.*/!' to get the
first part up to the last `!'. Greediness is your friend here. But I
suppose the "string" actually consists of other tokens, possibly
with a parsed structure already.
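
As a flex rule, that could look like this (TEXT and BANG are
invented token names, and yylval.str assumes a suitable %union):

  .*/!    { yylval.str = strdup (yytext); return TEXT; }
  "!"     { return BANG; }

The trailing context `!' is fixed-length and therefore cheap in
flex, and since `.*' is greedy, TEXT covers everything up to the
last `!' on the line.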

> > From what I know of markup languages, there's no need for this kind
> > of hierarchical structure.
> 
> I've tried to explain why I need this. I may have rules that say, for
> example, that the first single-line paragraph of the document is the
> title, but only if there is no multi-line paragraph before it, for
> example. (There are many, many other similar examples.)

Apart from the fact that such a rule is quite fragile (if someone
deletes a few words from the first multi-line paragraph, it may
become one-line and surprisingly turn into a title), this one can be
solved with a CFG. It can also be solved with a global state
variable (paragraph_seen or similar). So that's no decisive argument
either way.
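
The global-state version could be as simple as this sketch (all
names invented):

  paragraph: ONE_LINE_PARA
               { if (!seen_multi_line && !have_title)
                   { make_title ($1); have_title = 1; }
                 else
                   add_paragraph ($1); }
           | MULTI_LINE_PARA
               { seen_multi_line = 1; add_paragraph ($1); }
           ;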

> [...]
>
> But, as I've tried to state (repeatedly), this won't work. It can't
> work. For example, a headline may not have a sub-production allowing
> for emphasis, so if I find an asterisk in it, that must actually be an
> asterisk, and not an emphasis marker (as it would be in a normal
> paragraph). This could not possibly be determined by a simple scanner,
> because the headline-status of a piece of the document depends on a
> full parse.

OK, here we have a more complicated example. This may be helpful to
get to the root of the problem. IIUIC, this is a line with emphasis
and a second (strange) line:

Foo *bar*
---+-----

And this is a heading without emphasis and two asterisks in the
text:

Foo *bar*
---------

If the heading and the normal line require substantially different
parsing (as the emphasis already does -- if it's only this, it
could be kludged around, but I suppose there's more and you really
need to parse it differently), I see two ways of doing it: 1. with
look-ahead in the lexer (quite hairy, similar to context-dependent
lexing, not recommended), 2. using GLR parsing. If you make sure
that the language is actually unambiguous (e.g., your lexer gives
a special token for `---' lines which is only parsed in headings),
then you might not need %dprec etc. The parser will fork; one
branch will run into a parse error sooner or later (because it
will not accept the `---' token while the other branch requires
it), and the surviving branch will continue. This is the "most
harmless" way of using GLR.
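
A sketch of that setup (rule names invented; DASHES is the special
token the lexer returns for a pure `---' line):

  %glr-parser
  %%
  block:   heading
         | line
         ;
  heading: heading_text NEWLINE DASHES NEWLINE ;
  line:    line_text NEWLINE ;
  /* heading_text treats `*' as plain text, line_text parses it
     as emphasis.  While the parser is split, it parses the same
     words both ways; the branch that doesn't match the following
     line (DASHES or not) dies with a parse error. */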

> Hans Aberg <address@hidden>:
> >
> > Normally, this is done by the lexer, sending back, say, a
> > token "text". Why are you not doing it that way?
> 
> Because what tokens are significant (as opposed to "plain text")
> is described by a user-specified context-free grammar.

Then, of course, the scanner (i.e., the flex input, if you use flex)
must be generated from the user's grammar as well, along with the
bison input.

> OK. Maybe so. But finding the first occurrence of a disjunction of
> strings (or patterns) can be done in linear time; once you have the
> first occurrence, you also know the length of the span that *didn't*
> match any of the patterns.

This might be a viable solution. AFAIK, flex doesn't support it, so
this would be a good reason for using another scanner, maybe
handwritten (using standard greedy regex functions and doing what
you described). Bison doesn't mind where it gets its tokens from.
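
The core of such a handwritten scanner could look like this C
sketch (for a handful of short markers repeated strstr is fine; a
strictly linear-time scan over many patterns would use something
like Aho-Corasick instead):

  #include <string.h>

  /* Return a pointer to the earliest occurrence of any of the n
     marker strings in s, or NULL if none occurs; *which gets the
     index of the matching marker.  Everything before the
     returned position is the "plain text" span. */
  static const char *
  find_first_marker (const char *s, const char *const *markers,
                     int n, int *which)
  {
    const char *best = NULL;
    int i;
    for (i = 0; i < n; i++)
      {
        const char *p = strstr (s, markers[i]);
        if (p && (best == NULL || p < best))
          {
            best = p;
            *which = i;
          }
      }
    return best;
  }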

This is especially practicable if your non-greedy wildcards are
actually `.*'. If not, you might have to do an additional regex
match on that span; the alternative would be to construct the NFA
and DFA for the negated disjunction intersected with the regex.
Certainly possible, but AFAIK not covered by any standard regex
code, so you'd have to do it all yourself.

Frank

-- 
Frank Heckenbach, address@hidden
http://fjf.gnu.de/
GnuPG and PGP keys: http://fjf.gnu.de/plan (7977168E)



