help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Tue, 18 May 2004 23:01:01 +0200
User-agent: Mutt/1.4i

Hans Aberg <address@hidden>:
>
[snip]
> What is the Atox?

It's what all this is all about :)

It's a tool for writing plain-text-to-XML converters. See
http://atox.sf.net for more information.

> A dynamic language specification? Then perhaps you might make use of
> a dynamic parser, such as an Early parser.

I've used an Earley parser in an earlier (no pun intended) version.
That was a bit slow (probably in part because of my overly redundant
grammar at the time, and in part because it was implemented in
Python). I haven't been able to find an Earley parser generator that
is as easily available as Bison, and since Bison now has GLR parsing
(which can handle the same set of grammars as Earley, IIUC), I thought
it to be a better alternative.

What do you think would be the advantages of an Earley parser as
opposed to an GLR parser like Bison?

> Here is a reference;
[snip]

Thanks for the links.

> 
> >... into something like:
> >
> ><doc>
> >  <h1>Lorem ipsum 2 * 2 = 4 dolor sit amet</h1>
> 
> I am not sure how * is translated into 2 * 2 = 4; typo?

Eh -- yes. The idea was to have "2 * 2 = 4" in both; I started out
with an asterisk in both, but thought I'd expand on it to make it less
far-fetched. Not very successful there, I guess ;)

> Some inputs, though:
> 
> It seems that you identify blocks by newlines.

In this example, yes -- can't assume that in general, though. It would
be up to the user. (There are many cases where it would not be the
case.) This was just an example to show how (1) full parsing is
needed, and (2) which tokens are significant (as opposed to "plain
text") depends on the state of the full (e.g. context-free) parser.

> You might decide the lexer
> to recognize that, so that \n and \n\n+ are different tokens.

Sure. But that won't fix the general problem.

> Then, headers and such may require special lexer start conditions (=
> lexer contexts), just in order to get around the LALR(1) limitation
> of Bison.

Not sure exactly what you mean here -- LALR(1) is not a problem, as
far as I can see (especially since it is no longer a limitation, if
one uses GLR). The issue lies in skipping the non-significant tokens.

[snip]
> to be translated into a list, and not headers, because you know that
> headers are already exhausted as possibility, then a way to resolve
> that is to set special context variables telling which header
> nesting there is.

As I said, that's not the problem. This can easily be dealt with by
either GLR.

[snip]
> Conditions such as a list must have at least two elements can be
> hard to catch, or make practical use of, because it may require
> quite some lookahead for it to see it.

Sure -- that's what GLR is for, right? (Or backtracking, in my
current relatively simple LL(1)-like Atox parser.)

> If you want to make sure that the list numbering is correct, that
> will be checked semantically, in the parser actions.

That's not important to me at the moment. The only issue I'm trying to
resolve is discerning significant tokens from plain text (and this
"significance" depending on the parse state). I thought I would do
this by checking for legal lookahead tokens and passing those to Flex,
but if Bison can (as it has been said) ask for more than a single
lookahead token before the parse state changes, this list of legal
tokens may be invalid for those later tokens, and this won't work.

So I still don't really have a solution.

Thanks for the input,

- M

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]