help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Sun, 16 May 2004 00:49:49 +0200
User-agent: Mutt/1.4i

Laurence Finston <address@hidden>:
>
> If you don't signal "begin plain text" and "end plain text" somehow,
> e.g., with balanced single quotes, as in your example, it would be
> impossible for you to have plain text that contains tokens that
> would be valid in parser rules.

Sure -- I do signal it. I just mean that there is no fixed such token;
it all depends on context. (That doesn't mean that the grammar is/need
be context sensitive, though.)

For example, indentation might mean a code block (beginning of a code
block which can contain all tokens except a dedent) -- *except* if the
indented part begins with a dash, in which case it is a list item.

This sort of thing already works well in the existing implementation
of Atox, where I have hand-coded an LL(1)-like parser with specific
functionality for this behavior.

The reason I'm looking into Bison is that I'd like to speed up Atox. I
started by looking at Flex for a more efficient tokenizer, but using
Bison also seems like a good idea, if it's up for the task (as I would
have thought it would be).

> It would be possible to look up tokens in `yylex()' before returning
> them to `yyparse()'.  If they don't appear in a list of valid
> tokens, then you could output them verbatim until you reach one that
> does.

But again, this would depend on the parsing. The grammar (which will
decide what tokens can be used as markup in a given context) is
specified by the user, so I can't make any assumptions about it. (Atox
is, in many ways, a parser generator.)

> I use a similar technique (for a different purpose) in my parser
> with a C++ `map'.  You could, of course, do the same thing with C,
> with a bit more effort.  However, this alone will only work as long
> as the "plain text" doesn't contain any valid tokens, unless you
> signal that they are _not_ to be interpreted as such. 

Yes, that's the problem. I cannot say that globally. I need the
full power of the grammar to decide.

> Is there a particular reason why you've chosen to use Bison for this
> purpose?

Explained above. Some features I need to parse require a context-free
grammar (or some nasty hacking with lookahead in regular expressions
etc.) -- and many would at least be prettier with it (such as deciding
that a block is a heading because it ends with an underscoring line of
dashes).

> If the proportion of plain text to markup is high, I would consider
> writing a macro processor (like TeX) to handle it.

The point is that Atox uses user input to create the markup engine; so
I'd have to generate a custom macro engine for the user. Also, it
would only be able to handle a much simpler language (i.e. no nested
quoting etc., which is supported in the current version).

So, what I'm basically looking for is (as my original subject stated)
a non-greedy wildcard mechanism in Bison (or, for that matter, some
other parser generator -- Bison just seems readily available for the
user). It's parallel to regexp for matching strings, for example; you
can either use "[^"]*" or you can use ".*?". Not much of a difference
in this case, but if the terminating pattern were more complicated,
the non-greedy version would be the only valid one. For example,
[[.*?]].

The same thing goes for my problem. I'd like to keep shifting into the
wildcard until a sequence of tokens is found that will actially match
a production that can legally follow the wildcard.

The alternative would be the equivalent to the "[^"]" solution. It's
not fully satisfying, but I think it would be intuitive enough for the
user. The challenge then would be to find out which tokens can
terminate the wildcard in each context (either analyzing the grammar
myself, or somehow accessing the LR table).

If Bison isn't up to the task (or I'm not up to the task of hacking
it's internals enough ;), I guess I could write my own LL(1)-like
parser which could tackle it. (I have, as I said, already done that in
Python; doing it in C should only be a minor practical obstacle.)

But it would be really nice if Bison *could* do this, of course.

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]