help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Sun, 16 May 2004 10:30:43 +0200
User-agent: Mutt/1.4i

Laurence Finston <address@hidden>:
>
[snip]
> > Sure -- I do signal it. I just mean that there is no fixed such token;
> > it all depends on context. (That doesn't mean that the grammar is/need
> > be context sensitive, though.)
> 
> This seems to be a contradiction to me. In a context-free grammar a
> token, i.e., a terminal symbol, is unique and unambiguous.

Sure, that's not a problem. What it means semantically isn't fixed,
though. (Just think of binary vs. unary minus, for example.)

The tokenization isn't really my problem.

> The Bison manual describes a way of handling context dependencies,
> but it's fairly tricky. 

Yes, I've seen that (or so I think -- depends on what you mean). There
isn't really any context dependencies here in that way; the wildcard
rule can (at least in the simplest version) be generated as a normal
context-free production. It's just that I have to do some extra
analysis to do it.

> From what you've said about the problem you're trying to solve, I
> have my doubts about whether Bison is the right tool for the job.

Maybe you're right. I just thought that if I could do it with an LL(1)
parser with a little extra functionality (as I've already done), it
should be certainly be possible to use an LALR parser. I guess I could
just use Flex and write my own parser engine on top of it.

> > decide what tokens can be used as markup in a given context) is
> > specified by the user, so I can't make any assumptions about it.
> > (Atox is, in many ways, a parser generator.)
> 
> It sounds to me like you want Bison to generate a parser generator,
> but that's not its intended purpose.  

Bison *is* a parser generator -- I'm trying to use it as a back-end.
In other words, I read an input format, do some processing, and feed
code to Flex and Bison in order to produce a parser for the user.

How this is outside the scope of Bison, I don't see.

The specifics of the parsing job may be unsuitable, though, of course.

> > Yes, that's the problem. I cannot say that globally. I need the
> > full power of the grammar to decide.
> 
> I don't understand how you're using the term "grammar".   To me,
> "grammar" (at least of the context-free variety) implies that there
> are syntactical elements with a fixed value. 

Grammars don't concern themselves with value (of the semantical kind),
do they? The documents I'm trying to parse can certainly be described
with context-free grammars (I've tried to explain how, but I guess I
haven't been quite clear). The problem is either (1) analyzing the
input-grammar in order to produce a wildcard that doesn't contain any
relevant terminating tokens, or (2) making Bison handle an ambiguous
grammar.

[snip]
> According to my understanding of regular expression matching
> algorithms, they are by nature "greedy" in that they attempt to
> match the longest possible string. 

By "nature" is perhaps a bit strong. Most are by "default", yes. In
the current Atox, I'm ussing the non-greedy closure operator in Python
quite a lot.

> I believe that a "wildcard" facility makes sense in a scanner, but
> not in a parser.

That doesn't help me much, my language (and the wildcard) isn't
regular -- it's context-free at best.

> Parsing tries to create a hierarchical structure from input; if
> elements are read that don't fit into the structure, and error
> recovery doesn't discard them or make them fit, it fails.

Sure. I don't see how this conflicts with my goal? Just take a simple
example -- let's say I want to parse a string that ends with an
exclamation point. I might get an input like this:

  Foo! Bar! Baz!

I'd like to parse this into a hierarchical structure which would let
me (at the topmost level) get the "Foo! Bar! Baz" and "!" parts
separately. This can certainly be described by a context-free grammar
(although not LL(1)) but not by a regular language (if the string up
to the exclamation point is arbitrary). So in order to handle this
sort of thing, I'd need something stronger than a scanner.

> It's possible to change the way  the scanner handles input at
> run-time, and to have it return different tokens depending on
> context, but it's not possible to change the meanings of tokens
> within the parser.

I don't need to.

> From what I know of markup languages, there's no need for this kind
> of hierarchical structure.

I've tried to explain why I need this. I may have rules that say, for
example, that the first single-line paragraph of the document is the
title, but only if there is no multi-line paragraph before it, for
example. (There are many, many other similar examples.)

> While the resulting documents may be structured hierarchically, the
> input can be processed sequentially.   

If you have a very simple markup language, yes. I'm trying to add
markup to a plain text file with mostly *implicit* markup, which is
why I need some more complex rules. I've simpler versions, but they
haven't been satisfactory.

> Laurence

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]