help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Mon, 17 May 2004 04:30:10 +0200
User-agent: Mutt/1.4i

Laurence Finston <address@hidden>:
>
> >
> > Sure, that's not a problem. What it means semantically isn't
> > fixed, though. (Just think of binary vs. unary minus, for
> > example.)
> 
> Binary vs. unary plus and minus are context-dependencies.  In order
> to make them work in Bison, it's necessary to use the `%prec'
> feature, as described in the manual.

Yes, I know.

[snip]
> I suspect that a rule for "anything else that doesn't fit into the
> grammar" is very likely to cause lots of conflicts.

Certainly. My hope was to solve that with some form of dynamic
precedence -- but I can't find any reasonable way of doing that with
Bison.

[snip]
> For what it's worth, I finally abandoned the use of Flex and have
> never regretted it for a moment.

I need to find occurrences of regular expressions in the text (or some
similar specification of patterns in the text, that can be supplied by
the user). What do you use? Another regexp library? Or do you use
grammar rules on single characters?

[snip]
> I've found that it doesn't really pay to use Flex when one is using
> Bison, since in this case, the scanner mostly just returns tokens to
> yyparse().  

Sure -- but you still need to scan, right? How do you do that? (Am I
missing some scanning feature in Bison? Are you suggesting writing
context-free grammars for the tokens?)

[snip]
> If you're using Bison in this way,  then it seems to me that you
> would need two scanners:  one for the input used to generate the
> Bison input file and one for the input to the final parser.

The first one isn't really relevant to the discussion at hand, but in
case you're interested, I'm using an XML language for it.

> I think writing a Bison grammar for generating another Bison
> grammar, with the high degree of generality you seem to want  would
> be an extremely complex task.  

As I have already pointed out, I have already implemented it all. It
works. I just want to port it from my hand-coded parser generator to
Bison, for the sake of speed.

The code for detecting possible lookahead tokens seems to be the last
piece in the puzzle -- if it works, I don't think there are any
stumbling blocks left, really. (Famous last words, I know ;)

[snip]
> An ambiguous context-free grammar is invalid.

I've never heard any definition of context-free grammars that preclude
ambiguity. In fact, one of the main features of general parsers that
can parse any context-free grammars (such as the Earley parser) is
that they'll find all possible parses of such ambiguous grammars.

Or am I wrong?

[snip]
> The manual goes into this in detail.  

Sure, Bison can't deal with ambiguity (except using defaults, or,
in a local way, using GLR). My hope was that I could deal with the
ambiguity by some dynamic precedence that would tell it that anything
trumped (halted) a wildcard. I am quite confident that it ought to be
possible; it just doesn't seem to be easily achievable with Bison.

[snip]
> I'm not familiar with Python, so I don't know how it's non-greedy
> closure operator works.

It's not particular to Python. It's used in Perl too -- and I would
suspect many other places too. It's very easy; for example, the
pattern .*?y (a non-greedy repetition of a wildcard followed by a
'y') would match the first "xy" in the string "xyxy", while the greedy
version, .*y, would match the entire string.

Very simple, efficient, and useful. I'm basically looking for an
equivalent in context-free parsing.

> It seems to me that trying to match a string that doesn't contain
> one of a number of substrings listed in a table  would require many
> lookups in the table;

Not at all. Just search for the disjunction of the strings (as a
regular expression -- this can be done efficiently). The position
before the match will be the last index of the unmatched
(wildcard/plain text) region.

> at worst, one lookup for every character scanned.

That would require a very inefficient implementation indeed.

> I think it might be easier to scan the entire output once and insert
> markers at appropriate places rather than to try to handle this
> problem on the fly. 

But, as I've tried to state (repeatedly), this won't work. It can't
work. For example, a headline may not have a sub-production allowing
for emphasis, so if I find an asterisk in it, that must actually be an
asterisk, and not an emphasis marker (as it would be in a normal
paragraph). This could not possibly be determined by a simple scanner,
because the headline-status of a piece of the document depends on a
full parse.

This is the crux of the discussion; if it had been as simple as you
suggest (simply scanning through the document, looking for proper
tokens) I'd have the problem solved long ago, and wouldn't have to ask
here :)

(It may be, however, that I'm misunderstanding your suggestion, of
course.)

[snip]
> What's the difference between the last exclamation point and the first two? 

The fact that the last one comes last -- and that is the point.

> If there is one, it must be specified in the rules.  If not, then it
> won't be possible for the scanner to break this string into the two
> tokens "Foo! Bar! Baz" and "!".

And it certainly shouldn't -- which has been my point all along. As
long as the properties involved are context-free, the scanner won't be
*able* to. (This wasn't the best example of that, though :)

Thus, it would be up to the parser to join the various "Foo" and "!"
tokens of the first part into a single production.

[snip]
> I think markup is by definition explicit.

Well, feel free to give what I'm doing another name, then. I would
have thought the term "implicit markup" would be clear enough. The
point is that I'm trying to elicit document structure from plain-text
documents and add explicit markup (in the form of XML).

(Or -- I'm not just trying; I'm already doing it. I'm just trying to
get Bison on my team ;)

> I also think that it might be easier just to mark up a document
> rather than try to define rules for parsing a sparsely marked-up
> file.

It would certainly be easier on the parser. I, however, would like to
make the job easier for the author.

I've been using markup formats like LaTeX and XML (and other ad hoc
formats) quite a lot. They're fine, but there is (IMO) a need for
simpler, sparser, more plain-text-like markup. Many have argued for
this, and many specific implementations exist. My idea was to create a
generic engine for this sort of thing, instead of simply inventing my
own format, like everyone else.

(More about this, and references to lots of specific formats, in the
docs at atox.sf.net.)

> However, I wish you success all the same.

Thanks :)

> I'm sorry, but I inadvertantly deleted the last email you sent.
> There was something I wanted to respond to, but I can't remember
> what it was now.  Would you please sent it to me again? 

Sure.

> Thanks. (Please don't repost it to help-bison, though!) 
> 
> Laurence

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]