
Re: Non-greedy wildcard possible? (Long)


From: Laurence Finston
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Sun, 16 May 2004 23:29:18 +0200
User-agent: IMHO/0.98.3+G (Webmail for Roxen)

>
> Sure, that's not a problem. What it means semantically isn't fixed,
> though. (Just think of binary vs. unary minus, for example.)

Binary vs. unary plus and minus are context dependencies.  In order to make
them work in Bison, it's necessary to use the `%prec' feature, as described in
the manual.
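
For a unary minus, for example, the usual approach looks roughly like the
calculator grammar in the manual; this is only a sketch, and NEG is just a
dummy token name:

    %left '-' '+'
    %left '*' '/'
    %right NEG                  /* dummy token for unary minus */
    %%
    exp:  NUM
        | exp '+' exp           { $$ = $1 + $3; }
        | exp '-' exp           { $$ = $1 - $3; }
        | exp '*' exp           { $$ = $1 * $3; }
        | exp '/' exp           { $$ = $1 / $3; }
        | '-' exp  %prec NEG    { $$ = -$2; }
        | '(' exp ')'           { $$ = $2; }
        ;

The `%prec NEG' on the unary rule tells Bison to give that rule NEG's (high)
precedence instead of the precedence of `-'.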

> 
> The tokenization isn't really my problem.
> 
> > The Bison manual describes a way of handling context dependencies,
> > but it's fairly tricky. 
> 
> Yes, I've seen that (or so I think -- depends on what you mean). There
> isn't really any context dependencies here in that way; the wildcard
> rule can (at least in the simplest version) be generated as a normal
> context-free production. It's just that I have to do some extra
> analysis to do it.
> 

I suspect that a rule for "anything else that doesn't fit into the grammar" is
very likely to cause lots of conflicts.  But maybe I'm wrong.
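
Purely as a sketch of what I mean (the token and rule names are invented): if
the catch-all accepts the same tokens that the ordinary constructs begin with,
Bison will typically report shift/reduce or reduce/reduce conflicts wherever
both are possible:

    element:   heading
             | paragraph
             | fallback
             ;
    fallback:  ANYTHING
             | fallback ANYTHING
             ;
    /* If ANYTHING overlaps with the tokens that can begin a heading
       or a paragraph, each such token gives rise to a conflict.  */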

> > From what you've said about the problem you're trying to solve, I
> > have my doubts about whether Bison is the right tool for the job.
> 
> Maybe you're right. I just thought that if I could do it with an LL(1)
> parser with a little extra functionality (as I've already done), it
> should certainly be possible to use an LALR parser. I guess I could
> just use Flex and write my own parser engine on top of it.
> 

For what it's worth, I finally abandoned the use of Flex and have never
regretted it for a moment.  It swallowed the look-ahead token when I used the
C++ scanner classes. It might have been possible to find out why, but it
didn't seem worth it for reasons I won't go into here. 

I've found that it doesn't really pay to use Flex when one is using Bison,
since in this case, the scanner mostly just returns tokens to yyparse().  
In addition, if I had kept using Flex,  I believe that it would have been more
difficult to implement some of the features that I've now implemented in my
own version of yylex().
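
In case it's useful: the skeleton of a hand-written yylex() is not much code.
Roughly (this is only a sketch; NUM and yylval are assumed to come from the
Bison-generated header, e.g. produced with `%defines'):

    #include <ctype.h>
    #include <stdio.h>
    #include "parser.tab.h"   /* token definitions and yylval from Bison */

    int
    yylex (void)
    {
      int c;

      /* Skip blanks and tabs.  */
      while ((c = getchar ()) == ' ' || c == '\t')
        ;

      if (c == EOF)
        return 0;               /* zero tells yyparse() the input is done */

      /* A number: push the character back and let scanf read all of it.  */
      if (isdigit (c))
        {
          ungetc (c, stdin);
          if (scanf ("%d", &yylval) != 1)
            return 0;
          return NUM;
        }

      /* Any other character is returned as its own token.  */
      return c;
    }

This is essentially the scanner from the manual's calculator examples; a real
one has to deal with identifiers, keywords, strings and so on.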

> > > decide what tokens can be used as markup in a given context) is
> > > specified by the user, so I can't make any assumptions about it.
> > > (Atox is, in many ways, a parser generator.)
> > 
> > It sounds to me like you want Bison to generate a parser generator,
> > but that's not its intended purpose.  
> 
> Bison *is* a parser generator -- I'm trying to use it as a back-end.
> In other words, I read an input format, do some processing, and feed
> code to Flex and Bison in order to produce a parser for the user.

If you're using Bison in this way, then it seems to me that you would need
two scanners: one for the input used to generate the Bison input file and one
for the input to the final parser.  I think writing a Bison grammar for
generating another Bison grammar, with the high degree of generality you seem
to want, would be an extremely complex task.

> 
> How this is outside the scope of Bison, I don't see.
> 
> The specifics of the parsing job may be unsuitable, though, of course.
> 
> > > Yes, that's the problem. I cannot say that globally. I need the
> > > full power of the grammar to decide.
> > 
> > I don't understand how you're using the term "grammar".   To me,
> > "grammar" (at least of the context-free variety) implies that there
> > are syntactical elements with a fixed value. 
> 
> Grammars don't concern themselves with value (of the semantical kind),
> do they? 

I meant syntactical value, like parts of speech in a natural-language grammar.

> The documents I'm trying to parse can certainly be described
> with context-free grammars (I've tried to explain how, but I guess I
> haven't been quite clear). The problem is either (1) analyzing the
> input-grammar in order to produce a wildcard that doesn't contain any
> relevant terminating tokens, or (2) making Bison handle an ambiguous
> grammar.
> 

Bison can't handle an ambiguous context-free grammar as such; the ambiguities
must be resolved somehow.  Bison does this by always shifting in the case of a
shift/reduce conflict and by reducing by the rule that appears first in the
grammar in the case of a reduce/reduce conflict.  It's neater, of course, if
there are no conflicts, but this is difficult to achieve in practice (as I
know from experience).
The manual goes into this in detail.
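
The standard illustration of the shift/reduce case is the dangling else;
schematically (the token and rule names here are only illustrative):

    stmt:  IF expr THEN stmt
         | IF expr THEN stmt ELSE stmt
         | OTHER
         ;
    expr:  IDENT
         ;

    /* After "IF expr THEN stmt" with ELSE as the look-ahead, the parser
       can either shift the ELSE or reduce the inner statement.  Bison
       shifts by default, which attaches the ELSE to the innermost IF;
       the warning can be silenced with %expect if that's intended.  */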

> [snip]
> > According to my understanding of regular expression matching
> > algorithms, they are by nature "greedy" in that they attempt to
> > match the longest possible string. 
> 
> By "nature" is perhaps a bit strong. Most are by "default", yes. In
> the current Atox, I'm using the non-greedy closure operator in Python
> quite a lot.
> 

I'm not familiar with Python, so I don't know how its non-greedy closure
operator works.  It seems to me that trying to match a string that doesn't
contain one of a number of substrings listed in a table would require many
lookups in the table; at worst, one lookup for every character scanned.

I think it might be easier to scan the entire input once and insert markers
at appropriate places rather than to try to handle this problem on the fly.
However, if users can interactively change the meanings of items of input,
i.e., redefine what tokens are returned for a given item of input, the first
scanning pass would have to account for these changes, too.  In that case, I
think it might not be worth it, because of the added complexity.  Of course,
inserting markers implies copying the files, so there would be a potentially
high cost in memory.

> > I believe that a "wildcard" facility makes sense in a scanner, but
> > not in a parser.
> 
> That doesn't help me much, my language (and the wildcard) isn't
> regular -- it's context-free at best.
> 
> > Parsing tries to create a hierarchical structure from input; if
> > elements are read that don't fit into the structure, and error
> > recovery doesn't discard them or make them fit, it fails.
> 
> Sure. I don't see how this conflicts with my goal? Just take a simple
> example -- let's say I want to parse a string that ends with an
> exclamation point. I might get an input like this:
> 
>   Foo! Bar! Baz!
> 
> I'd like to parse this into a hierarchical structure which would let
> me (at the topmost level) get the "Foo! Bar! Baz" and "!" parts
> separately. This can certainly be described by a context-free grammar
> (although not LL(1)) but not by a regular language (if the string up
> to the exclamation point is arbitrary). So in order to handle this
> sort of thing, I'd need something stronger than a scanner.
> 

What's the difference between the last exclamation point and the first two? 
If there is one, it must be specified in the rules.  If not, then it won't be
possible for the scanner to break this string into the two tokens
"Foo! Bar! Baz" and "!".

> > It's possible to change the way  the scanner handles input at
> > run-time, and to have it return different tokens depending on
> > context, but it's not possible to change the meanings of tokens
> > within the parser.
> 
> I don't need to.
> 
> > From what I know of markup languages, there's no need for this kind
> > of hierarchical structure.
> 
> I've tried to explain why I need this. I may have rules that say, for
> example, that the first single-line paragraph of the document is the
> title, but only if there is no multi-line paragraph before it, for
> example. (There are many, many other similar examples.)
> 

> > While the resulting documents may be structured hierarchically, the
> > input can be processed sequentially.   
> 
> If you have a very simple markup language, yes. I'm trying to add
> markup to a plain text file with mostly *implicit* markup, which is
> why I need some more complex rules. I've made simpler versions, but they
> haven't been satisfactory.

I think markup is by definition explicit.  I also think that it might be
easier just to mark up a document rather than try to define rules for parsing
a sparsely marked-up file. However, I wish you success all the same.

I'm sorry, but I inadvertently deleted the last email you sent.  There was
something I wanted to respond to, but I can't remember what it was now.  Would
you please send it to me again?
Thanks. (Please don't repost it to help-bison, though!) 


Laurence


