help-bison

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Mon, 17 May 2004 15:48:04 +0200
User-agent: Mutt/1.4i

Laurence Finston <address@hidden>:
>
[snip]
> > Certainly. My hope was to solve that with some form of dynamic
> > precedence -- but I can't find any reasonable way of doing that with
> > Bison.
> >
> 
> I don't think "dynamic" is a suitable term with reference to the
> grammar rules.

You should take that up with the documentation authors, then <wink>

The precedence used in GLR parsing, set by %dprec, is referred to as
"dynamic precedence" (hence the name) in the documentation.

> It is possible for `yylex()' to determine dynamically what tokens to
> return to `yyparse()', though.

Yes. That is a different matter, but also useful.

This is, in fact, the way I think I'll be doing things -- have Bison
tell the lexer what tokens are legal, and have Flex return the first
occurrence among those. (That closely mimics the current Atox
behavior, as implemented in my custom parser generator.)
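A rough Python sketch of that feedback loop (the token names, the
patterns, and the first_token helper are all made up for illustration;
the real wiring would go through yylex):

```python
import re

def first_token(text, pos, legal):
    """Return (name, match) for whichever of the currently legal
    patterns matches earliest in text[pos:], or None if none match."""
    best = None
    for name, pattern in legal.items():
        m = re.compile(pattern).search(text, pos)
        if m is not None and (best is None or m.start() < best[1].start()):
            best = (name, m)
    return best

# The parser would call this with only the tokens that are legal in
# its current state:
name, m = first_token("some *emphasized* text", 0,
                      {"STAR": r"\*", "NEWLINE": r"\n"})
```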

[snip]
> > Sure -- but you still need to scan, right? How do you do that? (Am I
> > missing some scanning feature in Bison? Are you suggesting writing
> > context-free grammars for the tokens?)
> >
> 
> `yylex()' is the scanner.  The method I'm using doesn't use regular
> expressions

Right. I need regular expressions, or some similar way of letting the
user specify patterns more general than simply fixed strings.

[snip]
> Apart from a few characters that are handled differently, all
> characters are divided into categories.

Hm. Yes, this is an interesting option, I suppose. If this way of
thinking can achieve what I'm looking for, there's no need for me to
be hung up on one specific tool, such as regexps.

> A `symbolic token' consists of one or more characters from a single
> category.

In my case, I suppose the user would have to specify the categories
(as Atox is to be as customizable as possible, in order to handle the
user's own format):

> The scanner collects characters into a string representing a
> `symbolic token' as long as they belong to single category.  If it
> scans a character from a different category, it terminates scanning.

This wouldn't solve my "wildcard" problems, but it might be an
interesting thing to consider for the tokenization. Regexp syntax
isn't exactly user friendly, and might offer more functionality than I
really need.
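A minimal Python sketch of that category-based scanning, with
hypothetical categories standing in for the user-configurable ones:

```python
def categorize(ch):
    # Hypothetical categories; in Atox these would presumably be
    # specified by the user.
    if ch.isalpha():
        return "LETTER"
    if ch.isdigit():
        return "DIGIT"
    return "OTHER"

def tokenize(text):
    """Collect maximal runs of same-category characters into
    symbolic tokens; a category change terminates the current run."""
    tokens = []
    i = 0
    while i < len(text):
        cat = categorize(text[i])
        j = i
        while j < len(text) and categorize(text[j]) == cat:
            j += 1
        tokens.append((cat, text[i:j]))
        i = j
    return tokens
```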

[snip]

> I'll leave this to the computer scientists on this list to thrash
> out, if they care to.  From my point of view, if my grammar is
> ambiguous, Bison won't be able to generate a parser, unless its
> rules for resolving ambiguity are able to cope with it.  I don't
> think it would help me to _find_ all possible parses of an input, I
> want it to actually be parsed, and I want its actions to produce
> correct results.

Sure -- the hope was to use dynamic precedence to sort that out. That
would be much easier than writing a grammar that was unambiguous
(without any use of precedence) to begin with. This point is even made
by the Bison documentation.

> Therefore, my input (in combination with Bison's default behavior)
> must unambiguously specify a single set of rules for parsing.

Sure.

> If the terminator is a single character then the problem is fairly
> simple.

Indeed. The point is, of course, that it'll work with an arbitrary
regular expression as a terminator. And *that* is *exactly* what I'm
wishing for -- only for context-free grammars rather than regular
ones.

> The non-greedy version above is equivalent to `[^y]*y' in Emacs'
> regexp syntax (the one I'm most familiar with).  However, it would
> be difficult and maybe impossible to specify something like
> `.*^(abc)', meaning "any string not ending in `abc'".

This is the whole point of the non-greedy operator, as implemented
(among other places) in Perl's regexp engine, in the PCRE library, and
in Python's re module.
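For instance, in Python's re module (same syntax as in PCRE):

```python
import re

text = "<a><b><c>"
# Greedy: .* grabs as much as possible before the final '>'.
assert re.match(r"<.*>", text).group() == "<a><b><c>"
# Non-greedy: .*? stops at the first '>'.
assert re.match(r"<.*?>", text).group() == "<a>"
```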

> Maybe this is easy and efficient in Python, I don't know.

Well, it's certainly easy. It's also described in the "standard" book
on regexps from O'Reilly:

http://www.oreilly.com/catalog/regex/chapter/ch04.html

(I haven't read this thoroughly, though.)

> I believe that any features requiring backtracking are likely to be
> slow, though.

I must confess I don't know how it is implemented. Backtracking would
be a drawback, for sure. I think non-greedy behavior is possible to
implement with NFAs (which can translate into DFAs, of course), which
would mean that there's no need for backtracking.
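For a single-character terminator the equivalence is easy to check in
Python: `[^y]*y` is a plain regular expression that a DFA can handle
with no backtracking, and the non-greedy form matches the same prefix:

```python
import re

s = "abcyxyz"
# Both stop at the first 'y'.
assert re.match(r".*?y", s).group() == "abcy"
assert re.match(r"[^y]*y", s).group() == "abcy"
```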

> > > at worst, one lookup for every character scanned.
> >
> > That would require a very inefficient implementation indeed.
> >
> 
> I think this depends on what characters can occur in input items.

OK. Maybe so. But finding the first occurrence of a disjunction of
strings (or patterns) can be done in linear time; once you have the
first occurrence, you also know the length of the span that *didn't*
match any of the patterns.
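A sketch of that idea in Python (re alternation stands in for the
scan here; a true linear-time guarantee would come from something like
an Aho-Corasick automaton):

```python
import re

def first_of(text, words):
    """Find the first occurrence of any of `words`; everything before
    it is then known not to contain any of them."""
    m = re.compile("|".join(map(re.escape, words))).search(text)
    if m is None:
        return None, text
    return m.group(), text[:m.start()]

hit, skipped = first_of("the quick brown fox", ["brown", "fox"])
```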

> (I say "input items" to distinguish them from the "tokens"
> returned to `yyparse()' by `yylex()'.)
> If there's no easy way to recognize separations, e.g., if
> blank space is allowed within input items, and if input items
> can be prefixes of other input items, then this might be
> necessary.

This sort of separation would be irrelevant to a DFA (see above).

I'm talking about parsing "outside the box" here -- not neat "split
on whitespace" sort of computer language parsing... :)

[snip]
> Theoretically, you could insert the markers and decide on the second
> pass how to handle the asterisk.

Sure. I even thought about simply adding token-level markup and
handling aggregation and the like in XSLT. It's possible to do this
sort of thing, but as far as I can see the only result is added
latency. Maybe I'm just missing your point.

[snip]
> In what sense is it "last"? Because only whitespace follows it up to
> the next newline?  If `yyparse()' is to recognize it as different
> from the other two, then `yylex()' must return a different token to
> represent it.

I'm not sure how many times it's interesting to keep repeating this,
but still: No, yylex does not have to return a different token to
represent it. That's the whole point. It has to be decided by a
context-free grammar. That's why I'm asking about this. If the lexer
could decide this sort of thing for me, there would be no problem.

It seems we're just going in circles here :)

[snip]
> To quote Don Knuth, "Computers are good at following instructions;
> they're not good at reading your mind".

Sure. My grammar descriptions are 100% formal -- no mind-reading.

And they work. So the question isn't whether a computer can read my
mind; the question is whether Bison can perform as well, as a parser
generator, as the simple LL(1)-like one I have written myself. I
really would have thought so, as I wouldn't have thought my parser
generator to be particularly magical.

(That's for the LL(1)-like wildcard functionality; the non-greedy
context free thing seems like a tougher nut, worthy of some research,
I suppose.)

> I too find LaTeX's style of markup too verbose and rigid.  I haven't
> used XML, but I feel the same way about HTML.

Yup. XML is just as verbose as HTML. You can write custom sets of
tags, at least.

> I like TeX's style of markup, and it's the markup language I'm most
> familiar with.

I see.

One of the reasons I'm working on this is to let non-expert users
format text, e.g. with email-like markup. I also want to be able to
figure out things like the fact that the following is a list:

  1. Foo

  1.1 Bar

While the following are two headers with a paragraph in between:

  1. Foo

  Baz

  1.1 Bar

This sort of thing needs real parsing, not just macro expansion.
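As a toy illustration of the distinction (a hypothetical heuristic,
not Atox's actual rule): an unbroken run of numbered blocks is a list,
while a numbered block adjacent to a plain paragraph is a header:

```python
import re

def classify(blocks):
    """Classify numbered blocks: list items if every block in the
    sequence is numbered, headers otherwise (a hypothetical rule)."""
    numbered = [bool(re.match(r"\d+(\.\d+)*\.?\s", b)) for b in blocks]
    out = []
    for is_num, block in zip(numbered, blocks):
        if not is_num:
            out.append(("paragraph", block))
        elif all(numbered):
            out.append(("item", block))
        else:
            out.append(("header", block))
    return out
```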

> TeX is also the reason why I suggested using a macro processor. In
> general, I find that it pays to look at the way Knuth solves a
> problem.

I'm sure. He is brilliant, of course. Still, I feel (at the risk of
sounding heretical) that some of his work is a bit old-fashioned. I
think Metafont has excellent functionality, but the language seems
very strange to me.

I'm not 100% convinced about TeX either. It reminds me of old
programming languages where you could use operators as variables
etc... :]

Oh, well. I guess I'll figure it out. And I'll have a look at the
tokenization method you suggested (in the Knuthian spirit ;)

Thanks,

- Magnus

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park



