Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Tue, 18 May 2004 14:12:44 +0200
User-agent: Mutt/1.4i

Frank Heckenbach <address@hidden>:
>
> Magnus Lie Hetland wrote:
> 
> > Just an example to show why a fixed set of markup tokens (that will
> > end the plain text) won't do:
> > 
> >   Foo *bar 'baz *fee* fie' foe*.
[snip]
> In this case, I'd just make the asterisk a different token
> (context-independent) which is parsed as normal text inside the
> verbatim rule and as markup otherwise. (This is still CFG, of
> course.)

*Exactly!* That's what I've been saying all along.

The problem has then been specifying the grammar of such runs of
"plain text", i.e., how they are terminated. The user specifies only
the non-plain-text part of the grammar -- in this case, the quotes and
asterisks, and how they are composed into spans of different types.
However, the user does *not* specify that the asterisk is to be
treated as plain text between the quotes. This has to be inferred from
the fact that the only legal token at that point would be a closing
quote -- everything else should be shifted into the verbatim/plain
text production.
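
Concretely, for the example above, the grammar I'd have to infer would
look roughly like this (Bison syntax, token names made up):

  %token TEXT STAR SQUOTE

  %%
  span: STAR emph_body STAR       /* emphasis */
      | SQUOTE verb_body SQUOTE   /* verbatim */
      | TEXT
      ;
  emph_body: /* empty */ | emph_body span;

  /* Inside verbatim, the only significant token is the closing
     quote, so STAR is shifted as plain text: */
  verb_body: /* empty */ | verb_body TEXT | verb_body STAR;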

My hope was that I could go beyond single-token terminators, but that
seems unrealistic; and if this sort of thing is determined by single
tokens, I guess it might be just as easy to use some
context-sensitive lexing. The result would be the same, I suppose, and
I'm not sure how to achieve it in Bison (without performing a static
analysis and rewriting the grammar).

> I'd generally be careful with content-dependent lexing. You have to
> be careful about the lookahead, i.e. bison can request a new token
> before executing the action concerning the previous tokens, and if
> that action influences how the next token should be lexed, you have
> a problem.

I see. That is a problem.
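
If I understand the hazard correctly, something like the following
would be unsafe (a sketch; in_verbatim is a hypothetical flag
consulted by the lexer):

  %{
  int in_verbatim = 0;   /* hypothetical flag read by the lexer */
  %}
  %token TEXT STAR SQUOTE

  %%
  verbatim: SQUOTE { in_verbatim = 1; }   /* may run too late */
            verb_body
            SQUOTE { in_verbatim = 0; }
          ;
  verb_body: /* empty */ | verb_body TEXT;

If Bison has already fetched the lookahead before the mid-rule action
runs, that token was lexed with in_verbatim still 0, so an asterisk
right after the opening quote comes back as STAR rather than TEXT.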

> It gets worse when you use GLR: As long as the parser is split, it
> doesn't execute any actions, but of course consumes tokens.

Ah. And I was planning on using GLR, to put less of a burden on the
user.

Then I guess I'm back to trying to get Bison to shift all illegal
tokens into the plain text production. I considered using the error
token for this for a while, but that seems like the worst hack of
all, really.
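
(For the record, the hack would look something like this -- an error
branch soaking up anything that isn't otherwise legal:

  plain: /* empty */
       | plain TEXT
       | plain error   /* any otherwise-illegal token */
       ;

-- but error recovery calls yyerror(), needs yyerrok to keep going,
and I'm not sure how it would behave under GLR.)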

> From the POV of the lexer this is like arbitrary lookahead. So
> unless you know very exactly where your ambiguities are and how long
> the parser can be split, and that during this time no lexer-critical
> actions can occur, you cannot use content-dependent lexing at all.

I see.

[snip]
> The only disadvantage is that the mixed tokens look a bit strange
> and have no clearly defined meaning until they are parsed, but
> that's just an aesthetic issue.

I don't have a problem with that, as the grammar is generated
internally. My problem would be just that -- how to generate this
internally.

Consider the following format (described here in natural language)
which might be specified in Atox:

  - Emphasis begins with an asterisk and ends with an asterisk.
    Emphasis may contain verbatim text.

  - Verbatim text begins with a single quote and ends with a single
    quote.

This is the kind of information I'd have, and from that I'd have to
infer that when an asterisk is encountered inside an occurrence of
verbatim text, it is not a significant token (it is to be treated as
plain text), simply because it is not a valid lookahead -- emphasis
(or anything else using asterisks) is not allowed there. In other
words, it has to be inferred from the grammar.

> This example was especially difficult because context-dependent
> tokens could overlap. If you have only single characters or so, it
> will be easier.

I don't, sadly. I have things like lines of dashes underlining
headers, as well as em dashes written "--", in the samples. In
general, the user may specify anything.

> As you state it, it can be described by a regex: simply `(.*)(!)' to
> get the two parts in the parentheses, or in flex: `.*/!' to get the
> first part up to the last `!'.

Yup. But this was a bad (i.e. too simple) example.

[snip]
> Apart from the fact that such a rule is quite fragile (if someone
> deletes a few words from the first multi-line paragraph, it may
> become one-line and surprisingly turn into a title),

Sure -- it was just an example. It's up to the user to define the
exact format.

> this one can be solved with a CFG.

Yes, I know. That's my point -- I need a CFG, and I also need to infer
what the possible tokens at each point are, from this CFG. Whether the
parser or the lexer deals with skipping the insignificant tokens isn't
important to me -- I just need to skip them, rather than getting a
spurious syntax error.

> It can also be solved with a global state variable (paragraph_seen
> or similar). So that's no decisive argument either way.

Sure. But this was just a single example. In general, I do need
something like a CFG.

[snip]
> OK, here we have a more complicated example. This may be helpful to
> get to the root of the problem. IIUIC, this is a line with emphasis
> and a second (strange) line:
> 
> Foo *bar*
> ---+-----
> 
> And this is a heading without emphasis and two asterisks in the
> text:
> 
> Foo *bar*
> ---------

Yes -- assuming that this is how the user has set it up. (The format
is 100% user-specifiable -- that's really the reason for all these
difficulties. Hard-coding a format as a grammar would be easy.)

> If the heading and normal line require substantially different
> parsing (as the emphasis does already -- if it's only this, it could
> be kludged around, but I suppose there's more and you really need to
> parse it differently), I see two ways of doing it: 1. with
> look-ahead in the lexer (quite hairy, similar to content-dependent
> lexing, not recommended), 2. using GLR parsing.

Yes, that was what I was hoping for -- but I haven't been able to make
it work.

> If you make sure that the language is actually unique (e.g., your
> lexer gives a special token for `---' lines which is only parsed in
> headings),

Hm. I can't really guarantee that. For example, the following might be
a legal table:

  ----------------------------
  Foo   Bar    Baz   Frozzbozz
  ============================
  1     2      3     4
  5     6      7     8
  ----------------------------

> then you might not need dprec etc. The parser will fork, one branch
> will run into a parsing error sooner or later (because one branch
> will not accept the `---' token and the other one will require it),
> and the other branch will continue. This is the "most harmless" way
> of using GLR.

I guess. But how would I specify the rule leading up to the
underlining? The way I've been trying to do it, I've specified it as
a rule matching anything at all -- and if I do that, I need the more
specific rule to trump it somehow.

On the other hand, I could simply say that it consists of all tokens
except a run of dashes (for example) -- but how am I to find that out
from the parser state, and how am I to signal it to Bison, without
statically rewriting the grammar? (I could do that, I suppose, but I'm
worried that it would be overly strict, and miss some legal parses.)
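
What I have in mind is something like this sketch (tokenization
invented, and restricted to a single block so the ambiguity stays
local): the paragraph rule is the catch-all that matches anything,
and %dprec is supposed to make the heading rule trump it when both
parses survive:

  %glr-parser
  %token TEXT DASHES NEWLINE

  %%
  block: heading   %dprec 2
       | paragraph %dprec 1
       ;
  heading: line NEWLINE DASHES NEWLINE;

  /* The catch-all: any run of lines, a dashes line included --
     which is exactly what keeps both branches alive. */
  paragraph: line NEWLINE | paragraph line NEWLINE;
  line: tok | line tok;
  tok: TEXT | DASHES;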

[snip]
> > Because what tokens are significant (as opposed to "plain text")
> > is described by a user-specified context-free grammar.
> 
> Then, of course, the scanner (i.e., the flex input, if you use flex)
> must be generated from the user's grammar as well, along with the
> bison input.

Yes, it will be.

> > OK. Maybe so. But finding the first occurrence of a disjunction of
> > strings (or patterns) can be done in linear time; once you have the
> > first occurrence, you also know the length of the span that *didn't*
> > match any of the patterns.
> 
> This might be a viable solution. AFAIK, flex doesn't support it,

I'm not sure what you mean. This is exactly what Flex does -- it finds
the first occurrence of one of its tokens. Of course, eliminating some
of those would mean bailing out in the action of those tokens, but as
far as I can see you'd still get linear running time.
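
I.e. something like this (token names invented): the catch-all rule
makes anything that doesn't start a significant token come back as
plain text, still in a single linear pass:

  %%
  "*"        return STAR;
  "'"        return SQUOTE;
  "--"[-]*   return DASHES;
  .|\n       return TEXT;   /* anything else: one character of text */
  %%

The parser (or the catch-all action, with a buffer) can then coalesce
consecutive TEXT characters into runs.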

[snip]
> This is especially practicable if your non-greedy wildcards are
> actually `.*'.

As long as I know which tokens are legal, Flex's functionality works
just fine (if I go for the LL(1)-style plain text rule). The problem
is finding out which are legal in a given state (which it seems is
possible) -- and dealing with what you mentioned about Bison fetching
several lookahead tokens, which would foul things up.

[snip]

Oh, well. I guess I could always use Flex for tokenization and adapt
my current LL(1)-like parsing code to run on top of it -- that still
ought to speed things up a bit over the current version, even if I
don't get the full speed of Bison.

Or I could try to analyze every "plain text"-position in the grammar
and figure out which tokens would be legal there; any pointers on how
I'd do that?

A contrived example:

  foo: 'a' wildcard;

  bar: 'b' foo 'x';

  baz: 'c' foo 'y';

Here I'd like to find out which tokens are allowed in the wildcard
pattern. A simple analysis would tell me "anything except 'x' and
'y'". However, ideally, I'd like that to depend on what has already
been parsed. I.e. if a 'b' has been seen, I could shift any token
except 'x', while if a 'c' has been seen, I could shift any token
except 'y'.

I guess I could split the foo rule into two different rules -- but I'm
not sure how to do such an analysis without resorting to lots of ad
hoc mechanisms. Perhaps the LR table could be used somehow?
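
The split version might look like this, with wildcard_not_x etc.
standing for generated rules that shift any token except the one
named:

  bar: 'b' foo_in_bar 'x';
  foo_in_bar: 'a' wildcard_not_x;

  baz: 'c' foo_in_baz 'y';
  foo_in_baz: 'a' wildcard_not_y;

  wildcard_not_x: /* empty */ | wildcard_not_x any_token_but_x;
  wildcard_not_y: /* empty */ | wildcard_not_y any_token_but_y;

But doing that mechanically for every plain-text position is just the
static rewriting I mentioned above, with the same risk of being
overly strict.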

Ideally, I'd like to use the LR table and the current state as I'm
parsing, to specify what tokens to look for (or to skip) -- but as I
cannot safely let the scanner do this (as you pointed out), I am again
left with trying to get Bison to do this sort of thing.

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park



