Re: Tokenizing

emacs-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Tokenizing

From:	Stephen Leake
Subject:	Re: Tokenizing
Date:	Mon, 22 Sep 2014 09:02:28 -0500
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.3 (windows-nt)

Daniel Colascione <address@hidden> writes:

> Wagner's thesis contains a description of a few alternative incremental
> GLR algorithms that look very promising.

Thanks for the pointer; I really need better error handling in the Ada
mode parser. It currently defaults to a _really_ dumb indentation
algorithm (same as previous line) when the parse fails.

> We can start parsing at some region we quickly determine is semantically
> equivalent to the start of the buffer using something like
> font-lock-beginning-of-syntax-function.  

For Ada, that's only start of buffer; the standard convention is only
one compilation-unit (the single grammar start symbol) per file.

> Here, we can reliably partially parse the buffer by parsing
> forward (remember, GLR parsing is an online algorithm) until the
> parsing limit is after the visible portion of the buffer; we can then
> use a "shortest-to-completion" production (which we can pre-compute
> from the grammar) to insert virtual tokens that complete the language.

I had the same idea; I was hoping it was new and I could make money :).
Is that in Wagner's thesis (I didn't see it on a quick scan), or written
up somewhere else?

> One really nice
> property of GLR parsing (and LR parsers generally) is that we can
> trivially pause and restart them, so we can parse in the background with
> very little impact on user activity, using `while-no-input' to back out
> to the event loop when we detect we have something more important to
> do.

That's what I need! Although a really fast GLR parser implemented in
compiled Ada that accesses the buffer via an Emacs FFI would be better.

> Error recovery is a challenge --- we want to do something useful on
> buffers that are temporarily invalided by user edits.  One of the
> biggest challenges is dealing with mismatched brace constructs, since
> one missing "}" in C-mode can cause the entire rest of the buffer to no
> longer conform to the grammar.

I'm hoping to use the same "shortest-to-completion" algorithm for this.
It may distort the rest of the buffer, but it will be better than what
it does now.

> Bridge parsing is particularly interesting here: the basic idea is that
> we perform one pass over the buffer using a simplified grammar that
> looks only at nesting constructs, insert virtual braces to make
> everything balance, and then apply the usual parsing algorithm (with
> ad-hoc error recovery) to the result. I haven't tried it, but it
> apparently works pretty well.

That could work for Ada as well.

> I like what SMIE does, but I don't think it's powerful enough to parse
> C++. 

I wrote a SMIE parser for Ada, but it ended up always parsing the whole
buffer. So I switched to LALR. Then to generalized so I would not have
to mangle the grammar in the Ada standard as much.

> GLR parsers also leave us with a parse forest instead of a parse tree,
> and we really need a parse *tree* if we're going to apply fontification
> rules specified in terms of AST nodes.  We need to prune the parse
> forest down to a parse tree before using it for fontification.

In Ada mode, I just assume the results of parallel parses that end in
the same state are equivalent for the purposes of Ada mode
(fontification and navigation). So I just throw away all but one result.
I haven't really investigated whether this is true.

> Consider the most vexing parse:
>
>   type variable(function())

One reason I like Ada; the equivalent declarations are not ambiguous.

-- 
-- Stephe

[Prev in Thread]

Current Thread

[Next in Thread]

Re: Tokenizing, (continued)

Prev by Date: Re: Tokenizing
Next by Date: Re: Tokenizing
Previous by thread: Re: Tokenizing
Next by thread: Re: Tokenizing
Index(es):
- Date
- Thread