bison-patches

From: Akim Demaille
Subject: Re: several messages
Date: Wed, 16 Sep 2009 16:32:16 +0200


On 8 Sep 2009, at 01:42, Joel E. Denny wrote:

My grammar has:
- 185 rules
- 104 terminals
- 58 nonterminals

How many S/R and R/R conflicts? I'm also interested in conflicts that are
resolved by prec/assoc declarations.

There are no remaining conflicts. The updated log has the figures on solved conflicts:

- 1143 in LALR/IELR
- 64650 in full LR

It's good to see that the IELR computation time is barely worse than LALR.
Because IELR didn't split any states, the output files are of course
exactly the same. However, that also means there may not have been much
for IELR to do, and so the timings may not reveal much about IELR's
complexity. The number of conflicts would provide some insight here...
but maybe I'm getting off topic a little.

(By the way, I've been meaning to add a test case for Bison that makes
sure the LALR and IELR parser tables are exactly the same for
parse-gram.y.  If they ever become different, we'll need to fix
parse-gram.y or start using IELR.)

Good point :)
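
For concreteness, the three benchmark labels below ("canonical-lr-all",
"canonical-lr-consistent", "canonical-lr-accepting") correspond to
grammar-file settings along these lines; the exact spellings of these
%define values were still in flux at this point, so treat it as a sketch:

%define lr.type canonical-lr
/* ...plus one of: */
%define lr.default-reductions all         /* -> canonical-lr-all */
%define lr.default-reductions consistent  /* -> canonical-lr-consistent */
%define lr.default-reductions accepting   /* -> canonical-lr-accepting */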


canonical-lr-all

Execution times (seconds)
IELR(1) Phase 3       :   0.33 (10%) usr   0.01 (10%) sys   0.00 ( 0%) wall
IELR(1) Phase 4       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
conflicts             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
parser action tables  :   2.19 (65%) usr   0.01 (10%) sys 128.00 (100%) wall
outputing parser      :   0.16 ( 5%) usr   0.02 (20%) sys   0.00 ( 0%) wall
running m4            :   0.65 (19%) usr   0.05 (50%) sys   0.00 ( 0%) wall
freeing               :   0.01 ( 0%) usr   0.01 (10%) sys   0.00 ( 0%) wall
TOTAL                 :   3.36             0.10           128.00
Bison 3.49s real  3.37s user  0.11s system
G++   7.85s real  6.76s user  0.76s system
state 12941
2.7M /tmp/ugrammar.cc
52K /tmp/ugrammar.hh
1.6M /tmp/ugrammar.o
25M /tmp/ugrammar.output

I'm glad to see that most of the extra time does not fall in the original parser table computation. It just takes a really long time to compress
and transform those huge tables into the output code.

Yes, indeed.

canonical-lr-consistent

Execution times (seconds)
IELR(1) Phase 3       :   0.36 ( 2%) usr   0.01 ( 4%) sys   0.00 ( 0%) wall
IELR(1) Phase 4       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
conflicts             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
parser action tables  :  16.38 (91%) usr   0.05 (18%) sys   0.00 ( 0%) wall
outputing parser      :   0.34 ( 2%) usr   0.08 (29%) sys   0.00 ( 0%) wall
running m4            :   0.89 ( 5%) usr   0.13 (46%) sys   0.00 ( 0%) wall
freeing               :   0.02 ( 0%) usr   0.01 ( 4%) sys   0.00 ( 0%) wall
TOTAL                 :  18.01             0.28             0.00
Bison 18.64s real  18.02s user  0.30s system
G++   10.63s real  9.15s user  0.96s system
state 12941
5.6M /tmp/ugrammar.cc
52K /tmp/ugrammar.hh
2.3M /tmp/ugrammar.o
27M /tmp/ugrammar.output

Wow. What a jump. I can see why you don't want this. I have to admit that I haven't spent much time exploring the combination of canonical-lr
and different values of lr.default-reductions.  Maybe it's bad that I
defaulted it to "accepting"? I just didn't want anyone to argue that they
specified canonical LR but didn't get it.

I have no strong opinion about this. After all, someone reading the documentation about lr.type should be ready to read the section about lr.default-reductions. We have to warn them. My figures are public; if you mean to use them in some form in the documentation, that would be fine with me.


canonical-lr-accepting

Execution times (seconds)
reader                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
IELR(1) Phase 3       :   0.37 ( 1%) usr   0.01 ( 2%) sys   0.00 ( 0%) wall
IELR(1) Phase 4       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
conflicts             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
parser action tables  :  46.08 (95%) usr   0.21 (34%) sys   0.00 ( 0%) wall
outputing parser      :   0.61 ( 1%) usr   0.12 (20%) sys   0.00 ( 0%) wall
running m4            :   1.28 ( 3%) usr   0.26 (43%) sys   0.00 ( 0%) wall
freeing               :   0.01 ( 0%) usr   0.01 ( 2%) sys   0.00 ( 0%) wall
TOTAL                 :  48.38             0.61             0.00
Bison 49.65s real  48.39s user  0.63s system
G++   13.04s real  11.08s user  1.03s system
state 12941
8.5M /tmp/ugrammar.cc
52K /tmp/ugrammar.hh
3.2M /tmp/ugrammar.o
31M /tmp/ugrammar.output

Ouch.  Well, more motivation to steer clear of canonical LR.

:)


It would be nice to extend the *.output file with various metrics
about the grammar.

Agreed.

Will add this to TODO.

The syntax error on "1 1" behaves as expected with both accepting and
consistent (i.e., mute since there are too many possibilities ;).

If you could unmute it (by adding a few more possibilities in the
skeleton), it would be interesting to see if canonical LR gives the same
expected-token list.

I was interested in extending it temporarily to compare the expected
tokens for your specific grammar. Of course, I also would like to see it
fixed in general...

I'll keep this as another TODO item :)
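
For the curious: the "muting" comes from the verbose-error code in yacc.c,
which scans the action row of the current state and gives up once the list
of expected tokens exceeds a small cap. Roughly, from memory, so details
may be off:

/* Collect the expected tokens for the current state; mute the list when
   it exceeds the cap (YYERROR_VERBOSE_ARGS_MAXIMUM, i.e. 5).  */
int yyn = yypact[yystate];
if (YYPACT_NINF < yyn && yyn <= YYLAST)
  {
    int yyxbegin = yyn < 0 ? -yyn : 0;
    int yychecklim = YYLAST - yyn + 1;
    int yyxend = yychecklim < YYNTOKENS ? yychecklim : YYNTOKENS;
    int yyx, yycount = 0;
    for (yyx = yyxbegin; yyx < yyxend; ++yyx)
      if (yycheck[yyx + yyn] == yyx && yyx != YYTERROR)
        {
          if (++yycount > YYERROR_VERBOSE_ARGS_MAXIMUM)
            {
              yycount = 0;      /* too many: report a bare "syntax error" */
              break;
            }
          /* otherwise remember yytname[yyx] for the message */
        }
  }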


One way out of the fixed arity would be to use "note: " lines, as Alex
suggested (and I'm slowly changing my mind in this regard).

I was thinking we were both only opposed to dropping the warning label. I like the idea of a special function for formatting submessages. I'm not
so sure about the "note: " prefix though.

I changed my mind when I saw how current GCCs talk to me.

$ cat /tmp/foo.cc
void foo (float*) {}
void foo (int*){}

int
main()
{
  foo(1);
}
$ g++ -Wall /tmp/foo.cc
/tmp/foo.cc: In function 'int main()':
/tmp/foo.cc:7: error: call of overloaded 'foo(int)' is ambiguous
/tmp/foo.cc:1: note: candidates are: void foo(float*) <near match>
/tmp/foo.cc:2: note:                 void foo(int*) <near match>

and Emacs correctly understands this as a single error.


However, for the generated parser rather than for Bison itself, strings fed
to yyerror currently don't contain newlines. I'm not sure how important or
how long-standing that convention has been. If we move to a %error-report
or something like that, that might be a good opportunity to break the
convention... or to let the user take control.

I think we should still keep the action of building the error message separate from that of "sending" it. But indeed, maybe that's wrong. It's clearly "nicer" and more modular, but on the other hand it may push onto the user needless memory-allocation issues that would not even exist if she used fprintf directly in her %error-report.

Still...  I feel uncomfortable with fusing the two.



My suspicion would have been that disabling default reductions almost always does more good than harm, but I think we need more exploration.

So if we can't trust static tables, should we actually try each
possible token, one after the other, and see if one can be shifted at
some point? Of course, without touching the stack, that can be quite a
pain to write :( But it should not be very different from our
error-recovery code.

You're reading my mind.  :)  I think we've realized for a while that
%error-verbose is not perfect.

Yes, I agree.

However, there's one problem with your suggestion. By the time the syntax
error is reported, it's too late: the parser may have already performed
some default reductions.

Ah!  Of course!

Thus, immediately upon fetching the token (it's fine if the fetch comes after performing some default reductions in consistent states), the parser needs to determine whether the token can eventually be shifted. That is,
the parser first tries the parse without performing semantic actions,
without manipulating the value and location stacks, and without losing the original state stack. If it successfully reaches a shift, it then repeats
the parse on all the normal stacks.  If unsuccessful instead, it
immediately reports a syntax error and then repeats the exploration for every token, as you suggested, so it can report what tokens were expected. During each such exploration (during parsing or at error recovery), the parser need not duplicate the normal parser state stack in order to avoid
losing the original context.  Instead, it keeps a temporary stack that
records only the exploratory states that are not already recorded on the normal parser stack. It also keeps a pointer into the normal parser stack in order to record the base of the temporary stack. That pointer may move
during exploratory reductions.  I've implemented all this already in
yacc.c, and it functions correctly so far.  I'm just tidying up memory
management at the moment, and I need to do more testing.
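
To make the exploration concrete, here is a rough sketch of such a check.
Every helper name below is hypothetical; the real yacc.c code additionally
reallocates the temporary stack and works through the actual yypact/yytable
encoding:

/* Can TOKEN eventually be shifted from the state on top of the real
   stack, without modifying that stack?  */
enum yy_kind { YY_SHIFT, YY_REDUCE, YY_FAIL, YY_ACCEPT };
struct yy_act { enum yy_kind kind; int value; };     /* value: state or rule */
struct yy_act yy_lookup (int state, int token);      /* action table */
int yy_goto_state (int state, int lhs);              /* goto table */
int yy_rule_len (int rule), yy_rule_lhs (int rule);  /* rule metadata */

static int
yy_lac (const int *stack, int depth, int token)
{
  int tmp[200];                        /* exploratory states only */
  int ntmp = 0;
  const int *top = stack + depth - 1;  /* may move down; stack is read-only */

  for (;;)
    {
      int state = ntmp ? tmp[ntmp - 1] : *top;
      struct yy_act a = yy_lookup (state, token);
      int len, below;

      if (a.kind == YY_SHIFT || a.kind == YY_ACCEPT)
        return 1;                      /* the token is viable */
      if (a.kind == YY_FAIL)
        return 0;                      /* it can never be shifted here */

      /* Reduction: pop the RHS, first from the temporary stack, then by
         walking down the real stack; then push the goto state.  */
      len = yy_rule_len (a.value);
      while (len > 0 && ntmp > 0)
        { --ntmp; --len; }
      top -= len;
      below = ntmp ? tmp[ntmp - 1] : *top;
      if (ntmp == (int) (sizeof tmp / sizeof *tmp))
        return 0;                      /* out of room; real code reallocates */
      tmp[ntmp++] = yy_goto_state (below, yy_rule_lhs (a.value));
    }
}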

That sounds exciting!


At worst, LAC should no more than double the parsing time. That is, even
though it's similar to GLR or backtracking, it tries only one possible
parse on any token up to its shift before performing the parse for real or
reporting a syntax error.  (Also unlike GLR and backtracking, we still
resolve conflicts statically.  The parser is still deterministic.)  My
suspicion is that, given the time for file I/O, lexical analysis, and
semantic actions, the total run time of the parser will not be affected
severely by this doubling.  Of course, building a syntax error message
will take longer than double, but I'm not as worried about that. When I
finish implementing, I'll try collecting some statistics.

Now I see what you're using bench.pl for :)


Another interesting point is that, with LAC, IELR should always perform exactly the same semantic actions as canonical LR even for syntactically
incorrect sentences.

Nice feature!

Previously, IELR could only claim that for
syntactically correct sentences. (The same can be said for LALR as long as there are no mysterious conflicts, but, unless there are no conflicts
at all, that's hard to verify without an IELR implementation anyway.)
One can even imagine using LAC in conjunction with a push parser to
implement, for example, tab completions.

I also had this in mind.
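
With the existing push interface one can already approximate this, minus
LAC's no-side-effects guarantee, by replaying the token prefix once per
candidate and keeping the candidates the parser does not reject.
yypstate_new, yypush_parse, yypstate_delete, and YYPUSH_MORE are the real
push API; the header name and the rest are made up:

/* Could CANDIDATE come next after PREFIX?  Requires a parser generated
   with %define api.push-pull push.  Caveat: this replays semantic
   actions, which a LAC-based version would avoid.  */
#include "parser.h"    /* hypothetical generated header */

static int
viable_next (const int *prefix, int nprefix, int candidate)
{
  yypstate *ps = yypstate_new ();
  YYSTYPE lval = {0};  /* dummy semantic value */
  int status = YYPUSH_MORE;
  int i;

  for (i = 0; i < nprefix && status == YYPUSH_MORE; ++i)
    status = yypush_parse (ps, prefix[i], &lval);
  if (status == YYPUSH_MORE)           /* the prefix itself parsed fine */
    status = yypush_parse (ps, candidate, &lval);
  yypstate_delete (ps);

  /* YYPUSH_MORE: wants more input; 0: accepted; anything else: error.  */
  return status == YYPUSH_MORE || status == 0;
}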


Keep in mind that, with LAC, the only possible difference between
lr.default-reductions=all and =consistent for any fixed value of lr.type
should be parser table sizes and performance (that is, possibly many more
reductions to explore before permitting normal parsing), which would be
interesting to collect statistics on.

It would also be nice to explore the runtime merits of the table compression. I remember that some other tools (maybe Yacc++) offer this feature. Space is probably no longer an issue, so maybe raw tables would buy some added runtime performance.
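
For reference, the classic yacc compression overlaps all the state rows
into a single vector, so each lookup pays an offset computation plus an
ownership check where a raw table pays a single two-dimensional index.
A sketch; yypact/yytable/yycheck/yydefact are the traditional names,
NSTATES/NTOKENS are made up:

/* Raw tables: one indexed load per action lookup.  */
static int
lookup_raw (const int raw[NSTATES][NTOKENS], int state, int token)
{
  return raw[state][token];
}

/* Compressed tables: offset the state's row into the shared vector,
   then check that the entry really belongs to this state.  */
static int
lookup_compressed (int state, int token)
{
  int n = yypact[state] + token;
  if (0 <= n && n <= YYLAST && yycheck[n] == token)
    return yytable[n];
  return yydefact[state];   /* default-reduction fallback */
}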


BTW, I've not yet done any homework to see if any of this has been done before. Now that I write it down, it seems like such a simple premise, so maybe it has been done many times. I should probably have researched it before banging out an implementation. Maybe there's a better way already
discovered.

It seems that indeed many, many results have been obtained about LR parsers, but a lot of them are now somewhat forgotten. Yet there are so many papers out there that I don't know the right way to fetch only the relevant ones.

Do I also understand that we have neither a superset nor a subset,
just some fuzzy set?

For expected tokens reported in syntax error messages, that's correct.
It's unsettling, isn't it?

Yes :)

I can send ugrammar.y privately, if you wish.

I'd like to see that.  Thanks!

Will do right afterwards. Please, no dirty jokes about my grammar :) I wish it were nicer in many places...


