[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: how to get left hand side symbol in action
From: |
Christian Schoenebeck |
Subject: |
Re: how to get left hand side symbol in action |
Date: |
Fri, 10 May 2019 15:11:17 +0200 |
On Freitag, 10. Mai 2019 07:24:51 CEST Akim Demaille wrote:
> Hi Christian,
Hi Akim!
> Aren't you referring to LA correction, as implemented in Bison?
>
> https://www.gnu.org/software/bison/manual/html_node/LAC.html
Well no, it is has somewhat in common, but the use cases are differently, see
below.
> > That's actually a very common required feature in practice. The main use
> > cases are:
> >
> > 1. Auto completion features
> >
> > 2. Human friendly error messages
>
> I think you are referring to the name of the tokens, not all the symbols.
> For the error messages, it makes sense. Although I am now more convinced
> that most of the time, error messages are more readable when you quote
> exactly the source than when you print token names.
No, I really mean the non-terminal symbols. Because sometimes [not always ;-)]
I don't use the classic approach of separate lexer and parser, but rather let
the parser do the lexer job as well, like:
FOO : 'F''O''O' ;
which can avoid complex and error prone push and pop constructs that would be
required with a separate lexer approach and certain complex grammars. So
effectively some of the non-terminals (from Bison point of view) are in such
specific use cases "terminals" from use case point of view. But one of the main
problems of this approach is that e.g. the default syntax error messages would
no longer be useful at all. Because you would get a message like:
Unexpected 'b', expecting 'F'.
Instead of e.g.
Unexpected 'bla', expecting 'FOO'.
> > I do need these features for almost all parsers, hence for years (since
> > not
> > available directly with Bison) I have a huge bunch of code on top of the
> > internal skeleton code to achieve them.
>
> Is there available for reading somewhere?
Well, I asked couple years ago on the list if anybody was interested in what I
am doing to achieve these kinds of features. But got no positive reply, hence
I did not invest time to write about it in detail yet.
However I am not sure if my approach would be of use for you anyway. Basically
what I am doing (roughly described) is C++ code on top of the internal
generated parser tables which replicate what the skeleton LALR(1) parser would
do. So the algorithm takes a parser state as argument:
std::vector<YYTYPE_INT16>& stack
and then it walks the possible routes starting from there (accessing the auto
generated tables directly), that is pushing for each route/branch on stack,
reducing whenever required etc. And to detect endless recursions while doing
this, I always track the complete parser history with
typedef std::set< std::vector<YYTYPE_INT16> > YYStackHistory;
So each entry in this history set is a previous parser symbol stack, and when
I reach a new parser state I check if the current parser stack already exists
in that history set and if it does exist already, then it reached an endless
recursion and I abort the algorithm for that individual branch and then
continue with the next branch, and so on. That full history tracking might
take a lot of memory of course though, but eventually the algorithm ends up
returning a result:
std::map<String,BisonSymbolInfo>& expectedSymbols
where the result map (not a multi map actually) contains the possible next
grammar rules for the previously supplied parser state; key being the symbol
name, value is a struct (BisonSymbolInfo) holding a) the sequence of
characters expected next for satisfying that grammar symbol and b) a flag
whether the symbol is (considered as) terminal or a "real" non-terminal (see
below).
So that C++ code is for me the basis for retrieving the next possible grammar
rules for any given parser state, which I use both for error messsages as well
as for auto completion features.
Another problem with my pseudo-terminals (example FOO above): From Bison point
of view, everything in my grammar are now non-terminals, and I don't want to
manually mark individual grammar rules to be either considered as "real" non-
terminals and others to be considered as "terminals", So I automated that by
checking whether the same symbol can be resolved in more than one way, then it
is a "real" non-terminal, and by checking if the right hand side of the rule
only contains characters, then it is considered as "terminal. And that
fundamental information is automatically used to provide appropriate error
messages and auto completion in a fully automated way.
> Was the feature always fitting perfectly? Never ever did it result in
something somewhat incorrect?
I did not make a proof of correctness of the algorithm. As you know you have
to spend a huge amount of time to really proof this, which I did not.
Especially when you are working on commercial projects you also have to be
pragmatic sometimes and also accept the chance of imperfectness for the sake
of getting forward and weigh costs on usefulness.
The main issue was catching endless recursions that I solved like described
above. I have not investigated whether there would be cases the algorithm
would fail, there might be. Keep in mind the use cases for this was: error
messages and auto completion, not to execute (i.e. fatal) grammar rule
actions, so the potential harm is limited. Even if the algorithm would fail in
certain cases, the worst you get is that auto completion does not come up with
a suggestion in that specific case or the error message being too short. It is
still an improvement on the overall situation though of not having the
possibility for these kinds of features at all.
> >> In addition, tokens have several names: the identifier, and the
> >> string name, like
> >>
> >> %token <string> ID "identifier"
> >>
> >> Not to mention that I also want to provide support for
> >> internationalization. So what name should that be? ID?
> >> identifier? or identifiant in French?
> >
> > In practice you just need the symbol name as is. Nobody needs the
> > translation,
> I beg to disagree. Nobody should translate the keyword "break",
> but
>
> > # bison /tmp/foo.y
> > /tmp/foo.y:1.7: erreur: erreur de syntaxe, : inattendu, attendait char ou
> > identifier ou <tag>>
> > 1 | %token: FOO
> >
> > | ^
>
> looks stupid; "char", "identifier" and "<tag>" should be translated.
Well, my point was that translation is trivial. An average developer is
certainly able to solve required translation tasks very easily by custom code
if required. The other aspects though, looking ahead on a parser states with
custom code is not trivial. So my point was: on doubt it might make sense to
concentrate on the mandatory aspect of providing access to the raw symbol
names and ignoring the translation aspects for now.
Best regards,
Christian Schoenebeck
- Re: how to get left hand side symbol in action, (continued)
- Re: how to get left hand side symbol in action, r0ller, 2019/05/09
- Re: how to get left hand side symbol in action, Akim Demaille, 2019/05/10
- Re: how to get left hand side symbol in action, Hans Åberg, 2019/05/10
- Re: how to get left hand side symbol in action, Derek Clegg, 2019/05/10
- Re: how to get left hand side symbol in action, Christian Schoenebeck, 2019/05/10
- Re: how to get left hand side symbol in action, Akim Demaille, 2019/05/10
- Re: how to get left hand side symbol in action, Derek Clegg, 2019/05/10
- Re: how to get left hand side symbol in action, Christian Schoenebeck, 2019/05/11
- Re: how to get left hand side symbol in action, Akim Demaille, 2019/05/19
- Re: how to get left hand side symbol in action, Christian Schoenebeck, 2019/05/20
- Re: how to get left hand side symbol in action,
Christian Schoenebeck <=
- Re: how to get left hand side symbol in action, Akim Demaille, 2019/05/10
- Re: how to get left hand side symbol in action, Christian Schoenebeck, 2019/05/10
- Re: how to get left hand side symbol in action, Hans Åberg, 2019/05/10
- Re: how to get left hand side symbol in action, Derek Clegg, 2019/05/10