Re: RFC: enum instead of #define for tokens


From: Hans Aberg
Subject: Re: RFC: enum instead of #define for tokens
Date: Thu, 4 Apr 2002 16:59:51 +0200

At 12:19 +0200 2002/04/04, Akim Demaille wrote:
>Hans> You are right, this is not very nice :-):
>
>IMHO, my position is not nice wrt people who are abusing the system.

It is not abuse; it is standard Yacc. You yourself removed the original
--raw feature that skipped the characters.

>The example of Unicode demonstrates how bad it was to let chars be
>tokens.  That default is very C specific, I really doubt that in other
>languages, such an atrocity remains in their native Yaccs.  But I
>confess I don't know.

So let us know when you do know. :-)

>Hans> I think you are imposing your own programming style here.
>
>Not quite.  I'm imposing the theory that goes under the scene.

And this theory is?

>Hans> I tweaked my bison.simple file so that when it encounters an
>Hans> unknown character (known as a character by its range), it writes
>Hans> it out, instead of just saying "undefined". One then can make
>Hans> full use of the Flex . { return (unsigned char)yytext[0]; }
>Hans> rule.
>
>Scanning errors ought to be caught by the scanner, not the parser.

The thing is that the parser may perform error recovery, in which case the
scanner error should be handed over to the parser anyway.

Right now, I do not know how scanner errors are handed over to the parser --
perhaps there should be a macro for that in the .tab.h header.
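To make the idea concrete, here is a minimal sketch in plain C (not Flex, and
with invented names and token codes such as my_yylex and UNDEF_TOK) of a
scanner that hands an unrecognized character over to the parser as an
"undefined" token instead of failing on its own:

  #include <stdio.h>
  #include <ctype.h>

  enum token { END_OF_INPUT = 0, NUMBER = 258, UNDEF_TOK = 259 };

  static char undef_text[2];   /* text of the offending character */

  static int my_yylex(const char **cursor, long *number_value)
  {
      const char *p = *cursor;
      while (*p == ' ' || *p == '\t')
          p++;
      if (*p == '\0') {
          *cursor = p;
          return END_OF_INPUT;
      }
      if (isdigit((unsigned char)*p)) {
          long v = 0;
          while (isdigit((unsigned char)*p))
              v = v * 10 + (*p++ - '0');
          *cursor = p;
          *number_value = v;
          return NUMBER;
      }
      /* Unknown character: hand it to the parser instead of dying here,
         so the parser's error recovery and messages can deal with it. */
      undef_text[0] = *p;
      undef_text[1] = '\0';
      *cursor = p + 1;
      return UNDEF_TOK;
  }

  int main(void)
  {
      const char *input = "12 $ 34";
      long value = 0;
      int tok;
      while ((tok = my_yylex(&input, &value)) != END_OF_INPUT) {
          if (tok == UNDEF_TOK)
              printf("parser would see the undefined token for '%s'\n",
                     undef_text);
          else
              printf("parser would see NUMBER %ld\n", value);
      }
      return 0;
  }

With something along these lines, the parser's usual error recovery and error
reporting see the bad character just like any other unexpected token.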

>Hans> Very convenient: One spin-off is that one gets access to the
>Hans> error reporting system of the Bison parser also for such
>Hans> characters.
>
>If you want Bison to give an access to the $undefined token, I'm ready
>to do that.  Then the scanner may return this token.  And nothing
>prevents this token from having a value: the string, which can be used
>in error messages.

Something like that: I had to implement a routine for writing out the
character not raw, but in a suitable encoding. So when "undefined" occurs,
it should have access to the token number.

Perhaps a special yyerror_undefined with an extra argument, which by default
falls back to yyerror.
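Roughly like this (the names and signatures below are invented for
illustration; they are not part of Bison): the hook gets the offending
character, renders it in a readable encoding, and by default just delegates
to ordinary yyerror-style reporting.

  #include <stdio.h>

  static void my_yyerror(const char *msg)          /* stand-in for yyerror */
  {
      fprintf(stderr, "parse error: %s\n", msg);
  }

  /* Hypothetical yyerror_undefined: also receives the offending character. */
  static void my_yyerror_undefined(const char *msg, unsigned long code)
  {
      char buf[64];
      if (code >= 0x20 && code < 0x7F)
          snprintf(buf, sizeof buf, "%s: undefined character '%c'",
                   msg, (char)code);
      else
          snprintf(buf, sizeof buf, "%s: undefined character U+%04lX",
                   msg, code);
      my_yyerror(buf);                             /* default: plain yyerror */
  }

  int main(void)
  {
      my_yyerror_undefined("syntax error", '$');
      my_yyerror_undefined("syntax error", 0x2603UL);
      return 0;
  }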

>>> As a result, there is no such issue as a Unicode compliant parser.
>
>Hans> Bison is already Yacc "char" compliant, starting at 257. So I
>Hans> think there should be a corresponding Unicode feature: Unicode
>Hans> has so many characters, that one needs a convenient way of
>Hans> handling them.
>
>All this discussion is anyway not taking into account the impact that
>Unicode can have on the size of the Bison tables.  From the theory
>point of view, I'm very much against Unicodization, from the practical
>point of view, I'm not even sure it is doable.  And most importantly,
>I'm sure that if it's done in the scanner, these problems vanish.

I am not sure what you are speaking about here: the regular token numbers
are translated, so Unicode will have no impact on them at all.

The only impact that can happen is from the use of many tokens, but that will
happen regardless of whether one uses the Bison character feature or not.
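By "translated" I mean that the codes the scanner returns, however large or
sparse, are mapped to small, dense internal symbol numbers before the parse
tables are consulted. The toy lookup below only sketches that idea with
invented codes; a generated parser typically does this with a flat
translation array (yytranslate) rather than a search:

  #include <stdio.h>
  #include <stddef.h>

  struct translation { int external_code; int internal_symbol; };

  /* Invented token set: two ordinary multi-character tokens plus one
     character used directly as a token, with a large code. */
  static const struct translation table[] = {
      { 258,    3 },   /* IDENTIFIER */
      { 259,    4 },   /* NUMBER     */
      { 0x4E2D, 5 },   /* a character token with a large code */
  };

  static int translate(int external_code)
  {
      for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
          if (table[i].external_code == external_code)
              return table[i].internal_symbol;
      return 2;        /* stand-in for the "undefined token" symbol */
  }

  int main(void)
  {
      printf("258    -> %d\n", translate(258));
      printf("0x4E2D -> %d\n", translate(0x4E2D));
      printf("0xFFFF -> %d\n", translate(0xFFFF)); /* unknown -> undefined */
      return 0;
  }

Either way, the parse tables themselves are indexed by the small internal
numbers, which is my point about the regular tokens.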

>Yacc and Lex, as Parsers and Scanners, struggled from a clean
>separation of the two different tasks.  We should keep these task
>separate.

This is a purist point of view: Yacc and Lex are not theoretical tools, but
tools that should be convenient for the user. Besides, if you merge the
scanner/parser language hierarchy, then from the theoretical point of view
you get a single language.

>Note that this is very different from referring the input formalism.
>We can very well imagine a FlexBison that input a single file for both
>the scanner and the parser.  But still there would be a parser and a
>scanner in the output.  So called scannerless parsers do have a
>separation somewhere between lexical and syntactic.

I am not speaking about a Flex/Bison merge here. Those who are not using the
character feature will not be hurt by token numbers starting at 2^n + 1, as
you know, given that you removed --raw.

Otherwise, the separation between scanners and parsers is a traditional one,
dictated by efficiency (it makes the scanner faster), but not something one
is forced into from the theoretical point of view. In fact, I recall somebody
(on comp.compilers or somewhere) writing about a fully integrated system,
with actions attached to the scanner characters I figure, but it would
probably not be very efficient. Then again, on a fast computer it perhaps
makes little difference.

  Hans Aberg