Re: RFC: enum instead of #define for tokens

help-bison
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: RFC: enum instead of #define for tokens

From:	Paul Eggert
Subject:	Re: RFC: enum instead of #define for tokens
Date:	Thu, 4 Apr 2002 12:39:38 -0800 (PST)
> From: Akim Demaille <address@hidden>
> Date: 04 Apr 2002 12:11:45 +0200
> 
> Also, under some conditions (which are pretty rare, I agree),
> symbols and rules have to be renumbered.  This means that you need
> to walk through all the arrays, renumbering from the old number, to
> the new number.  Using pointers, I no longer have to do that: you
> just change the member `number' of symbols/rules, and your done.

Hmm, does this action occur at run-time?  I thought that these arrays
were readonly.  For example, yyr1 is readonly.

If you're talking about something that occurs while Bison is running,
that's a different matter.  But if you're talking about the generated
parser, then it can improve performance quite a bit in some cases to
use integers rather than pointers.

> The motivations for this change is (i) making the maintenance of Bison
> easier, thanks to a minimum of type checking service from the
> compiler

I'm afraid that I am still a bit lost here.  When you talk about
maintaining Bison, it sounds like you're talking about the Bison
executable itself, and here it's clearly OK to shift to a cleaner
representation even if it's slower.  But I thought we were talking
about the generated parser.

This is an unusual situation because we are talking about
machine-generated code.  In my experience, in the Bison-generated part
of the parser, type checking mostly just gets in the way.  Bison
generates standard code that works with any C compiler.  If the C
compiler rejects the code due to type checking, it's usually a bug in
the C compiler, not a bug with Bison.

You deal with buggy bleeding-edge Bison implementations more often
than I do, so no doubt things are different with you.  But we also
have to consider the maintenance needs of Bison users, and I think
they more often are similar to my experience in this area.


> (ii) making extensions of Bison much easier: you don't
> need to know all the arrays that exist in there to recovered the
> values associated to this or that guy: you have the guy, so you have
> all you need about it.

I don't know what sort of extensions you're considering, so I'm at a
bit of a loss here.  But it seems to me that, if the tables are
read-only, it's pretty straightforward to give users either the
integer representation or the pointer representation.  You can do
something like this, for example:

  static short yyr1[] = { 0, 54, ... };

  static inline struct yysymbol *
  yyr1p (struct yyrule const *r)
  {
    return yysymbol_base + yyr1[r - yyrule_base];
  }

yyr1p uses the pointer representation, but it's logically equivalent
to the integer representation.  And, though it looks slower, yyr1p
might conceivably be faster overall than the naive pointer
representation, since it might require less memory traffic.

> We already know that we have to escape from short in most places

Only for the parts of grammars that are large.  And we can even shrink
short to char in some cases, which will make things smaller.

> Can this be really a problem?

Yes, I'm afraid so.  It's not simply that pointers are larger than
integers and take more memory.  It's also that they are much more
expensive to initialize when linking object modules dynamically.  Not
only must you spend CPU time at runtime to relink them; you must also
make a copy of each shared-library page that contains relocatable
pointers, since it can't be shared among different processes.  This
loss of sharing can be a real drain.

> Do we have to forget about the natural C programming, heavily based
> on pointers, to move to indices in arrays :(

If you're talking about initialized data, then yes we do have a real
performance concern here.  If you're talking about malloc'ed data it's
not nearly as big a worry, though for large parsers I think the
smaller tables will still often be an overall win.  (Of course this
ought to be measured....)

> I really want to handle my guys, not artificial numbers, and I want
> the compiler to type check what I do.  Even more: I want the type to
> tell me what I'm manipulating, using shorts only makes it a nightmare
> to find the meaning.  And typedefing shorts is not a solution, as the
> compiler does not help.

Sorry.  You can work around some of the problem by wrapping the
integers inside little structures.  But I don't recommend this; it
makes the code harder to read and some compilers don't optimize it as
well.

> POSIX is probably not referring to Unicode anyway.

Officially, POSIX is agnostic about the character set.  You could be
using Unicode, or EBCDIC, or Macintosh Roman for all it cares.

> And IIRC, POSIX mandates 257 as first symbol number, so if we move
> to Unicode char-tokens, we are no longer POSIX compliant.

No, POSIX merely says that symbol numbers have to be greater than 256.
It does not require that Bison must use 257, 258, ....

POSIX does say that the default value of the error token must be 256,
and that if you use that default, the lexical analyzer should not
return 256 and thus 256 cannot be used as a character.  I don't think
this is a real problem in practice, but if it is a user can work
around it by changing the value of the error token with a %token
declaration.


> Paul> At best we can warn in the documentation that it doesn't work if
> Paul> you change encodings between the Bison run and the cc+runtime
> Paul> runs.
> 
> Or we should find a means not to output the characters as
> shorts/integer, but as the characters themselves.

To do that without loss of efficiency (on modern hosts, anyway), I
think we'd need to use designated initializers, and we might have to
renumber things.  E.g., instead of this:

  static const char yytranslate[] =
  {
         0,     2,     2,     2,     2,     2,     2,     2,     2,     2,
        44,     2,     2,     2,     2,     2,     2,     2,     2,     2,
         2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
         2,     2,     2,     2,     2,     2,     2,     2,    42,     2,
        ...
  }

we would need something like this:

  #ifndef YYHAVE_DESIGNATED_INITIALIZERS
  # define YYHAVE_DESIGNATED_INITIALIZERS (199901 < __STDC_VERSION__)
  #endif

  static const char yytranslate[297] =
  {
  #if YYHAVE_DESIGNATED_INITIALIZERS
       /* This works even with EBCDIC.  */  
       [ 0 ] = 2,
       [ '\n' ] = 44,
       [ '&' ] = 42,
        ...
  #else
        /* This assumes ASCII.  */
         2,     0,     0,     0,     0,     0,     0,     0,     0,     0,
        44,     0,     0,     0,     0,     0,     0,     0,     0,     0,
         0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
         0,     0,     0,     0,     0,     0,     0,     0,    42,     0,
        ...
  #endif
  }

Unfortunately ISO standard C does not provide a way to say the default
value is 2; the default value is always 0.  So we'd have to renumber
the output of yytranslate, e.g. switching 2 and 0 as in the above example.

> Make it a muscle, and use it in the skeleton.

OK, but I'll run it by bison-patches first.

Also, before I do that, here is a discrepancy between
bison-1_29-branch and my private copie of the merged version of
bison.simple.  Is this discrepancy intended?  I am concerned about the
first two changed lines in this hunk; the others are OK I guess.


--- bison-1_29-branch/src/bison.simple  2002-04-04 12:23:07.218001000 -0800
+++ bison.tmp/data/bison.simple 2002-04-03 10:26:17.333000000 -0800
...
@@ -694,18 +910,23 @@ yyreduce:
       int yyi;
 
       YYFPRINTF (stderr, "Reducing via rule %d (line %d), ",
-                yyn, yyrline[yyn]);
+                yyn - 1, yyrline[yyn]);
 
       /* Print the symbols being reduced, and their result.  */
-      for (yyi = yyprhs[yyn]; yyrhs[yyi] > 0; yyi++)
+      for (yyi = yyprhs[yyn]; yyrhs[yyi] >= 0; yyi++)
        YYFPRINTF (stderr, "%s ", yytname[yyrhs[yyi]]);
       YYFPRINTF (stderr, " -> %s\n", yytname[yyr1[yyn]]);
     }
 #endif
-%% actions /* The action file replaces this line. */
-#line
+  switch (yyn)
+    ]{
+      b4_actions
+    }
+
+/* Line __line__ of __file__.  */
+#line __oline__ "b4_output_parser_name"
 
-  yyvsp -= yylen;
+[  yyvsp -= yylen;
   yyssp -= yylen;
 #if YYLSP_NEEDED
   yylsp -= yylen;
@@ -733,11 +954,11 @@ yyreduce:
 
   yyn = yyr1[yyn];
 
-  yystate = yypgoto[yyn - YYNTBASE] + *yyssp;
+  yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
   if (yystate >= 0 && yystate <= YYLAST && yycheck[yystate] == *yyssp)
     yystate = yytable[yystate];
   else
-    yystate = yydefgoto[yyn - YYNTBASE];
+    yystate = yydefgoto[yyn - YYNTOKENS];
 
   goto yynewstate;
[Prev in Thread]
Current Thread
[Next in Thread]
Re: RFC: enum instead of #define for tokens, (continued)
Prev by Date: Re: RFC: enum instead of #define for tokens
Next by Date: webTrax.TV! -SUPERIOR ecommerce and web analysis tool!
Previous by thread: Re: RFC: enum instead of #define for tokens
Next by thread: Re: RFC: enum instead of #define for tokens
Index(es):
- Date
- Thread