Re: Disappearance of rhs[], prhs[], and user control of table generation

help-bison
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Disappearance of rhs[], prhs[], and user control of table generation

From:	David M. Warme
Subject:	Re: Disappearance of rhs[], prhs[], and user control of table generation.
Date:	Sun, 17 Apr 2011 22:00:39 -0400
Akim and Bison fans,

> I would really like to know more about your uses of Bison.
> What kind of use can require these tables (yyrhs and yyprhs)?

The short answer is that we have cases where we construct special parse
trees that contain raw Bison rule numbers and symbol numbers.  We cannot
decipher these parse trees without access to all of the following
tables: yytname[], yyr1[], yyr2[], yyprhs[] and yyrhs[].

Here is a longer explanation of how this situation materializes within
our domain.

----------------

We have several large applications, each of which digests many flat text
input files, each of which has its own distinct grammar.  We use Bison
to parse all of these input files, regardless of whether their grammar
is trivial or complex.  In total, we have several hundred Bison
grammars.  The core parsing functionality resides within an application
framework that is shared by the several applications.

Attributes of this parsing infrastructure include:

- Grammar include files.  Grammars reside in .g files, and grammar
  headers (containing common grammatical idioms) reside in .gh files.  A
  set of m4 macros combine them into a single .y file we feed to Bison.

        - One noteworthy example: we have a .gh file that provides a
          rather capable programming language (vaguely C-like).  This is
          one reason many of our grammars are "large" and "complex".

        - The grammar in some of our .gh files is "templated" (they are
          included with arguments).  This gives us the grammatical
          equivalent of "templated container classes" -- common
          syntactic scaffolding containing invoker-defined grammatical
          sub-phrases within them.

- Each grammar defines a distinct C++ class, each having a yyparse()
  function.  These all derive from a common abstract Parser class.  This
  is vastly cleaner than trying to manage several hundred Bison "yy"
  style prefixes.

- We use our own parser skeleton.  We do not often upgrade to newer
  versions of Bison -- but when we do we are willing to adapt this
  skeleton as required by the new Bison version.  This cost is amortized
  over several hundred grammars, and over intervals of several years
  between Bison upgrades.

- All of the grammars use a single, common lexical analyzer.  It
  natively recognizes C/C++ style comments, and the following "built-in"
  token types:

        - identifiers                           IDENT
        - double-quoted strings                 STRING
        - single-quoted strings                 CCONST
        - integer constants                     ICONST
        - floating-point constants              FCONST

  but it is also smart enough to discover all of the remaining tokens
  (and the token symbol numbering) directly from Bison's generated
  parser tables.  We just write grammar, and never think about lexical
  issues.

- The grammars construct fairly conventional parse trees consisting of
  C++ structs.  We keep them simple and completely independent from the
  application code so that we can use any of these parsers and parse
  trees within simple ad-hoc tools without having to link in any portion
  of the main application's code.

- Our users generally prefer to work with "flat text" input files, since
  this allows using tools like "grep", "sed", "awk" and relational
  databases to gather needed data from various sources.

- One exception to this is geometric or geographic data.  For example, a
  road network can be represented as a graph, where each vertex has an
  associated latitude/longitude.  Using "vi" to edit a long list of
  latitude/longitude pairs is tedious.  In such cases it is better to
  use a GUI tool that can display it all on a map, and let user's click,
  drag and drop things.

- Since the data files are complex, the users put lots of comments in
  them to help remember important assumptions and other info.

- Most GUI tools that edit such "flat text" files use a standard parser
  to read the file (ignoring whitespace and comments), edit the
  structure, and then write the modified structure back out in some
  "aribitrary" format.  This methodology completely loses all comments
  and the indentation structure of the file.  Our users would never use
  any GUI tool that is this brutal to their data files.

Thus we are faced with the problem of doing "structural" editing of flat
text files in a manner that preserves whitespace, indentation and
comments.  This is the particular niche where we absolutely require the
yyprhs[] and yyrhs[] tables.

Our solution is a "generic parser", which produces a "generic parse
tree".  Generic parse trees simultaneously represent both the
phrase-structure of an input file, and its exact character sequence,
including all whitespace and comments.  A single common
generic_yyparse() function can produce a generic parse tree for any of
our several hundred Bison grammars.

Generic parse trees have a special "root" node.  The rest of the generic
parse tree contains only two kinds of nodes: one for tokens, and the
other for nonterminals.  These two kinds of nodes share a common base
class:

struct GenericParseTree {
        const ParserTables * pt;
                                // Tables describing parser/grammar
        Symbol *  root;         // Top level rule (should always be
                                // a non-terminal)
        Name *    endWhite;     // whitespace between final token
                                // and EOF token
};

struct Symbol {
        bool    isToken;        // True for tokens, false for nonterms
        int     symbol;         // Symbol this subtree represents
        int     state;          // State this symbol was recognized
                                // in (i.e., before shift or goto
                                // action)
        NonTerm * parent;       // Parent parse tree
};

struct Token : public Symbol {
        Name * whitespace;      // White space preceeding this token
        Name * token;           // Text of the token iteself
};

struct NonTerm : public Symbol {
        int     rule;           // Rule number whose reduction
                                // produced this non-terminal
        int     nkids;          // number of children
        Symbol ** kids;         // Array of child parse trees
};

The Name object is a hash table entry, and encapsulates an arbitrary
character string (byte sequence).  Only one copy of each such byte
sequence exists, so that two Names are equal if-and-only-if they have
the same address.  Like the EQ function in LISP, our application can
therefore just use pointer comparison to compare strings of variable/
arbitrary length.

Each "shift" action produces a Token node, and each "reduce" action
produces a NonTerm node.  All {...} "action code" regions in the Bison
grammar are ignored by the generic parser.

Here is the key -- a simple traversal of this tree that (at each Token
node) prints out (a) the text in the "whitespace" member, followed by
(b)
the text in the "token" member, and finally the "endWhite" member --
produces the exact character sequence of the original input file.
Everything is preserved identically.  But now we also have access to the
grammatical phrase structure of the input file as well (although perhaps
in a slightly cumbersom form).

The GenericParseTree therefore simultaneously represents:

a. The exact character sequence of the input file,
b. The set of Bison parser tables / grammar rules used to parse it,
c. The exact derivation (i.e., parse tree) of this string according to
   these grammar rules.

Items b and c are a bit cumbersome, but with a few more tools they are
not too difficult to work with.  To find things in this generic parse
tree, one could just "hack" it -- deal with hard-coded symbol numbers
and/or hard-coded rule numbers.  But even the smallest changes to the
grammar will cause Bison to completely renumber everything.  The better
approach is to code things up using actual grammar symbol names.  The
pt->yytname field lets you convert these into symbol numbers correctly
(until you change the name of that symbol in the grammar file or get rid
of it).  Given a rule number k, some fairly simple code such as:

        lhs_symbol  = pt -> yyr1 [k];
        nrhs = pt -> yyr2 [k];
        base = pt -> prhs [k];
        for (i = 0; i < nrhs; i++) {
                rhs_symbol [i] = pt -> rhs [base + i];
        }

gives you access to the definition of rule k, in terms of symbol
numbers.

Given some grammar fragment such as:

        latlon_list:    /* empty */
                | latlon_list latlon
                ;

        latlon: latitude longitude
                ;

        latitude : ...
        longitude : ...

One can actually do a "pattern match" to (a) find the rule number that
matches the pattern:

        latlon_list: * latlon *

(b) make sure there is only one such rule, and (c) return its rule
number.  (The match is performed across all rules k using code similar
to above.)  Now it is a fairly simple matter to traverse the generic
parse tree, looking for all occurrences of this rule k (as a NonTerm
node) and return a list of all of them.  Our infrastructure even knows
how to examine the grammar rules and compute a "derives" relation (just
like the one used inside the lalr logic in Bison) -- and it can use this
relation to avoid recursing down into subtrees that can never include
any occurrences of rule k during this search of the parse tree.

This is where we depend crucially upon having access to both the
yyprhs[] and yyrhs[] tables.  We could never make sense of a generic
parse tree without them.  The yytname[], yyr1[] and yyr2[] tables are
equally crucial.

We have other tools for examining the comment and indentation structure
at any point in the generic parse tree, including pattern recognition
stuff that enable us to "continue the pattern" of surrounding
indentation/comments, for example, when inserting a new element into
such a latlon_list phrase.

Once you have modified the structure of the generic parse tree as
desired, it is a simple matter to write the new character sequence form
of it back out to the flat text file.

To round things out, here is the basic structure we use to encapsulate
all of the parser tables that we require from each Bison grammar:

struct ParserTables {
 int            yyntokens;      // Number of tokens in grammar
 int            yynnts;         // Number of non-terminal symbols
 int            yynrules;       // Number of rules in grammar
 int            yynstates;      // Number of states in automaton
 int            yylast;         // Index of last entry in yytable
 int            yyfinal;        // Final state of state machine
 int            yypact_ninf;    // Special flag value in yypact[]
 int            yytable_ninf;   // Special flag value in yytable[]
 int            yyntbase;       // Symbol number of first nonterm
 int            nsyms;          // Number of elements in yytname[]
 const short *  yyprhs;         // The yyprhs[] table
 const short *  yyrhs;          // The yyrhs[] table
 const char * const *
                yytname;        // The yytname[] table
 const short *  yyr1;           // The yyr1[] table
 const short *  yyr2;           // The yyr2[] table
 const short *  yyrline;        // The yyrline[] table
 const short *  yydefact;       // The yydefact[] table
 const short *  yydefgoto;      // The yydefgoto[] table
 const short *  yypact;         // The yypact[] table
 const short *  yypgoto;        // The yypgoto[] table
 const short *  yytable;        // The yytable[] table
 const short *  yycheck;        // The yycheck[] table
 const char *   parserName;     // Name of the parser class that
                                // defines this grammar.
};

Our parsing skeleton includes the following snippet:

#define yyquotify(x)    yyquotify2(x)
#define yyquotify2(x)   #x

static const ParserTables       yyalltables = {
        YYNTOKENS, YYNNTS, YYNRULES, YYNSTATES, YYLAST, YYFINAL,
        YYPACT_NINF, YYTABLE_NINF, YYNTOKENS,
        (sizeof(yytname)/sizeof(yytname[0])) - 1,
        yyprhs, yyrhs, yytname, yyr1, yyr2,
#if YYDEBUG != 0
        yyrline,
#else
        NULL,
#endif
        yydefact, yydefgoto, yypact, yypgoto, yytable, yycheck,
        yyquotify(CLASS),
};

#undef yyquotify2
#undef yyquotify

        const ParserTables *
CLASS::getParserTables ()
{
        return &yyalltables;
}

This is a very small additional size increment added to each of our
parsers (other than the obvious penalty of forcing all our tables to be
"short" so that they conform to this mold).  This gives us the ability
to obtain "all of the parser tables" for any of our grammars in a
uniform way.  Given (1) the pathname of a file to parse, and (2) a
pointer to the ParserTables object describing the grammar to use, our
common generic parser routine can construct a generic parse tree for the
file.  The ParserTables object also provides access to the yytname[],
yyr1[], yyr2[], yyprhs[] and yyrhs[] tables, all of which are all
necessary to make sense of this generic parse tree.

This is a very brief overview describing just a fraction of the many
ways that we use and abuse Bison in our application.

You should also be aware of the following facts regarding the new
technique Bison uses (in the master branch) to print reduced rules
(using yyssp and yystos[] instead of yyprhs[] and yyrhs[]):

1. This technique only works "in flight" while you are actually parsing
   a file, because only there do you have a parsing stack available that
   represents a valid automaton configuration from which you can "read
   off" the correct sequence of states that gets you the right-hand-side
   of the rule.  Outside of this context, there is no simple way to
   statically determine what rule k looks like.  The yyprhs[] and
   yyrhs[] tables provide this very directly.

2. There is at least one state machine optimization technique that I am
   aware of for which the accessing symbol of states is no longer
   guaranteed to be unique.  (In fact, the set of symbols that access a
   state can be an arbitrary mix of both tokens and nonterminals.)  In
   this context, it is not possible to produce a well-defined yystos[]
   table.  If Bison ever adopts such optimizations in the future (e.g.,
   bison -O2), it will have to fall back to a more reliable method of
   printing rules during reduction, such as with yyprhs[] and yyrhs[].

The optimization I am alluding to removes all "unit reductions" from
the parsing automaton.  These are reductions of the form:

                a: b ;

and that have no action code (other than the default action of $$ = $1).
The symbol b can be either a nonterminal or a token.  Such unit
reductions serve only to symbolically "rename" the symbol at the top of
the stack.  These "no-op" reduce operations can simply be omitted with
no ill effect.  There are lots of other good things that come from doing
this, but that should be a separate discussion.

David


David M. Warme, Ph.D.
Principal Computer Scientist
Group W, Inc.
Fairfax, Virginia, USA.
[Prev in Thread]
Current Thread
[Next in Thread]
Disappearance of rhs[], prhs[], and user control of table generation., David M. Warme, 2011/04/06
- Re: Disappearance of rhs[], prhs[], and user control of table generation., John P. Hartmann, 2011/04/06
  - Re: Disappearance of rhs[], prhs[], and user control of table generation., Akim Demaille, 2011/04/06
    - Re: Disappearance of rhs[], prhs[], and user control of table generation., John P. Hartmann, 2011/04/06
- Re: Disappearance of rhs[], prhs[], and user control of table generation., David M. Warme <=
Prev by Date: Re: FW: bison-2.3 build error: src/bison not found
Next by Date: Problems with an unrecognised Token
Previous by thread: Re: Disappearance of rhs[], prhs[], and user control of table generation.
Next by thread: Re: FW: bison-2.3 build error: src/bison not found
Index(es):
- Date
- Thread