help-bison

Proposals for various changes to the Java parser


From: Di-an JAN
Subject: Proposals for various changes to the Java parser
Date: Mon, 27 Oct 2008 10:12:41 -0700 (PDT)

Ideas for changing the generated Java parsers.  I can implement most of
these.  Comments on the interface and semantics would be appreciated.


1. The parser class can be declared public and/or abstract by using the
``%define public'' and ``%define abstract'' directives.  I can add the
other Java class modifiers with ``%define final'', ``%define strictfp'',
and ``%define annotations "@..."'' directives to be complete (plural, since
multiple %define's are not combined, so all annotations must be specified
in the same %define).
Or, we can have a single ``%define parser_class_modifiers "..."'' to
specify all modifiers together.  Or both.
Implemented ``%define final/strictfp''.


2. The parser class name currently defaults to ``YYParser'' (actually,
``b4_prefixParser'', but I already submitted a patch for that bug).
Do people prefer to make it match the Java file name instead?  Of course,
characters not allowed in Java names must be removed or replaced by ``_''.


3. Add ``%define extends "Super"'' and ``%define implements "Interfaces"''.
Implemented.


4. Add the ``%define lex_throws "Exceptions"'' exceptions to parse() and
stop using them as the default of ``%define throws "Exceptions"''.

Throws specifications are needed when using user code:

parser action   --> yyaction()  --> parse()
%initial-action --> parse()
yylex()         --> yylex()     --> parse()

There are also uses of the other members of Lexer and Position,
but they are unlikely to throw exceptions.

Obviously, parse() gets exceptions from both ``throws'' and ``lex_throws''
and there's no reason to make users duplicate ``lex_throws'' in ``throws''.
Also, yyaction() needs ``throws''.
There's probably no need for ``initial_action_throws'', etc.

                        BEFORE                  AFTER

yylex()                 lex_throws              lex_throws
parse()                 throws                  lex_throws, throws
yyaction()              (none)                  throws

default lex_throws      java.io.IOException     java.io.IOException
default throws          lex_throws              (none)

Implemented.
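
A minimal sketch of how the AFTER column plays out in Java source.  The
names ThrowsSketch and SemanticException are hypothetical stand-ins, not
actual skeleton output; SemanticException plays the role of whatever is
named in ``%define throws''.

```java
import java.io.IOException;

// Hypothetical user exception, as if named in %define throws.
class SemanticException extends Exception {
    SemanticException(String message) { super(message); }
}

// Hypothetical stand-in for the generated parser class.
class ThrowsSketch {
    interface Lexer {
        // lex_throws: defaults to java.io.IOException.
        int yylex() throws IOException;
    }

    private final Lexer lexer;

    ThrowsSketch(Lexer lexer) { this.lexer = lexer; }

    // User actions run here, so yyaction() carries ``throws''.
    private int yyaction(int token) throws SemanticException {
        if (token < 0)
            throw new SemanticException("negative token " + token);
        return token;
    }

    // parse() calls both, so it carries lex_throws and throws.
    int parse() throws IOException, SemanticException {
        int sum = 0;
        for (int token = lexer.yylex(); token != 0; token = lexer.yylex())
            sum += yyaction(token);
        return sum;
    }
}
```

With this split, a user who only sets ``lex_throws'' does not have to repeat
it in ``throws'' for parse() to compile.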


5. Work around Java's ``code too large'' limitation for large parser tables.
http://lists.gnu.org/archive/html/help-bison/2008-10/msg00005.html
Bison could try to estimate how much bytecode is needed and generate
code accordingly, but that depends on the actual Java compiler
and the amount of user static initialization, so that's not a good idea.
The syntax in the text (not appendix) of the ``Java Language Specification,
Second Edition'' is just at the limit, depending on how it's converted from
EBNF and conflicts removed.  The awk, cim, and pic grammars from
tests/existing.at are all under the limit.
Implemented ``%define parser_tables "small/medium/large"'' except docs
and tests.  Only need changes to the Java skeleton.
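
This post does not say what each ``parser_tables'' setting generates, so
the following is only a sketch of two common workarounds, not the
implemented scheme: give each table its own initializer method, and decode
a table from a string literal.  The class name, table contents, and the
unpack() helper are all hypothetical.

```java
// Hypothetical table contents; a real parser has thousands of entries.
class TableSketch {
    // Each table gets its own initializer method, so no single method
    // (including the class initializer) holds all of the bytecode.
    static final short[] yypact = yypactInit();

    // A string literal is stored in the constant pool, not as array-store
    // instructions, so decoding it takes a small, fixed amount of code.
    static final short[] yytable = unpack("\1\2\3\0\5");

    private static short[] yypactInit() {
        return new short[] { 10, 20, 30 };
    }

    private static short[] unpack(String packed) {
        short[] table = new short[packed.length()];
        for (int i = 0; i < table.length; i++)
            table[i] = (short) packed.charAt(i);
        return table;
    }
}
```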


6. Currently ``%union'' is silently ignored, and Java types are used as
the TYPE in ``$<TYPE>'', ``%token<TYPE> ...'' and ``%type<TYPE> ...''.
I propose to interpret these ``<TYPE>'' as a field name in ``%union'',
interpreting it as a Java type if no such field name exists.
First, this matches the behavior of C/C++ parsers, even though Java doesn't
actually have union types.  Also makes it easier to convert from C/C++.
Second, this allows the use of generic types since ``<TYPE>'' does not
allow ``>'' in TYPE.  For example:

%language "Java"
%code imports { import java.util.Vector; }
%union {
String          str;
Vector<String>    list;
}
%token<str> STR
%type<list> str_list
%%
str_list
: STR                   { Vector v = $$ = new Vector<String>(); v.add($1); }
                // not  { $$ = new Vector<String>(); $$.add($1); }
| str_list ',' STR      { $$ = $1; $1.add($3); }
;


7. Should we make sure ``$$'' has the right type?  For example:

%language "Java"
%code imports { import java.util.Vector; }
%token<String> STR
%type<Vector> str_list
%%
str_list
: STR                   { $$ = new Vector(); $$.add($1); }
        // currently    { Vector v = $$ = new Vector(); v.add($1); }
        // or           { $$ = new Vector(); ((Vector)$$).add($1); }
| str_list ',' STR      { $$ = $1; $$.add($3); }
        // currently    { $$ = $1; $1.add($3); }
;

Since Java has no unions, Bison has to use casts, but casts are not
allowed on the LHS of assignments, so ``$$'' has to have the base type
and must be explicitly cast to the right type by the user (see above).
Instead, we could make ``$$'' a local variable in each action, which
has to be assigned to the actual ``$$'' at the end of the action,
so it is less efficient.
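
A rough sketch of what the local-variable alternative could expand to for
the ``str_list: STR'' action.  The names LocalValueSketch, yylocal, and
dollar1 are hypothetical; yyval stands in for the untyped slot that ``$$''
maps to today.

```java
import java.util.Vector;

class LocalValueSketch {
    // Stand-in for the untyped semantic value slot.
    static Object yyval;

    // Possible expansion of { $$ = new Vector<String>(); $$.add($1); }
    // when ``$$'' is a typed local copied back after the action.
    static void action(Object dollar1) {
        Vector<String> yylocal;             // typed ``$$'' for this action
        yylocal = new Vector<String>();     // $$ = new Vector<String>();
        yylocal.add((String) dollar1);      // $$.add($1);
        yyval = yylocal;                    // the extra copy at the end
    }
}
```

The user code needs no casts; the cost is the one extra assignment per
action.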


8. Remove ``public static final boolean bison = true;''  This corresponds
to ``#define YYBISON 1'' in C parsers, which can be used for conditional
compilation.  There is no conditional compilation in Java, though you
probably can use reflection.  Might as well use ``bisonVersion'' anyway.
Or keep it for compatibility, and document as ``public'' interface.


9. Document ``bisonVersion'' and ``bisonSkeleton'' as part of the ``public''
interface.


10. If ``%error-verbose'' is not used, do not generate code for it.
Document that ``errorVerbose'' can be changed given ``%error-verbose''.
Or make it ``yyErrorVerbose'' and provide getter and setter like ``%debug''.


11. If ``%debug'' or -t/--debug is not used, do not generate code for it.
How to turn debugging on and off is already documented.


12. Don't generate token names when not needed.  If ``%token-table''
or -k/--token-table is used, also generate the following functions:

/** Returns the token number (for returning from yylex) for NAME.
    NAME does not have to be quoted, but when unquoted, it is first
    matched with a double-quoted literal string token, then a single-quoted
    character token type, then a named token type.  */
public int getTokenNumber(String name);

/** Returns the token number (for returning from yylex) for NAME.
    NAME does not have to be quoted, and only matches a double-quoted
    literal string token from the grammar.  */
public int getStringTokenNumber(String name);

By the way, the example in the ``Interface / Lexical / Calling Convention''
node of the manual is wrong.  It gives the internal token number, not the
one returned by yylex.  It needs to pass through ``yytoknum''.
C/C++ probably should define these functions too.

The point is that Bison should provide what people need, so they
don't have to mess with the internal structures.
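
A minimal sketch of the two lookups, assuming tables in the style of
``yytname'' and ``yytoknum''.  The table contents are made up for
illustration, the methods are static only to keep the example short, and
returning -1 for an unknown name is my assumption.

```java
class TokenTableSketch {
    // Hypothetical tables: symbol spellings and their external numbers.
    private static final String[] yytname = {
        "$end", "error", "$undefined", "NUM", "'+'", "\"<=\""
    };
    private static final int[] yytoknum = { 0, 256, 257, 258, 43, 259 };

    // Maps a spelling to the external number, as returned by yylex.
    private static int find(String spelling) {
        for (int i = 0; i < yytname.length; i++)
            if (yytname[i].equals(spelling))
                return yytoknum[i];
        return -1;
    }

    // Only matches double-quoted literal string tokens.
    static int getStringTokenNumber(String name) {
        return find(name.startsWith("\"") ? name : "\"" + name + "\"");
    }

    // Unquoted names try string literal, then character, then named token.
    static int getTokenNumber(String name) {
        if (name.startsWith("\"") || name.startsWith("'"))
            return find(name);
        int number = find("\"" + name + "\"");
        if (number < 0) number = find("'" + name + "'");
        if (number < 0) number = find(name);
        return number;
    }
}
```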


13. Make the yyerror functions public.  Otherwise, it's a member function
of the lexer, which is not available if defined with ``%code lexer {...}''.
Even without ``%code lexer {...}'', users shouldn't have to save a reference
to the lexer just to call its yyerror when it's already saved in the parser.


14. Allow user-defined location class.  Currently, you can only change its
name by using ``%define location_type''.  We need this to print abbreviated
ranges, where common file names and line numbers are not printed.  Or to
use encoded int or long for Position.
Perhaps ``%define location_type'' can be changed to mean that the default
class should not be generated.  Or use ``%define no_default_location_type''.
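
As an illustration of the abbreviated ranges mentioned above, here is a
sketch of a user-supplied location class.  PositionSketch, LocationSketch,
and the ``file:line.column'' output format are assumptions of mine, not
what the skeleton generates.

```java
// Hypothetical user-defined position: file, line, and column.
class PositionSketch {
    final String file;
    final int line;
    final int column;

    PositionSketch(String file, int line, int column) {
        this.file = file;
        this.line = line;
        this.column = column;
    }
}

// Hypothetical user-defined location with abbreviated printing.
class LocationSketch {
    final PositionSketch begin;
    final PositionSketch end;

    LocationSketch(PositionSketch begin, PositionSketch end) {
        this.begin = begin;
        this.end = end;
    }

    // Omits the file name and line number of the end position
    // when they are the same as those of the begin position.
    @Override
    public String toString() {
        String s = begin.file + ":" + begin.line + "." + begin.column;
        if (!begin.file.equals(end.file))
            return s + "-" + end.file + ":" + end.line + "." + end.column;
        if (begin.line != end.line)
            return s + "-" + end.line + "." + end.column;
        if (begin.column != end.column)
            return s + "-" + end.column;
        return s;
    }
}
```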


15. Define a default position class with line and column, as in C, unless
``%define position_type'' is used.  This is not backward compatible though.
To be fully backward compatible, we need ``%define default_position_type''
and ``%define no_default_location_type''.  Ugly.


16. Support ``%printer''.  Even though the virtual toString() method is
used to print symbols, symbols of the same semantic type may have to be
displayed differently (for example, see Bison's grammar of itself) and
we shouldn't have to define new (sub)classes just to print differently.
Also, we may want something different from the ``natural'' toString(),
for example, quoted characters and strings (can't even override the
``toString()'' of Character and String because they're final classes).
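
To make the point concrete, here is a sketch of the kind of per-symbol
dispatch that ``%printer'' blocks could be compiled into.  The class name,
the print() method, and the symbol names STRING and CHAR are hypothetical.

```java
class PrinterSketch {
    // Chooses the printing by symbol, not by the class of the value, so
    // String and Character values can be quoted without subclassing.
    static String print(String symbol, Object value) {
        switch (symbol) {
            case "STRING":
                return "\"" + value + "\"";
            case "CHAR":
                return "'" + value + "'";
            default:
                return String.valueOf(value);   // the natural toString()
        }
    }
}
```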


17. Allow gcj version < 4.3 to be used in the testsuite.  Currently
(in gnulib/m4/javacomp.m4), for gcj version < 4.3, only
``-source 1.4 && -target 1.4'' and ``-source 1.3 && -target 1.4''
are allowed.  According to the comments there, these gcj versions really do
target 1.4.  I guess we can relax that, or use -source 1.3 -target 1.4
for bison's configure.ac ``gt_JAVACOMP([1.3], [1.4])''.
I checked that ``javac -source 1.3 -target 1.4'' works with JDK 1.6.
By the way, gcj 3.4.4 is the ``system'' compiler on Cygwin.


18. Tweak the testsuite to allow testing whether Bison plays nice with
generics and other Java version > 1.3 features.  Currently everything
is tested with -source 1.3 hard-coded into ``javacomp.sh'', which is
a good check that no newer language features are introduced (it does not
catch uses of newer library features, of course).

By the way, Bison doesn't play nice with generics: '>' is not allowed
in ``<TYPE>'' by the syntax, and using a generic ``%define stype'' gives
a ``generic array creation'' error, and at least one place needs to
m4-quote commas in ``<TYPE>'' (fails with Map/*String,String*/).
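
The ``generic array creation'' error is what javac reports for an array of
a parameterized type, which is presumably what the value stack becomes with
a generic ``%define stype''.  A sketch of the error and the usual
workaround; GenericStackSketch and newStack() are hypothetical names.

```java
import java.util.Vector;

class GenericStackSketch {
    // Rejected by javac with "generic array creation":
    //   Vector<String>[] stack = new Vector<String>[4];

    // Workaround: create the raw array and cast it, accepting an
    // unchecked warning at this one place.
    @SuppressWarnings("unchecked")
    static Vector<String>[] newStack(int size) {
        return (Vector<String>[]) new Vector[size];
    }
}
```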


19. On Cygwin, ``make check'' fails when using ``java'' because it generates
CR/LF output while autotest uses LF. Maybe do a ``sed -e 's/\r$//''' on the
output?


Di-an Jan



