
[L-lang-devel] New parser generator committed


From: Matthieu Lemerre
Subject: [L-lang-devel] New parser generator committed
Date: Wed, 09 May 2007 23:27:08 +0200
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.95 (gnu/linux)


L contains a new parser generator (or "compiler compiler"): a
lexerless LL(1) (in the future, LL(*)) parser.  Its role is to build a
syntax tree directly from a sequence of characters.

It is rather simple for now, but powerful enough to parse its own
grammar. It can be used as an efficient lexer, but also as a complete
parser.

It is designed along the "simplicity, readability, efficiency, and
expressivity" rules that are L's motto.

It is simple because it converts the parser description into the
recursive descent parser you would have written if you had to write a
lexerless parser by hand; but it saves you all the tedious work of
computing the "head and follow sets" of the different rules.

It is efficient because semantic values are built incrementally, by
contrast with the common practice of "first parsing, then building the
value". 

For instance, the decimal number parser is written that way:

  grammar decimal_number = {
    rule Digit:Int = <d:[0-9]> { d - Character( '0')} ;
    rule Unsigned_Number = <n:{0}> (<d:Digit> {n*=10; n+=d})* {n} ;
  }
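To illustrate, here is the kind of recursive-descent code such a
grammar conceptually compiles to -- a hand-written Python sketch of my
own, not the generator's actual output.  Note how the semantic value n
is updated as each digit is matched, instead of buffering the text and
converting it afterwards:

```python
def parse_digit(s, i):
    # rule Digit: match one character in [0-9], return its numeric value
    if i < len(s) and '0' <= s[i] <= '9':
        return ord(s[i]) - ord('0'), i + 1
    raise SyntaxError("expected [0-9] at position %d" % i)

def parse_unsigned_number(s, i=0):
    n = 0                                      # <n:{0}>
    while i < len(s) and '0' <= s[i] <= '9':   # Kleene star: loop while the next
        d, i = parse_digit(s, i)               # character is in Digit's head set
        n = n * 10 + d                         # {n*=10; n+=d}
    return n, i

print(parse_unsigned_number("123t135"))        # (123, 3): stops at 't'
```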

and the hexadecimal one is:

  grammar hexadecimal_number = {
    rule Hexa_Digit = (<d:[0-9]> { d - Character( '0')}
                       | <d:[a-f]> { d - Character( 'a') + 10}) ;
    rule Hexa_Number = <n:{0}> (<d:Hexa_Digit> { n = n << 4; n += d })* {n} ;
  }
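Again as a hypothetical Python sketch (not the generator's output),
the interesting part is Hexa_Digit's alternative: the branch to take
is chosen by testing the next character against each branch's head
set:

```python
def parse_hexa_digit(s, i):
    c = s[i] if i < len(s) else ''
    if '0' <= c <= '9':                 # branch 1: <d:[0-9]>
        return ord(c) - ord('0'), i + 1
    if 'a' <= c <= 'f':                 # branch 2: <d:[a-f]>
        return ord(c) - ord('a') + 10, i + 1
    raise SyntaxError("expected [0-9a-f] at position %d" % i)

def parse_hexa_number(s, i=0):
    n = 0                               # <n:{0}>
    while i < len(s) and ('0' <= s[i] <= '9' or 'a' <= s[i] <= 'f'):
        d, i = parse_hexa_digit(s, i)
        n = (n << 4) + d                # { n = n << 4; n += d }
    return n, i

print(parse_hexa_number("1f"))          # (31, 2)
```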

The returned value is built incrementally; there is no need to process
the matched text after the parse, as lex does.  Also, the matched text
is never stored in a temporary buffer.

I also think that lexerless parsing is more efficient.  The reason is
that a lexer/parser pair has to make more tests: the lexer is not
aware of the parser's context (so it does not know which characters
may lie ahead), and thus makes more tests; and the parser then tests a
second time the value returned by the lexer.

The general idea of the parser comes from Henry Baker's META
(http://home.pipeline.com/~hbaker1/Prag-Parse.html), although the
parser presented here is more powerful, because it allows recursive
composition of rules.

Now let's see how it works: a grammar is composed of a set of named
grammar rules, each of which is defined by combining the base grammar
rules below.

The different base grammar rules are:

- The character set: works like regexp character sets. For instance,
  [f-i_] matches either 'f', 'g', 'h', 'i', or '_'.

- The sequence: the first rule must be matched, then the second
  rule. For instance, [a] [b] matches "ab" and [ad] [b] [c] matches
  "abc" or "dbc".

- The rule name: a rule can call another rule in the same grammar,
  including itself.  There are some restrictions due to the nature of
  the parser, though: the grammar cannot be left-recursive, which
  basically means that a rule definition cannot start with a recursive
  call.

For instance, rule Twodigit = Digit Digit matches "23"; but rule
Twodigit = Twodigit Digit is impossible.
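To see why, consider a naive Python translation of the left-recursive
rule (my own sketch, not generator output): the rule's first action is
to call itself without having consumed any input, so the descent never
terminates:

```python
def parse_digit(s, i):
    if i < len(s) and '0' <= s[i] <= '9':
        return ord(s[i]) - ord('0'), i + 1
    raise SyntaxError("expected [0-9]")

def parse_twodigit(s, i=0):
    # rule Twodigit = Twodigit Digit, translated naively: the first
    # thing the rule does is call itself at the same input position.
    v, i = parse_twodigit(s, i)         # infinite descent
    d, i = parse_digit(s, i)            # never reached
    return v * 10 + d, i

try:
    parse_twodigit("23")
except RecursionError:
    print("left recursion never consumes input")
```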

- A semantic action: once the previous rule has matched, the content
  of the semantic action is executed.  However, one should not create
  variables inside semantic actions; use the "grouping" rule for that.

- A grouping: a grouping allows declaring variables and storing
  matched values into them.  For instance, <d:[0-9]> stores the
  matched character into d; <n:{0}> initializes n to 0.  n can then be
  used in subsequent semantic actions.

- An alternative: allows parsing to proceed differently according to
  the following characters.  For instance,

     ([a-f] {print("TOTO\n");} | [g-k] {print("TATA");});

will print TOTO if the next character is in [a-f]; TATA if it is in
[g-k]; else it will throw a parse error.

- A Kleene star: applies the rule as long as the next character is in
  the head set of the rule.  For instance,

     (Digit {print("DIGIT\n")})*

will print DIGIT\nDIGIT\nDIGIT\n when given "123t135" as input (it
stops at the 't').
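The base rules above compose naturally.  Here is a rough
combinator-style sketch of my own in Python (a hypothetical
illustration, not L's implementation): each rule is a pair of a head
set and a parse function taking (input, position) and returning
(value, new position), and the alternative and the Kleene star both
dispatch on the head sets:

```python
def charset(chars):                       # character set, e.g. [f-i_]
    head = set(chars)
    def parse(s, i):
        if i < len(s) and s[i] in head:
            return s[i], i + 1
        raise SyntaxError("expected one of %r at %d" % (sorted(head), i))
    return head, parse

def seq(*rules):                          # sequence: match each rule in turn
    head = rules[0][0]                    # head set = head of the first rule
    def parse(s, i):                      # (assuming it cannot match "")
        vals = []
        for _, p in rules:
            v, i = p(s, i)
            vals.append(v)
        return vals, i
    return head, parse

def alt(*rules):                          # alternative: pick a branch by
    head = set().union(*(h for h, _ in rules))  # testing the next character
    def parse(s, i):
        for h, p in rules:
            if i < len(s) and s[i] in h:
                return p(s, i)
        raise SyntaxError("no alternative matches at %d" % i)
    return head, parse

def star(rule):                           # Kleene star: repeat while the next
    h, p = rule                           # character is in the rule's head set
    def parse(s, i):
        vals = []
        while i < len(s) and s[i] in h:
            v, i = p(s, i)
            vals.append(v)
        return vals, i
    return h, parse

# [ad] [b] [c] matches "abc" or "dbc":
_, p = seq(charset("ad"), charset("b"), charset("c"))
print(p("dbc", 0))                        # (['d', 'b', 'c'], 3)
```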

The others are abbreviations:

- Rule? is equivalent to (Rule | )

- Rule+ is equivalent to Rule Rule*

- "aoeu" is equivalent to [a] [o] [e] [u] for now.

One last thing to know: the last value computed by a grammar rule is
the value returned by the scanner.  Also, in the parser grammar, the
sequence operator ' ' binds more tightly than |, but more loosely than
*, ?, and +.  So most alternatives must be enclosed in parentheses,
and so must repeated (*, +) and optional (?) rules.


I will update the parser to make it LL(*).  Most likely, I will extend
the head-set computation and introduce two new operators: ->, meaning
"followed by", and -->, meaning "followed by, don't advance".  Their
meaning is like that of the sequence operator ' ', except in head-set
computation, where they mean that all the characters have to be
matched to enter the rule.

The difference between -> and --> is that the characters after the
first --> will be parsed twice.

Other extensions may include semantic and syntactic predicates, as
well as passing values from the calling rule to the called rule.
Another one would be error handling, to continue parsing despite an
error; people seem to like that, even if I think that the interactive
nature of L makes it unnecessary.


I'm now going to use (and maybe extend) this parser to parse the whole
L grammar.




