help-bison

Re: How to decide what to put in the lexer and the grammar respectively?


From: Peng Yu
Subject: Re: How to decide what to put in the lexer and the grammar respectively?
Date: Sun, 17 Feb 2019 10:36:21 -0600

Sometimes the best implementation is not the obvious one. I think
there are cases in which more of the work should be done in the lexer
rather than in the parser.

Consider the parameter expansion (along with assignment) in bash.

https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html

Bash currently doesn't support nested parameter expansion. For
example, both of the following assignment commands are fine.

- x=${parameter:offset}
- x=${parameter/pattern/string}

But bash does not support something like the following (i.e., extract
the substring first, then do the replacement on that substring):

- x=${{parameter:offset}/pattern/string}

Since bash does not support nested parameter expansion, it tokenizes
the whole thing (e.g., `x=${parameter:offset}` literally) as an
ASSIGNMENT_WORD. Bash then post-processes the extracted word with
handwritten code rather than with another full-fledged lexer/parser
pair. The post-processing code stays manageable because nesting is
not allowed.
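
A minimal sketch of what such handwritten post-processing could look
like for the two non-nested forms above. This is an illustration in
Python, not bash's actual code: the helper name `expand_assignment`
and the `env` dictionary are hypothetical, the offset is assumed to be
a plain integer, and the pattern is treated as a literal string
rather than a glob.

```python
import re

# Hypothetical helper: picks apart an ASSIGNMENT_WORD by hand,
# without a second lexer/parser. Not taken from bash's sources.
_ASSIGN = re.compile(
    r"([A-Za-z_][A-Za-z0-9_]*)=\$\{([A-Za-z_][A-Za-z0-9_]*)(.*)\}\Z", re.S
)

def expand_assignment(word, env):
    """Evaluate e.g. 'x=${parameter:3}' against env and store the result."""
    m = _ASSIGN.match(word)
    if m is None:
        raise ValueError("not a simple parameter-expansion assignment: %r" % word)
    target, param, rest = m.groups()
    value = env.get(param, "")
    if rest == "":
        result = value                                  # x=${parameter}
    elif rest.startswith(":"):
        result = value[int(rest[1:]):]                  # x=${parameter:offset}
    elif rest.startswith("/"):
        pattern, _, string = rest[1:].partition("/")
        result = value.replace(pattern, string, 1)      # first match only
    else:
        raise ValueError("unsupported expansion: %r" % rest)
    env[target] = result
    return result
```

With `env = {"parameter": "hello world"}`, calling
`expand_assignment("x=${parameter:6}", env)` stores `world` in
`env["x"]`. The code stays this short only because the operand is
always a plain name.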

However, if one wants to allow arbitrary nesting, handwritten
post-processing is no longer viable; post-processing with a
lexer/parser is more manageable. For example, once the first lexer has
extracted the whole assignment/parameter-expansion string, the
post-processing lexer can recognize basic tokens like the following,
and a parser can then be built on top of them.

- VARNAME
- `=`
- DOLLOR_LPAR
- `{`
- `}`
- `/`
- `:`
- ...
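
A rough sketch of such a post-processing lexer, written in Python for
brevity rather than as a flex specification. The token names follow
the list above; reading DOLLOR_LPAR as the two characters `${`, the
extra NUMBER token, and the regular expressions themselves are
assumptions.

```python
import re

# Token table for the second-stage lexer. Order matters: "${" must be
# tried before a lone "{" can match as punctuation.
TOKEN_SPEC = [
    ("DOLLOR_LPAR", r"\$\{"),                     # assumed to mean "${"
    ("VARNAME",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",      r"[0-9]+"),                   # assumption: not in the list above
    ("PUNCT",       r"[={}/:]"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(word):
    """Split an already-extracted assignment word into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(word):
        m = MASTER.match(word, pos)
        if m is None:
            raise ValueError("unexpected character %r at offset %d" % (word[pos], pos))
        kind, text = m.lastgroup, m.group()
        # Single-character tokens use the character itself as their kind.
        tokens.append((text if kind == "PUNCT" else kind, text))
        pos = m.end()
    return tokens
```

For `x=${{parameter:3}/a/b}` this yields VARNAME, `=`, DOLLOR_LPAR,
`{`, VARNAME, `:`, NUMBER, `}`, `/`, VARNAME, `/`, VARNAME, `}`. Note
that the pattern and replacement come out as VARNAME here, which a
real lexer would have to distinguish by state.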

But how is the nested parameter expansion assignment recognized in
the first place? The first lexer needs built-in states to match paired
`{` `}`, and states to remember whether it is inside a substring
extraction or a pattern replacement, so that errors are caught at the
lexer level. But once such complex machinery is built into the first
lexer, I don't see the point of using a second lexer and parser for
post-processing, since the same states would be constructed at least
twice, which is a waste.
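
To make the brace-matching part concrete, here is a small sketch of
how a first-stage scanner could find the end of a nested assignment
word with a depth counter. The function name `scan_assignment_word`
is hypothetical, and the sketch tracks only brace depth; it does not
model the substring-extraction versus pattern-replacement states.

```python
def scan_assignment_word(text, start=0):
    """Return the index just past the assignment word starting at text[start].

    Whitespace ends the word only at brace depth 0, so spaces inside
    "${...}" stay part of the word.
    """
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("${", i):
            depth += 1            # opening "${" of an expansion
            i += 2
            continue
        ch = text[i]
        if ch == "{" and depth > 0:
            depth += 1            # nested "{" inside an expansion
        elif ch == "}" and depth > 0:
            depth -= 1
        elif ch.isspace() and depth == 0:
            break                 # separator outside any expansion
        i += 1
    if depth != 0:
        raise ValueError("unbalanced braces in %r" % text[start:i])
    return i
```

A second-stage lexer would then have to rediscover the same brace
structure inside the returned word, which is the duplication
described above.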

In this sense, it seems that nested parameter expansion assignment
should be built into the first lexer rather than the first parser,
and the parser should be left for other, higher-level language
features.

If instead the first lexer doesn't emit ASSIGNMENT_WORD but rather
more basic tokens (as listed above, such as VARNAME ... DOLLOR_LPAR),
then it needs to remember when spaces are allowed (spaces are
separators in other contexts but are not allowed in an assignment). So
the lexer basically needs to be aware of higher-level language
information, which may not be good. In this sense, I think the lexer
should still emit ASSIGNMENT_WORD but do more internal work to handle
the assignment and the nested parameter expansions.
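
As an illustration of that "internal work", the sketch below matches
braces and evaluates the expansion in a single recursive pass over the
hypothetical nested syntax `${{parameter:offset}/pattern/string}`
(which bash does not support). The function name `expand`, the `env`
dictionary, integer-only offsets, and literal (non-glob) patterns are
all assumptions.

```python
def expand(text, i, env):
    """Evaluate one expansion whose body starts at text[i].

    text[i] is the first character after "${" (or after a nested "{").
    Returns (value, index just past the matching "}").
    """
    def peek(k):
        return text[k] if k < len(text) else ""

    if peek(i) == "{":
        # The operand is itself an expansion: recurse.
        value, i = expand(text, i + 1, env)
    else:
        j = i
        while peek(j).isalnum() or peek(j) == "_":
            j += 1
        value, i = env.get(text[i:j], ""), j

    if peek(i) in (":", "/"):
        j = text.index("}", i)      # raises ValueError if "}" is missing
        arg = text[i + 1:j]
        if text[i] == ":":
            value = value[int(arg):]                     # substring extraction
        else:
            pattern, _, string = arg.partition("/")
            value = value.replace(pattern, string, 1)    # first match only
        i = j

    if peek(i) != "}":
        raise ValueError("expected '}' at offset %d" % i)
    return value, i + 1
```

With `env = {"parameter": "hello world"}`, evaluating
`${{parameter:6}/world/there}` from index 2 first extracts `world`
and then replaces it with `there`. The recursion depth plays the role
of the lexer's brace-matching state, so nothing is scanned twice.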

What do you think the best implementation would be in this case?

On Sun, Feb 17, 2019 at 8:44 AM Akim Demaille <address@hidden> wrote:
>
> Hi!
>
> > On 17 Feb 2019, at 14:08, Peng Yu <address@hidden> wrote:
> >
> > Hi,
> >
> > The more I study flex/bison, the more confused I get about which
> > actions to use when, and what to put in the lexer versus the grammar.
>
> Usually it's quite clear: build words from letters in the scanner,
> build sentences from words in the parser.  By "words", I mean
> numbers, identifiers, keywords, strings, operators, etc.
>
> > For example, for an assignment,
> >
> > x=10
> >
> > it can be processed by the lexer,
> >
> > [[:alpha:]_][[:alnum:]_]*=[[:digit:]]+  { /* parse yytext to get the
> > name and value, then do the assignment */ }
>
> If your language is really simple and you can live with that,
> then you can.  If your language has a rich structure, then it would
> be silly.
>
> Note that currently you don't support white space (including
> \n), you don't support comments (in C, I can write "x /* var */
> = /* value */ 42"), and then you will have to traverse _again_
> your "assignment" to find the lhs and rhs.
>
> Clearly, define what is an identifier in the scanner, then
> define their uses in the parser.
>
>
> > There just seem to be too many possible ways of implementations.
>
> As is often the case in computer science :)  But here, there is
> a well-established and well-known best practice.
>
> Cheers!



-- 
Regards,
Peng


