bison-patches

Re: Bison scanner patch to fix POSIX incompatibilities, etc.


From: Akim Demaille
Subject: Re: Bison scanner patch to fix POSIX incompatibilities, etc.
Date: 06 Nov 2002 10:28:01 +0100
User-agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Honest Recruiter)

| Does anyone know what the C++ rules are with respect to trigraphs,
| digraphs, and backslash-newline?  Does C++ have trigraphs?

Yes, it does.  I don't know what digraphs are in C, but the draft I'm
looking at says that

        <%      {
        %>      }
        <:      [
        :>      ]
        %:      #
        %:%:    ##
        and     &&

etc.
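
For reference, the punctuator digraphs listed above can be captured in a small lookup table. This is only a sketch: `digraph_to_punct` is a hypothetical helper name, and the table deliberately leaves out the alternative tokens such as `and`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map a digraph spelling to its primary
   spelling.  Return NULL when S is not one of the listed digraphs.  */
const char *
digraph_to_punct (const char *s)
{
  static const struct { const char *digraph; const char *punct; } table[] = {
    { "<%", "{" }, { "%>", "}" },
    { "<:", "[" }, { ":>", "]" },
    { "%:", "#" }, { "%:%:", "##" },
  };
  size_t i;

  assert (s != NULL);
  for (i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp (s, table[i].digraph) == 0)
      return table[i].punct;
  return NULL;
}
```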

As for comments, I only found:

        [lex.comment] 2.7 Comments The characters /* start a comment,
        which terminates with the characters */. These comments do not
        nest. The characters // start a comment, which terminates with
        the next newline character. If there is a formfeed or a
        verticaltab character in such a comment, only whitespace
        characters shall appear between it and the newline that
        terminates the comment; no diagnostic is required. [Note: The
        comment characters //, /*, and */ have no special meaning
        within a // comment and are treated just like other
        characters. Similarly, the comment characters // and /* have
        no special meaning within a /* comment. ]



| In hand-written code I have seen it only for strings (and in the
| International Obfuscated C code contest -- I won a prize there once
| :-).  

Congrats :)  I didn't know that.  Puke...

#define C char
#define F X,perror("oops"),1
#define G getchar()
#define I ;if(
#define P putchar
#define Q 256
#define W ;while(
#define X return 0
#include<stdio.h>
long M,N,c,f,m,o,r,s,w;y(l){o^=l;m+=l+1;f=f*2+l+(f>>31&1);}int
O,S,e,i,k,n,q,t[69001];b(g){k=4 W g<k)y(P((C)(w>>--k*8)&255));w=0;}C D[Q*Q],h
[Q*Q];main(g,V)C**V;{I**V-97)X,a()W G-10)W(g=G)+1&&g-'x')if(g-10){I
4<k)b(0)I g>32&g<'v')w=w*85+g-33,++k;else{I
g-'z'|k)F;w=0;k=5;}}W G-78)I scanf("%ld%lx E%lx S%lx R%lx ",&M,&N,&c,&s,&r)-5)F
I M){b(g=3-(M-1&3))W g--)y(0);}I(M-N|c-o|s-m|r-f)&4294967295)F;X;}long
g(){C*p I m<f&n<k&&(m=(1L<<++n)-1)||O>=S){O=0;S=fread(D,1,n,stdin)*8 I
S<8)X-1;S-=n-1;}p=D+O/8;q=O&7;O+=n;X,(1<<8-q)-1&*p>>q|m&((15<n+q)*p[2]*Q|p[1]&
255)<<8-q;}a(){C*p=D+Q;G;G;k=G;e=k>>7&1;k&=31 I k>16)F;w=Q
W w--)t[w]=0,h[w]=w;n=8;f=Q+e;i=o=w=g()I o<0)X,1;P(i)W(w=g())+1){I
w==Q&e){W w--)t[w]=0;m=n=8;f=Q I(w=g())<0)X;}c=w
I w>=f)*p++=i,w=o W w>=Q)*p++=h[w],w=t[w];P(i=h[w])W
p>D+Q)P(*--p)I(w=f)<1L<<k)t[w]=o,h[f++]=i;o=c;}X;}



| > all this is free complications that might make things uselessly more
| > complex when we will "parse" other languages with different lexical
| > structures.
| 
| Languages with different lexical structures will need different
| lexical regimes anyway.  You can't parse Scheme with a C lexer.
| 
| Or perhaps you were thinking we could get away with parsing C++ and/or
| Java with a C lexer?  That is plausible.

Yes, I believe there is enough regularity so that we can go very far
across languages.  And imposing some restriction on the {} content is
not a severe restriction anyway: there should be little code in the
parser, mainly function calls.

| We can certainly construct horrible examples of C statements that no
| Yacc parser could reasonably be expected to parse.  For example:
| 
|   %{
|   #define CLOSE_BRACE }
|   %}
|   %%
|   start: 'x' { { $$ = 0; CLOSE_BRACE } ;

Puke :)

| Or how about this one?
| 
|   %{
|   #define PERCENT_CLOSE_BRACE %}
|   %}
|   %%
|   start: 'x' ;

So you *are* a nasty guy :)

| POSIX says that both these are valid input to Yacc!  Clearly this is a
| bug in the standard, but the question is how far the bug extends.
| 
| How about if we propose the following changes to the POSIX standard:
| 
|  * The two characters `%' and `}' cannot appear adjacent within %{
|    ... %}, other than inside a comment or string literal.

Good.

|  * C-language code must have properly nested occurrences of "{" and
|    "}".  If braces are spelled any other way (e.g., via a macro like
|    CLOSE_BRACE or via a digraph) then the resulting behavior is
|    undefined.

Actually, why make it undefined?  Why not say that such a brace is
simply not recognized?  So that:

|   %{
|   #define CLOSE_BRACE }
|   %}
|   %%
|   start: 'x' { $$ = 0; CLOSE_BRACE } ;

remains Yacc-valid, but maybe C-invalid.
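
A minimal sketch of what "just not recognized" could mean for a scanner, assuming it counts only literal braces and skips comments, string literals, and character constants. The name `find_action_end` is hypothetical, and the sketch ignores trigraphs, digraphs, and backslash-newline entirely; it is not Bison's actual scanner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: SRC points at the `{' that opens an action.
   Return a pointer to the matching `}', or NULL if there is none.
   Only literal braces count; macros, digraphs and trigraphs that
   would expand to braces are deliberately not recognized.  */
const char *
find_action_end (const char *src)
{
  int depth = 0;
  const char *p = src;

  assert (src != NULL && *src == '{');
  while (*p)
    {
      if (p[0] == '/' && p[1] == '*')
        {
          /* Block comment: skip to the closing star-slash.  */
          p += 2;
          while (*p && !(p[0] == '*' && p[1] == '/'))
            p++;
          if (!*p)
            return NULL;
          p += 2;
        }
      else if (p[0] == '/' && p[1] == '/')
        {
          /* Line comment: skip to the newline.  */
          while (*p && *p != '\n')
            p++;
        }
      else if (*p == '"' || *p == '\'')
        {
          /* String literal or character constant.  */
          char quote = *p++;
          while (*p && *p != quote)
            {
              if (*p == '\\' && p[1])
                p++;            /* Skip the escaped character.  */
              p++;
            }
          if (!*p)
            return NULL;
          p++;
        }
      else
        {
          if (*p == '{')
            depth++;
          else if (*p == '}' && --depth == 0)
            return p;
          p++;
        }
    }
  return NULL;
}
```

On the action `{ $$ = 0; CLOSE_BRACE } ;` this returns the literal `}` and leaves CLOSE_BRACE for the C compiler to complain about. On the earlier variant with two opening braces it returns NULL, since only one literal `}` is present.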


|  * Backslash-newline preprocessing does not apply to Yacc comments or
|    literals, or to occurrences of pseudovariables like `$$'.

Yep.

|  * Similarly, trigraph preprocessing does not apply to Yacc comments,
|    literals, or pseudovariables.

Why keep digraphs?

|  * If the removal of a backslash-newline within C-language code would
|    change the boundary of the containing comment, string literal, or
|    character constant, the resulting behavior is undefined.
| 
|  * Similarly, if the replacement of a trigraph by its corresponding
|    single character in C-language code would change the boundary of
|    the containing comment, string literal, or character constant, the
|    resulting behavior is undefined.
| 
| Would that be OK with you?

Sounds good to me.
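
To make the trigraph bullets concrete, here is a sketch of trigraph replacement as a stand-alone pass. The function names are hypothetical; the nine replacements are the ones defined by the C standard. The source avoids spelling any trigraph literally so that it compiles the same way whether or not the compiler has trigraphs enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the character that the trigraph
   formed by two question marks followed by C stands for, or 0
   if that sequence is not a trigraph.  */
char
trigraph_char (char c)
{
  switch (c)
    {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Replace trigraphs in SRC, writing the result to DST, which must
   have room for at least strlen (SRC) + 1 bytes.  Return DST.  */
char *
replace_trigraphs (const char *src, char *dst)
{
  char *out = dst;

  assert (src != NULL && dst != NULL);
  while (*src)
    {
      char r = 0;
      if (src[0] == '?' && src[1] == '?')
        r = trigraph_char (src[2]);
      if (r)
        {
          *out++ = r;
          src += 3;
        }
      else
        *out++ = *src++;
    }
  *out = '\0';
  return dst;
}
```

The boundary problem shows up with the backslash trigraph: in a string literal written as `"a??/"`, replacement turns the last three characters before the closing quote into a backslash, which then escapes that quote, so the literal no longer ends where a scanner that ignores trigraphs thinks it does.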



