From: Akim Demaille
Subject: Re: Bison scanner patch to fix POSIX incompatibilities, etc.
Date: 06 Nov 2002 10:28:01 +0100
User-agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Honest Recruiter)
| Does anyone know what the C++ rules are with respect to trigraphs,
| digraphs, and backslash-newline? Does C++ have trigraphs?
Yes, it does. I don't know what the digraphs are in C, but the draft
I'm looking at lists these alternative tokens:
    <%      {
    %>      }
    <:      [
    :>      ]
    %:      #
    %:%:    ##
    and     &&
    etc.
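To make the table concrete, here is a minimal sketch of a function spelled entirely with digraphs. It assumes a compiler with digraph support (C95 or later, or C++); the name sum_first_three is made up for the example.

```c
/* Every brace and bracket below is written as a digraph:
   <% %> stand for { } and <: :> stand for [ ]. */
int sum_first_three(void)
<%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:0:> + a<:1:> + a<:2:>;
%>
```

A scanner that counts only the literal characters `{` and `}` finds no braces at all in this function, which is why digraphs matter for the brace-matching rules discussed in this thread.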
As for comments, I only found:

    [lex.comment] 2.7 Comments

    The characters /* start a comment, which terminates with the
    characters */. These comments do not nest. The characters //
    start a comment, which terminates with the next newline
    character. If there is a formfeed or a verticaltab character in
    such a comment, only whitespace characters shall appear between
    it and the newline that terminates the comment; no diagnostic is
    required. [Note: The comment characters //, /*, and */ have no
    special meaning within a // comment and are treated just like
    other characters. Similarly, the comment characters // and /*
    have no special meaning within a /* comment. ]

| In hand-written code I have seen it only for strings (and in the
| International Obfuscated C code contest -- I won a prize there once
| :-).
Congrats :) I didn't know that. Puke...
#define C char
#define F X,perror("oops"),1
#define G getchar()
#define I ;if(
#define P putchar
#define Q 256
#define W ;while(
#define X return 0
#include<stdio.h>
long M,N,c,f,m,o,r,s,w;y(l){o^=l;m+=l+1;f=f*2+l+(f>>31&1);}int
O,S,e,i,k,n,q,t[69001];b(g){k=4 W g<k)y(P((C)(w>>--k*8)&255));w=0;}C D[Q*Q],h
[Q*Q];main(g,V)C**V;{I**V-97)X,a()W G-10)W(g=G)+1&&g-'x')if(g-10){I
4<k)b(0)I g>32&g<'v')w=w*85+g-33,++k;else{I
g-'z'|k)F;w=0;k=5;}}W G-78)I scanf("%ld%lx E%lx S%lx R%lx ",&M,&N,&c,&s,&r)-5)F
I M){b(g=3-(M-1&3))W g--)y(0);}I(M-N|c-o|s-m|r-f)&4294967295)F;X;}long
g(){C*p I m<f&n<k&&(m=(1L<<++n)-1)||O>=S){O=0;S=fread(D,1,n,stdin)*8 I
S<8)X-1;S-=n-1;}p=D+O/8;q=O&7;O+=n;X,(1<<8-q)-1&*p>>q|m&((15<n+q)*p[2]*Q|p[1]&
255)<<8-q;}a(){C*p=D+Q;G;G;k=G;e=k>>7&1;k&=31 I k>16)F;w=Q
W w--)t[w]=0,h[w]=w;n=8;f=Q+e;i=o=w=g()I o<0)X,1;P(i)W(w=g())+1){I
w==Q&e){W w--)t[w]=0;m=n=8;f=Q I(w=g())<0)X;}c=w
I w>=f)*p++=i,w=o W w>=Q)*p++=h[w],w=t[w];P(i=h[w])W
p>D+Q)P(*--p)I(w=f)<1L<<k)t[w]=o,h[f++]=i;o=c;}X;}
| > all this is gratuitous complication that might make things
| > needlessly more complex when we "parse" other languages with
| > different lexical structures.
|
| Languages with different lexical structures will need different
| lexical regimes anyway. You can't parse Scheme with a C lexer.
|
| Or perhaps you were thinking we could get away with parsing C++ and/or
| Java with a C lexer? That is plausible.
Yes, I believe there is enough regularity that we can go very far
across languages. And imposing some restriction on the {} content is
not a severe restriction anyway: there should be little code in the
parser, mainly function calls.
| We can certainly construct horrible examples of C statements that no
| Yacc parser could reasonably be expected to parse. For example:
|
| %{
| #define CLOSE_BRACE }
| %}
| %%
| start: 'x' { { $$ = 0; CLOSE_BRACE } ;
Puke :)
| Or how about this one?
|
| %{
| #define PERCENT_CLOSE_BRACE %}
| %}
| %%
| start: 'x' ;
So you *are* a nasty guy :)
| POSIX says that both these are valid input to Yacc! Clearly this is a
| bug in the standard, but the question is how far the bug extends.
|
| How about if we propose the following changes to the POSIX standard:
|
| * The two characters `%' and `}' cannot appear adjacent within %{
| ... %}, other than inside a comment or string literal.
Good.
| * C-language code must have properly nested occurrences of "{" and
| "}". If braces are spelled any other way (e.g., via a macro like
| CLOSE_BRACE or via a digraph) then the resulting behavior is
| undefined.
Actually, why make it undefined? Why not say that it is just not
recognized? So that:
| %{
| #define CLOSE_BRACE }
| %}
| %%
| start: 'x' { $$ = 0; CLOSE_BRACE } ;
remains Yacc-valid, but maybe C-invalid.
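
The "just not recognized" reading can be sketched as a scanner that counts only literal braces and ignores anything inside comments, string literals, and character constants. This is an illustration under those assumptions, not Bison's actual scanner; match_brace is a hypothetical name, and macros, digraphs, trigraphs, and backslash-newline are deliberately left uninterpreted.

```c
#include <stddef.h>

/* Return the index of the '}' that closes the '{' at s[0], or -1 if
   there is none.  Only literal braces outside comments, string
   literals, and character constants are counted. */
long match_brace(const char *s)
{
    long depth = 0;
    size_t i = 0;

    while (s[i] != '\0') {
        char c = s[i];
        if (c == '/' && s[i + 1] == '*') {
            /* Block comment: skip to the closing star-slash. */
            i += 2;
            while (s[i] != '\0' && !(s[i] == '*' && s[i + 1] == '/'))
                i++;
            if (s[i] == '\0')
                return -1;          /* unterminated comment */
            i += 2;
        } else if (c == '/' && s[i + 1] == '/') {
            /* Line comment: skip to the end of the line. */
            while (s[i] != '\0' && s[i] != '\n')
                i++;
        } else if (c == '"' || c == '\'') {
            /* String literal or character constant: skip to the
               matching quote, stepping over backslash escapes. */
            i++;
            while (s[i] != '\0' && s[i] != c) {
                if (s[i] == '\\' && s[i + 1] != '\0')
                    i++;
                i++;
            }
            if (s[i] == '\0')
                return -1;          /* unterminated literal */
            i++;
        } else {
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0)
                    return (long) i;
                if (depth < 0)
                    return -1;      /* '}' before any '{' */
            }
            i++;
        }
    }
    return -1;                      /* ran out of input */
}
```

With the action `{ $$ = 0; CLOSE_BRACE }` the function returns the index of the final `}`, so the grammar stays Yacc-valid even though the generated C would not compile. With the earlier `{ { $$ = 0; CLOSE_BRACE }` it returns -1, because the macro is never treated as a brace.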
| * Backslash-newline preprocessing does not apply to Yacc comments or
| literals, or to occurrences of pseudovariables like `$$'.
Yep.
| * Similarly, trigraph preprocessing does not apply to Yacc comments,
| literals, or pseudovariables.
Why keep digraphs?
| * If the removal of a backslash-newline within C-language code would
| change the boundary of the containing comment, string literal, or
| character constant, the resulting behavior is undefined.
|
| * Similarly, if the replacement of a trigraph by its corresponding
| single character in C-language code would change the boundary of
| the containing comment, string literal, or character constant, the
| resulting behavior is undefined.
|
| Would that be OK with you?
Sounds good to me.
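
To see how removing a backslash-newline can move a comment boundary, here is a small sketch of the line-splicing step that a C compiler performs before it recognizes comments. The function name splice_lines and the caller-supplied buffer are assumptions made for the example.

```c
#include <stddef.h>

/* Copy `in` to `out`, deleting every backslash-newline pair.
   `out` must have room for strlen(in) + 1 bytes.  Returns the
   number of characters written, not counting the final NUL. */
size_t splice_lines(const char *in, char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        if (in[0] == '\\' && in[1] == '\n')
            in += 2;                /* drop the pair entirely */
        else
            out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

Given the input `/`, backslash, newline, `* { *`, backslash, newline, `/`, the result is the seven characters `/* { */`. Before splicing, a scanner sees a stray `{`; after splicing, the compiler sees a comment. That mismatch is the case the proposal above leaves undefined.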