Bison scanner patch to fix POSIX incompatibilities, etc.

bison-patches

From:	Paul Eggert
Subject:	Bison scanner patch to fix POSIX incompatibilities, etc.
Date:	Sun, 3 Nov 2002 00:49:18 -0800 (PST)
I installed the following (unfortunately lengthy) patch to fix several
minor bugs with the Bison scanner.  For example, the scanner
mishandled backslash-newline in C actions, and it miscounted columns
and lines in several circumstances.  While I was at it, I documented
the scanner a bit better (e.g., I documented that it doesn't do
trigraphs).
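
As a rough illustration of the column counting described above, here is
a minimal sketch that is not code from the patch.  The names
"struct position" and "advance_position" are hypothetical, and the
sketch assumes that every byte other than tab and newline occupies one
column, whereas the patch measures such runs with mbsnwidth.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type, not part of the patch: a 1-based position.  */
struct position { int line; int column; };

/* Hypothetical helper, not part of the patch: advance POS over the
   first SIZE bytes of TEXT.  Assumption: each byte other than tab and
   newline is one column wide.  */
static struct position
advance_position (struct position pos, char const *text, size_t size)
{
  size_t i;

  /* Lines and columns are counted from 1.  */
  assert (pos.line >= 1 && pos.column >= 1);

  for (i = 0; i < size; i++)
    switch (text[i])
      {
      case '\n':
        pos.line++;
        pos.column = 1;
        break;

      case '\t':
        /* Move to the next tab stop; stops are at columns 1, 9, 17, ...  */
        pos.column += 8 - ((pos.column - 1) & 7);
        break;

      default:
        pos.column++;
        break;
      }

  return pos;
}
```

Starting at line 1, column 1, the three bytes "a", tab, "b" leave the
position at line 1, column 10: the tab moves the column from 2 to the
next tab stop at 9, and "b" advances it once more.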


2002-11-03  Paul Eggert  <address@hidden>

        * src/scan-gram.l: Revamp to fix POSIX incompatibilities,
        to count columns correctly, and to check for invalid inputs.
        
        Use mbsnwidth to count columns correctly.  Account for tabs, too.
        Include mbswidth.h.
        (YY_USER_ACTION): Invoke extend_location rather than LOCATION_COLUMNS.
        (extend_location): New function.
        (YY_LINES): Remove.

        Handle CRLF in C code rather than in Lex code.
        (YY_INPUT): New macro.
        (no_cr_read): New function.

        Scan UCNs, even though we don't fully handle them yet.
        (convert_ucn_to_byte): New function.

        Handle backslash-newline correctly in C code.
        (SC_LINE_COMMENT, SC_YACC_COMMENT): New states.
        (eols, blanks): Remove.  YY_USER_ACTION now counts newlines etc.;
        all uses changed.
        (tag, splice): New EREs.  Do not allow NUL or newline in tags.
        Use {splice} wherever C allows backslash-newline.
        YY_STEP after space, newline, vertical-tab.
        ("/*"): BEGIN SC_YACC_COMMENT, not yy_push_state (SC_COMMENT).
        
        (letter, id): Don't assume ASCII; e.g., spell out a-z.

        ({int}, handle_action_dollar, handle_action_at): Check for integer
        overflow.
        
        (YY_STEP): Omit trailing semicolon, so that it's more like C.

        (<SC_ESCAPED_STRING,SC_ESCAPED_CHARACTER>): Allow \0 and \00
        as well as \000.  Check for UCHAR_MAX, not 255.
        Allow \x with an arbitrary positive number of digits, as in C.
        Check for overflow here.
        Allow \? and UCNs, for compatibility with C.

        (handle_symbol_code_dollar): Use quote_n slot 1 to avoid collision
        with quote slot used by complain_at.

        * tests/input.at: Add tests for backslash-newline, m4 quotes
        in symbols, long literals, and funny escapes in strings.

        * configure.ac (jm_PREREQ_MBSWIDTH): Add.
        * lib/Makefile.am (libbison_a_SOURCES): Add mbswidth.h, mbswidth.c.
        * lib/mbswidth.h, lib/mbswidth.c: New files, from GNU gettext.
        * m4/Makefile.am (EXTRA_DIST): Add mbswidth.m4.
        * m4/mbswidth.m4: New file, from GNU coreutils.

        * doc/bison.texinfo (Grammar Outline): Document // comments.
        (Symbols): Document that trigraphs have no special meaning in Bison,
        nor is backslash-newline allowed.
        (Actions): Document that trigraphs have no special meaning.

        * src/location.h (LOCATION_COLUMNS, LOCATION_LINES): Remove;
        no longer used.
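
The integer overflow check mentioned in the ChangeLog entry for {int}
can be sketched in isolation as follows.  This is an illustration
under stated assumptions, not code from the patch: the function name
"parse_decimal_int" and its return convention are hypothetical, and
the caller is assumed to pass a nonempty string of decimal digits, as
the scanner's {int} pattern guarantees.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: convert the decimal
   digit string TEXT to an int.  On success store the value in *RESULT
   and return 1.  If the value does not fit in an int, store INT_MAX in
   *RESULT and return 0.  */
static int
parse_decimal_int (char const *text, int *result)
{
  unsigned long num;

  assert (text != NULL && result != NULL);

  /* strtoul reports overflow of unsigned long only through errno, so
     clear errno first and test it afterwards.  */
  errno = 0;
  num = strtoul (text, 0, 10);

  if (INT_MAX < num || errno)
    {
      *result = INT_MAX;
      return 0;
    }

  *result = (int) num;
  return 1;
}
```

Both tests are needed: the comparison with INT_MAX catches values that
fit in unsigned long but not in int, and the errno test catches values
too large even for unsigned long.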

Index: configure.ac
===================================================================
RCS file: /cvsroot/bison/bison/configure.ac,v
retrieving revision 1.19
diff -p -u -r1.19 configure.ac
--- configure.ac        25 Oct 2002 06:56:26 -0000      1.19
+++ configure.ac        3 Nov 2002 08:33:21 -0000
@@ -89,6 +89,7 @@ AC_REPLACE_FUNCS(memchr memrchr \
                  strchr stpcpy strrchr strspn strtol)
 AC_FUNC_MALLOC
 AC_FUNC_REALLOC
+jm_PREREQ_MBSWIDTH
 jm_PREREQ_QUOTEARG
 jm_FUNC_ARGMATCH
 jm_PREREQ_ERROR
Index: doc/bison.texinfo
===================================================================
RCS file: /cvsroot/bison/bison/doc/bison.texinfo,v
retrieving revision 1.73
diff -p -u -r1.73 bison.texinfo
--- doc/bison.texinfo   23 Oct 2002 05:26:32 -0000      1.73
+++ doc/bison.texinfo   3 Nov 2002 08:33:23 -0000
@@ -2212,6 +2212,8 @@ appropriate delimiters:
 @end example
 
 Comments enclosed in @samp{/* @dots{} */} may appear in any of the sections.
+As a @acronym{GNU} extension, @samp{//} introduces a comment that
+continues until end of line.
 
 @menu
 * Prologue::          Syntax and usage of the prologue.
@@ -2360,7 +2362,9 @@ All the usual escape sequences used in c
 used in Bison as well, but you must not use the null character as a
 character literal because its numeric code, zero, signifies
 end-of-input (@pxref{Calling Convention, ,Calling Convention
-for @code{yylex}}).
+for @code{yylex}}).  Also, unlike standard C, trigraphs have no
+special meaning in Bison character literals, nor is backslash-newline
+allowed.
 
 @item
 @cindex string token
@@ -2387,9 +2391,10 @@ does not enforce this convention, but if
 read your program will be confused.
 
 All the escape sequences used in string literals in C can be used in
-Bison as well.  A literal string token must contain two or more
-characters; for a token containing just one character, use a character
-token (see above).
+Bison as well.  However, unlike Standard C, trigraphs have no special
+meaning in Bison string literals, nor is backslash-newline allowed.  A
+literal string token must contain two or more characters; for a token
+containing just one character, use a character token (see above).
 @end itemize
 
 How you choose to write a terminal symbol has no effect on its
@@ -2691,7 +2696,13 @@ is to compute a semantic value for the g
 semantic values associated with tokens or smaller groupings.
 
 An action consists of C statements surrounded by braces, much like a
-compound statement in C@.  It can be placed at any position in the rule;
+compound statement in C@.  An action can contain any sequence of C
+statements.  Bison does not look for trigraphs, though, so if your C
+code uses trigraphs you should ensure that they do not affect the
+nesting of braces or the boundaries of comments, strings, or character
+literals.
+
+An action can be placed at any position in the rule;
 it is executed at that position.  Most rules have just one action at the
 end of the rule, following all the components.  Actions in the middle of
 a rule are tricky and used only for special purposes (@pxref{Mid-Rule
Index: lib/Makefile.am
===================================================================
RCS file: /cvsroot/bison/bison/lib/Makefile.am,v
retrieving revision 1.33
diff -p -u -r1.33 Makefile.am
--- lib/Makefile.am     20 Oct 2002 06:29:41 -0000      1.33
+++ lib/Makefile.am     3 Nov 2002 08:33:23 -0000
@@ -34,6 +34,7 @@ libbison_a_SOURCES = \
   basename.c dirname.h dirname.c \
   getopt.h getopt.c getopt1.c \
   hash.h hash.c \
+  mbswidth.h mbswidth.c \
   quote.h quote.c quotearg.h quotearg.c \
   subpipe.h subpipe.c unlocked-io.h \
   xalloc.h xmalloc.c xstrdup.c xstrndup.c \
Index: m4/Makefile.am
===================================================================
RCS file: /cvsroot/bison/bison/m4/Makefile.am,v
retrieving revision 1.22
diff -p -u -r1.22 Makefile.am
--- m4/Makefile.am      22 Oct 2002 04:38:11 -0000      1.22
+++ m4/Makefile.am      3 Nov 2002 08:33:23 -0000
@@ -1,6 +1,6 @@
 ## Process this file with automake to produce Makefile.in -*-Makefile-*-
 EXTRA_DIST = \
   dmalloc.m4 error.m4 \
-  m4.m4 mbrtowc.m4 memcmp.m4 \
+  m4.m4 mbrtowc.m4 mbswidth.m4 memcmp.m4 \
   prereq.m4 stdbool.m4 subpipe.m4 timevar.m4 warning.m4 \
   gettext.m4 iconv.m4 lib-ld.m4 lib-link.m4 lib-prefix.m4 progtest.m4
Index: src/location.h
===================================================================
RCS file: /cvsroot/bison/bison/src/location.h,v
retrieving revision 1.3
diff -p -u -r1.3 location.h
--- src/location.h      9 Jul 2002 16:24:57 -0000       1.3
+++ src/location.h      3 Nov 2002 08:33:23 -0000
@@ -40,20 +40,6 @@ do {                                         \
   (Loc).last_column =  (Loc).last_line = 1;    \
 } while (0)
 
-/* Advance of NUM columns. */
-# define LOCATION_COLUMNS(Loc, Num)            \
-do {                                           \
-  (Loc).last_column += Num;                    \
-} while (0)
-
-
-/* Advance of NUM lines. */
-# define LOCATION_LINES(Loc, Num)              \
-do {                                           \
-  (Loc).last_column = 1;                       \
-  (Loc).last_line += Num;                      \
-} while (0)
-
 
 /* Restart: move the first cursor to the last position. */
 # define LOCATION_STEP(Loc)                    \
Index: src/scan-gram.l
===================================================================
RCS file: /cvsroot/bison/bison/src/scan-gram.l,v
retrieving revision 1.29
diff -p -u -r1.29 scan-gram.l
--- src/scan-gram.l     21 Oct 2002 05:30:50 -0000      1.29
+++ src/scan-gram.l     3 Nov 2002 08:33:24 -0000
@@ -24,6 +24,7 @@
 
 %{
 #include "system.h"
+#include "mbswidth.h"
 #include "complain.h"
 #include "quote.h"
 #include "getargs.h"
@@ -39,9 +40,95 @@ do {                                         \
   if (yycontrol) {;};                          \
 } while (0)
 
-#define YY_USER_ACTION  LOCATION_COLUMNS (*yylloc, yyleng);
-#define YY_LINES        LOCATION_LINES (*yylloc, yyleng);
-#define YY_STEP         LOCATION_STEP (*yylloc);
+#define YY_USER_ACTION  extend_location (yylloc, yytext, yyleng);
+#define YY_STEP         LOCATION_STEP (*yylloc)
+
+#define YY_INPUT(buf, result, size) ((result) = no_cr_read (yyin, buf, size))
+
+
+/* Read bytes from FP into buffer BUF of size SIZE.  Return the
+   number of bytes read.  Remove '\r' from input, treating \r\n
+   and isolated \r as \n.  */
+
+static size_t
+no_cr_read (FILE *fp, char *buf, size_t size)
+{
+  size_t s = fread (buf, 1, size, fp);
+  if (s)
+    {
+      char *w = memchr (buf, '\r', s);
+      if (w)
+       {
+         char const *r = ++w;
+         char const *lim = buf + s;
+
+         for (;;)
+           {
+             /* Found an '\r'.  Treat it like '\n', but ignore any
+                '\n' that immediately follows.  */
+             w[-1] = '\n';
+             if (r == lim)
+               {
+                 int ch = getc (fp);
+                 if (ch != '\n' && ungetc (ch, fp) != ch)
+                   break;
+               }
+             else if (*r == '\n')
+               r++;
+
+             /* Copy until the next '\r'.  */
+             do
+               {
+                 if (r == lim)
+                   return w - buf;
+               }
+             while ((*w++ = *r++) != '\r');
+           }
+
+         return w - buf;
+       }
+    }
+
+  return s;
+}
+
+
+/* Extend *LOC to account for token TOKEN of size SIZE.  */
+
+static void
+extend_location (location_t *loc, char const *token, int size)
+{
+  int line = loc->last_line;
+  int column = loc->last_column;
+  char const *p0 = token;
+  char const *p = token;
+  char const *lim = token + size;
+
+  for (p = token; p < lim; p++)
+    switch (*p)
+      {
+      case '\r':
+       /* \r shouldn't survive no_cr_read.  */
+       abort ();
+
+      case '\n':
+       line++;
+       column = 1;
+       p0 = p + 1;
+       break;
+
+      case '\t':
+       column += mbsnwidth (p0, p - p0, 0);
+       column += 8 - ((column - 1) & 7);
+       p0 = p + 1;
+       break;
+      }
+
+  loc->last_line = line;
+  loc->last_column = column + mbsnwidth (p0, p - p0, 0);
+}
+
+
 
 /* STRING_OBSTACK -- Used to store all the characters that we need to
    keep (to construct ID, STRINGS etc.).  Use the following macros to
@@ -91,17 +178,26 @@ static void handle_dollar (braced_code_t
                           char *cp, location_t location);
 static void handle_at (braced_code_t code_kind,
                       char *cp, location_t location);
+static int convert_ucn_to_byte (char const *hex_text);
 
 %}
-%x SC_COMMENT
+%x SC_COMMENT SC_LINE_COMMENT SC_YACC_COMMENT
 %x SC_STRING SC_CHARACTER
 %x SC_ESCAPED_STRING SC_ESCAPED_CHARACTER
 %x SC_BRACED_CODE SC_PROLOGUE SC_EPILOGUE
 
-id      [.a-zA-Z_][.a-zA-Z_0-9]*
+letter  [.abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_]
+id      {letter}({letter}|[0-9])*
 int     [0-9]+
-eols     (\n|\r|\n\r|\r\n)+
-blanks   [ \t\f]+
+
+/* POSIX says that a tag must be both an id and a C union member, but
+   historically almost any character is allowed in a tag.  We disallow
+   NUL and newline, as this simplifies our implementation.  */
+tag     [^\0\n>]+
+
+/* Zero or more instances of backslash-newline.  Following GCC, allow
+   white space between the backslash and the newline.  */
+splice  (\\[ \f\t\v]*\n)*
 
 %%
 %{
@@ -136,7 +232,7 @@ blanks   [ \t\f]+
   "%nterm"                return PERCENT_NTERM;
   "%output"               return PERCENT_OUTPUT;
   "%parse-param"          return PERCENT_PARSE_PARAM;
-  "%prec"                 { rule_length--; return PERCENT_PREC; }
+  "%prec"                 rule_length--; return PERCENT_PREC;
   "%printer"              return PERCENT_PRINTER;
   "%pure"[-_]"parser"     return PERCENT_PURE_PARSER;
   "%right"                return PERCENT_RIGHT;
@@ -152,20 +248,31 @@ blanks   [ \t\f]+
   "%yacc"                 return PERCENT_YACC;
 
   "="                     return EQUAL;
-  ":"                     { rule_length = 0; return COLON; }
-  "|"                     { rule_length = 0; return PIPE; }
+  ":"                     rule_length = 0; return COLON;
+  "|"                     rule_length = 0; return PIPE;
   ","                     return COMMA;
   ";"                     return SEMICOLON;
 
-  {eols}      YY_LINES; YY_STEP;
-  {blanks}    YY_STEP;
+  [ \f\n\t\v]+  YY_STEP;
+
   {id}        {
     yylval->symbol = symbol_get (yytext, *yylloc);
     rule_length++;
     return ID;
   }
 
-  {int}       yylval->integer = strtol (yytext, 0, 10); return INT;
+  {int} {
+    unsigned long num;
+    errno = 0;
+    num = strtoul (yytext, 0, 10);
+    if (INT_MAX < num || errno)
+      {
+       complain_at (*yylloc, _("%s is invalid"), yytext);
+       num = INT_MAX;
+      }
+    yylval->integer = num;
+    return INT;
+  }
 
   /* Characters.  We don't check there is only one.  */
   "'"         YY_OBS_GROW; yy_push_state (SC_ESCAPED_CHARACTER);
@@ -174,7 +281,7 @@ blanks   [ \t\f]+
   "\""        YY_OBS_GROW; yy_push_state (SC_ESCAPED_STRING);
 
   /* Comments. */
-  "/*"        yy_push_state (SC_COMMENT);
+  "/*"        BEGIN SC_YACC_COMMENT;
   "//".*      YY_STEP;
 
   /* Prologue. */
@@ -184,7 +291,7 @@ blanks   [ \t\f]+
   "{"         YY_OBS_GROW; ++braces_level; yy_push_state (SC_BRACED_CODE);
 
   /* A type. */
-  "<"[^>]+">" {
+  "<"{tag}">" {
     obstack_grow (&string_obstack, yytext + 1, yyleng - 2);
     YY_OBS_FINISH;
     yylval->string = last_string;
@@ -206,41 +313,48 @@ blanks   [ \t\f]+
 }
 
 
-  /*------------------------------------------------------------.
-  | Whatever the start condition (but those which correspond to |
-  | entity `swallowed' by Bison: SC_ESCAPED_STRING and          |
-  | SC_ESCAPED_CHARACTER), no M4 character must escape as is.   |
-  `------------------------------------------------------------*/
+  /*-------------------------------------------------------------------.
+  | Whatever the start condition (but those which correspond to        |
+  | entities `swallowed' by Bison: SC_YACC_COMMENT, SC_ESCAPED_STRING, |
+  | and SC_ESCAPED_CHARACTER), no M4 character must escape as is.      |
+  `-------------------------------------------------------------------*/
 
-<SC_COMMENT,SC_STRING,SC_CHARACTER,SC_BRACED_CODE,SC_PROLOGUE,SC_EPILOGUE>
+<SC_COMMENT,SC_LINE_COMMENT,SC_STRING,SC_CHARACTER,SC_BRACED_CODE,SC_PROLOGUE,SC_EPILOGUE>
 {
-  \[          if (YY_START != SC_COMMENT) obstack_sgrow (&string_obstack, "@<:@");
-  \]          if (YY_START != SC_COMMENT) obstack_sgrow (&string_obstack, "@:>@");
+  \[   obstack_sgrow (&string_obstack, "@<:@");
+  \]   obstack_sgrow (&string_obstack, "@:>@");
 }
 
 
+  /*---------------------------------------------------------------.
+  | Scanning a Yacc comment.  The initial `/ *' is already eaten.  |
+  `---------------------------------------------------------------*/
 
-  /*-----------------------------------------------------------.
-  | Scanning a C comment. The initial `/ *' is already eaten.  |
-  `-----------------------------------------------------------*/
-
-<SC_COMMENT>
+<SC_YACC_COMMENT>
 {
-  "*/" { /* End of the comment. */
-    if (yy_top_state () == INITIAL)
-      {
-       YY_STEP;
-      }
-    else
-      {
-       YY_OBS_GROW;
-      }
-    yy_pop_state ();
+  "*/" {
+    YY_STEP;
+    BEGIN INITIAL;
   }
 
-  [^\[\]*\n\r]+        if (yy_top_state () != INITIAL) YY_OBS_GROW;
-  {eols}       if (yy_top_state () != INITIAL) YY_OBS_GROW; YY_LINES;
-  .             /* Stray `*'. */if (yy_top_state () != INITIAL) YY_OBS_GROW;
+  [^*]+|"*"  ;
+
+  <<EOF>> {
+    LOCATION_PRINT (stderr, *yylloc);
+    fprintf (stderr, _(": unexpected end of file in a comment\n"));
+    BEGIN INITIAL;
+  }
+}
+
+
+  /*------------------------------------------------------------.
+  | Scanning a C comment.  The initial `/ *' is already eaten.  |
+  `------------------------------------------------------------*/
+
+<SC_COMMENT>
+{
+  "*"{splice}"/"  YY_OBS_GROW; yy_pop_state ();
+  [^*\[\]]+|"*"   YY_OBS_GROW;
 
   <<EOF>> {
     LOCATION_PRINT (stderr, *yylloc);
@@ -250,6 +364,18 @@ blanks   [ \t\f]+
 }
 
 
+  /*--------------------------------------------------------------.
+  | Scanning a line comment.  The initial `//' is already eaten.  |
+  `--------------------------------------------------------------*/
+
+<SC_LINE_COMMENT>
+{
+  "\n"                  YY_OBS_GROW; yy_pop_state ();
+  ([^\n\[\]]|{splice})+  YY_OBS_GROW;
+  <<EOF>>               yy_pop_state ();
+}
+
+
   /*----------------------------------------------------------------.
   | Scanning a C string, including its escapes.  The initial `"' is |
   | already eaten.                                                  |
@@ -267,9 +393,7 @@ blanks   [ \t\f]+
     return STRING;
   }
 
-  [^\"\n\r\\]+      YY_OBS_GROW;
-
-  {eols}    obstack_1grow (&string_obstack, '\n'); YY_LINES;
+  [^\"\\]+  YY_OBS_GROW;
 
   <<EOF>> {
     LOCATION_PRINT (stderr, *yylloc);
@@ -305,9 +429,7 @@ blanks   [ \t\f]+
     }
   }
 
-  [^\n\r\\] YY_OBS_GROW;
-
-  {eols}    obstack_1grow (&string_obstack, '\n'); YY_LINES;
+  [^'\\]+  YY_OBS_GROW;
 
   <<EOF>> {
     LOCATION_PRINT (stderr, *yylloc);
@@ -327,9 +449,9 @@ blanks   [ \t\f]+
 
 <SC_ESCAPED_STRING,SC_ESCAPED_CHARACTER>
 {
-  \\[0-7]{3}           {
-    long c = strtol (yytext + 1, 0, 8);
-    if (c > 255)
+  \\[0-7]{1,3} {
+    unsigned long c = strtoul (yytext + 1, 0, 8);
+    if (UCHAR_MAX < c)
       {
        LOCATION_PRINT (stderr, *yylloc);
        fprintf (stderr, _(": invalid escape: %s\n"), quote (yytext));
@@ -339,8 +461,18 @@ blanks   [ \t\f]+
       obstack_1grow (&string_obstack, c);
   }
 
-  \\x[0-9a-fA-F]{2}    {
-    obstack_1grow (&string_obstack, strtol (yytext + 2, 0, 16));
+  \\x[0-9a-fA-F]+ {
+    unsigned long c;
+    errno = 0;
+    c = strtoul (yytext + 2, 0, 16);
+    if (UCHAR_MAX < c || errno)
+      {
+       LOCATION_PRINT (stderr, *yylloc);
+       fprintf (stderr, _(": invalid escape: %s\n"), quote (yytext));
+       YY_STEP;
+      }
+    else
+      obstack_1grow (&string_obstack, c);
   }
 
   \\a  obstack_1grow (&string_obstack, '\a');
@@ -350,7 +482,18 @@ blanks   [ \t\f]+
   \\r  obstack_1grow (&string_obstack, '\r');
   \\t  obstack_1grow (&string_obstack, '\t');
   \\v  obstack_1grow (&string_obstack, '\v');
-  \\[\\""'']   obstack_1grow (&string_obstack, yytext[1]);
+  \\[\"'?\\]  obstack_1grow (&string_obstack, yytext[1]);
+  \\(u|U[0-9a-fA-F]{4})[0-9a-fA-F]{4} {
+    int c = convert_ucn_to_byte (yytext);
+    if (c < 0)
+      {
+       LOCATION_PRINT (stderr, *yylloc);
+       fprintf (stderr, _(": invalid escape: %s\n"), quote (yytext));
+       YY_STEP;
+      }
+    else
+      obstack_1grow (&string_obstack, c);
+  }
   \\(.|\n)     {
     LOCATION_PRINT (stderr, *yylloc);
     fprintf (stderr, _(": unrecognized escape: %s\n"), quote (yytext));
@@ -374,13 +517,12 @@ blanks   [ \t\f]+
     yy_pop_state ();
   }
 
-  [^\[\]\'\n\r\\]+     YY_OBS_GROW;
-  \\(.|\n)             YY_OBS_GROW;
-  /* FLex wants this rule, in case of a `\<<EOF>>'. */
+  [^'\[\]\\]+         YY_OBS_GROW;
+  \\{splice}[^\[\]]    YY_OBS_GROW;
+  {splice}            YY_OBS_GROW;
+  /* Needed for `\<<EOF>>', `\\<<newline>>[', and `\\<<newline>>]'.  */
   \\                   YY_OBS_GROW;
 
-  {eols}               YY_OBS_GROW; YY_LINES;
-
   <<EOF>> {
     LOCATION_PRINT (stderr, *yylloc);
     fprintf (stderr, _(": unexpected end of file in a character\n"));
@@ -403,13 +545,12 @@ blanks   [ \t\f]+
     yy_pop_state ();
   }
 
-  [^\[\]\"\n\r\\]+      YY_OBS_GROW;
-  \\(.|\n)              YY_OBS_GROW;
-  /* FLex wants this rule, in case of a `\<<EOF>>'. */
+  [^\"\[\]\\]+        YY_OBS_GROW;
+  \\{splice}[^\[\]]    YY_OBS_GROW;
+  {splice}            YY_OBS_GROW;
+  /* Needed for `\<<EOF>>', `\\<<newline>>[', and `\\<<newline>>]'.  */
   \\                   YY_OBS_GROW;
 
-  {eols}                YY_OBS_GROW; YY_LINES;
-
   <<EOF>> {
     LOCATION_PRINT (stderr, *yylloc);
     fprintf (stderr, _(": unexpected end of file in a string\n"));
@@ -432,8 +573,8 @@ blanks   [ \t\f]+
   "\""        YY_OBS_GROW; yy_push_state (SC_STRING);
 
   /* Comments. */
-  "/*"        YY_OBS_GROW; yy_push_state (SC_COMMENT);
-  "//".*      YY_OBS_GROW;
+  "/"{splice}"*"  YY_OBS_GROW; yy_push_state (SC_COMMENT);
+  "/"{splice}"/"  YY_OBS_GROW; yy_push_state (SC_LINE_COMMENT);
 
   /* Not comments. */
   "/"         YY_OBS_GROW;
@@ -461,15 +602,14 @@ blanks   [ \t\f]+
 
   "{"                  YY_OBS_GROW; braces_level++;
 
-  "$"("<"[^>]+">")?(-?[0-9]+|"$") { handle_dollar (current_braced_code,
+  "$"("<"{tag}">")?(-?[0-9]+|"$") { handle_dollar (current_braced_code,
                                                   yytext, *yylloc); }
   "@"(-?[0-9]+|"$")               { handle_at (current_braced_code,
                                               yytext, *yylloc); }
 
-  [^$@\[\]/\'\"\{\}\n\r]+ YY_OBS_GROW;
-  {eols}       YY_OBS_GROW; YY_LINES;
+  [^$@\[\]/'\"\{\}]+     YY_OBS_GROW;
 
-  /* A lose $, or /, or etc. */
+  /* A stray $, or /, or etc. */
   .             YY_OBS_GROW;
 
   <<EOF>> {
@@ -497,9 +637,8 @@ blanks   [ \t\f]+
     return PROLOGUE;
   }
 
-  [^%\[\]/\'\"\n\r]+ YY_OBS_GROW;
+  [^%\[\]/'\"]+      YY_OBS_GROW;
   "%"                YY_OBS_GROW;
-  {eols}            YY_OBS_GROW; YY_LINES;
 
   <<EOF>> {
     LOCATION_PRINT (stderr, *yylloc);
@@ -514,12 +653,12 @@ blanks   [ \t\f]+
 
   /*---------------------------------------------------------------.
   | Scanning the epilogue (everything after the second "%%", which |
-  | has already been eaten.                                        |
+  | has already been eaten).                                       |
   `---------------------------------------------------------------*/
 
 <SC_EPILOGUE>
 {
-  ([^\[\]]|{eols})+  YY_OBS_GROW;
+  [^\[\]]+  YY_OBS_GROW;
 
   <<EOF>> {
     yy_pop_state ();
@@ -568,14 +707,15 @@ handle_action_dollar (char *text, locati
       obstack_fgrow1 (&string_obstack,
                      "]b4_lhs_value([%s])[", type_name);
     }
-  else if (('0' <= *cp && *cp <= '9') || *cp == '-')
+  else
     {
-      int n = strtol (cp, &cp, 10);
+      long num;
+      errno = 0;
+      num = strtol (cp, 0, 10);
 
-      if (n > rule_length)
-       complain_at (location, _("invalid value: %s%d"), "$", n);
-      else
+      if (INT_MIN <= num && num <= rule_length && ! errno)
        {
+         int n = num;
          if (!type_name && n > 0)
            type_name = symbol_list_n_type_name_get (current_rule, location,
                                                     n);
@@ -588,16 +728,14 @@ handle_action_dollar (char *text, locati
                          "]b4_rhs_value([%d], [%d], [%s])[",
                          rule_length, n, type_name);
        }
-    }
-  else
-    {
-      complain_at (location, _("%s is invalid"), quote (text));
+      else
+       complain_at (location, _("invalid value: %s"), text);
     }
 }
 
 
 /*---------------------------------------------------------------.
-| TEXT is expexted tp be $$ in some code associated to a symbol: |
+| TEXT is expected to be $$ in some code associated to a symbol: |
 | destructor or printer.                                         |
 `---------------------------------------------------------------*/
 
@@ -608,7 +746,7 @@ handle_symbol_code_dollar (char *text, l
   if (*cp == '$')
     obstack_sgrow (&string_obstack, "]b4_dollar_dollar[");
   else
-    complain_at (location, _("%s is invalid"), quote (text));
+    complain_at (location, _("%s is invalid"), quote_n (1, text));
 }
 
 
@@ -650,25 +788,26 @@ handle_action_at (char *text, location_t
     {
       obstack_sgrow (&string_obstack, "]b4_lhs_location[");
     }
-  else if (('0' <= *cp && *cp <= '9') || *cp == '-')
+  else
     {
-      int n = strtol (cp, &cp, 10);
+      long num;
+      errno = 0;
+      num = strtol (cp, 0, 10);
 
-      if (n > rule_length)
-       complain_at (location, _("invalid value: %s%d"), "@", n);
+      if (INT_MIN <= num && num <= rule_length && ! errno)
+       {
+         int n = num;
+         obstack_fgrow2 (&string_obstack, "]b4_rhs_location([%d], [%d])[",
+                         rule_length, n);
+       }
       else
-       obstack_fgrow2 (&string_obstack, "]b4_rhs_location([%d], [%d])[",
-                       rule_length, n);
-    }
-  else
-    {
-      complain_at (location, _("%s is invalid"), quote (text));
+       complain_at (location, _("invalid value: %s"), text);
     }
 }
 
 
 /*---------------------------------------------------------------.
-| TEXT is expexted tp be @$ in some code associated to a symbol: |
+| TEXT is expected to be @$ in some code associated to a symbol: |
 | destructor or printer.                                         |
 `---------------------------------------------------------------*/
 
@@ -679,7 +818,7 @@ handle_symbol_code_at (char *text, locat
   if (*cp == '$')
     obstack_sgrow (&string_obstack, "]b4_at_dollar[");
   else
-    complain_at (location, _("%s is invalid"), quote (text));
+    complain_at (location, _("%s is invalid"), quote_n (1, text));
 }
 
 
@@ -703,6 +842,62 @@ handle_at (braced_code_t braced_code_kin
       handle_symbol_code_at (text, location);
       break;
     }
+}
+
+
+/*------------------------------------------------------------------.
+| Convert universal character name UCN to a single-byte character,  |
+| and return that character.  Return -1 if UCN does not correspond  |
+| to a single-byte character.                                      |
+`------------------------------------------------------------------*/
+
+static int
+convert_ucn_to_byte (char const *ucn)
+{
+  unsigned long code = strtoul (ucn + 2, 0, 16);
+
+  /* FIXME: Currently we assume Unicode-compatible unibyte characters
+     on ASCII hosts (i.e., Latin-1 on hosts with 8-bit bytes).  On
+     non-ASCII hosts we support only the portable C character set.
+     These limitations should be removed once we add support for
+     multibyte characters.  */
+
+  if (UCHAR_MAX < code)
+    return -1;
+
+#if ! ('$' == 0x24 && '@' == 0x40 && '`' == 0x60 && '~' == 0x7e)
+  {
+    /* A non-ASCII host.  Use CODE to index into a table of the C
+       basic execution character set, which is guaranteed to exist on
+       all Standard C platforms.  This table also includes '$', '@',
+       and '`', which are not in the basic execution character set but
+       which are unibyte characters on all the platforms that we know
+       about.  */
+    static signed char const table[] =
+      {
+       '\0',   -1,   -1,   -1,   -1,   -1,   -1, '\a',
+       '\b', '\t', '\n', '\v', '\f', '\r',   -1,   -1,
+         -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,
+         -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,
+        ' ',  '!',  '"',  '#',  '$',  '%',  '&', '\'',
+        '(',  ')',  '*',  '+',  ',',  '-',  '.',  '/',
+        '0',  '1',  '2',  '3',  '4',  '5',  '6',  '7',
+        '8',  '9',  ':',  ';',  '<',  '=',  '>',  '?',
+        '@',  'A',  'B',  'C',  'D',  'E',  'F',  'G',
+        'H',  'I',  'J',  'K',  'L',  'M',  'N',  'O',
+        'P',  'Q',  'R',  'S',  'T',  'U',  'V',  'W',
+        'X',  'Y',  'Z',  '[', '\\',  ']',  '^',  '_',
+        '`',  'a',  'b',  'c',  'd',  'e',  'f',  'g',
+        'h',  'i',  'j',  'k',  'l',  'm',  'n',  'o',
+        'p',  'q',  'r',  's',  't',  'u',  'v',  'w',
+        'x',  'y',  'z',  '{',  '|',  '}',  '~'
+      };
+
+    code = code < sizeof table ? table[code] : -1;
+  }
+#endif
+      
+  return code;
 }
 
 
Index: tests/input.at
===================================================================
RCS file: /cvsroot/bison/bison/tests/input.at,v
retrieving revision 1.12
diff -p -u -r1.12 input.at
--- tests/input.at      14 Oct 2002 08:43:36 -0000      1.12
+++ tests/input.at      3 Nov 2002 08:33:24 -0000
@@ -97,6 +97,22 @@ AT_DATA([input.y],
 /* This is seen in GCC: a %{ and %} in middle of a comment. */
 const char *foo = "So %{ and %} can be here too.";
 
+#ifdef __STDC__
+/\
+* A comment with backslash-newlines in it. %{ %} *\
+\
+/
+
+char str[] = "\\
+" A string with backslash-newlines in it %{ %} \\
+"";
+
+char apostrophe = '\\
+\
+'\
+';
+#endif
+
 #include <stdio.h>
 %}
 /* %{ and %} can be here too. */
@@ -128,14 +144,14 @@ static void yyerror (const char *s);
 static int yylex (void);
 %}
 
-%type <ival> '1'
+%type <ival> '@<:@'
 
 /* Exercise quotes in strings.  */
-%token FAKE "fake @<:@@:>@,"
+%token FAKE "fake @<:@@:>@ \a\b\f\n\r\t\v\"\'\?\\\u005B\U0000005c ??!??'??(??)??-??/??<??=??> \x0\0"
 
 %%
-/* Exercise M4 quoting: '@:>@@:>@', 1.  */
-exp: '1'
+/* Exercise M4 quoting: '@:>@@:>@', @<:@, 1.  */
+exp: '@<:@' '\1' '\x000000000000000000000000000000000000000000000000002'
   {
     /* Exercise quotes in braces.  */
     char tmp[] = "@<:@%c@:>@,\n";
@@ -143,7 +159,7 @@ exp: '1'
   }
 ;
 %%
-/* Exercise M4 quoting: '@:>@@:>@', 2.  */
+/* Exercise M4 quoting: '@:>@@:>@', @<:@, 2.  */
 
 static YYSTYPE
 value_t_as_yystype (value_t val)
@@ -156,7 +172,7 @@ value_t_as_yystype (value_t val)
 static int
 yylex (void)
 {
-  static const char *input = "1";
+  static const char *input = "@<:@\1\2";
   yylval = value_t_as_yystype (*input);
   return *input++;
 }
@@ -184,7 +200,7 @@ main (void)
 AT_CHECK([bison -d -v -o input.c input.y])
 AT_COMPILE([input], [input.c main.c])
 AT_PARSER_CHECK([./input], 0,
-[[[1],
+[[[@<:@],
 ]])
 
 AT_CLEANUP
--- /dev/null   2002-11-03 08:20:24.000000000 +0000
+++ lib/mbswidth.c      2001-09-22 14:43:52.000000000 +0000
@@ -0,0 +1,218 @@
+/* Determine the number of screen columns needed for a string.
+   Copyright (C) 2000-2001 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, write to the Free Software Foundation,
+   Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
+
+/* Written by Bruno Haible <address@hidden>.  */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+/* Specification.  */
+#include "mbswidth.h"
+
+/* Get MB_CUR_MAX.  */
+#include <stdlib.h>
+
+#include <string.h>
+
+/* Get isprint().  */
+#include <ctype.h>
+
+/* Get mbstate_t, mbrtowc(), mbsinit(), wcwidth().  */
+#if HAVE_WCHAR_H
+# include <wchar.h>
+#endif
+
+/* Get iswprint(), iswcntrl().  */
+#if HAVE_WCTYPE_H
+# include <wctype.h>
+#endif
+#if !defined iswprint && !HAVE_ISWPRINT
+# define iswprint(wc) 1
+#endif
+#if !defined iswcntrl && !HAVE_ISWCNTRL
+# define iswcntrl(wc) 0
+#endif
+
+#ifndef mbsinit
+# if !HAVE_MBSINIT
+#  define mbsinit(ps) 1
+# endif
+#endif
+
+#ifndef HAVE_DECL_WCWIDTH
+"this configure-time declaration test was not run"
+#endif
+#if !HAVE_DECL_WCWIDTH
+int wcwidth ();
+#endif
+
+#ifndef wcwidth
+# if !HAVE_WCWIDTH
+/* wcwidth doesn't exist, so assume all printable characters have
+   width 1.  */
+#  define wcwidth(wc) ((wc) == 0 ? 0 : iswprint (wc) ? 1 : -1)
+# endif
+#endif
+
+/* Get ISPRINT.  */
+#if defined (STDC_HEADERS) || (!defined (isascii) && !defined (HAVE_ISASCII))
+# define IN_CTYPE_DOMAIN(c) 1
+#else
+# define IN_CTYPE_DOMAIN(c) isascii(c)
+#endif
+/* Undefine to protect against the definition in wctype.h of solaris2.6.   */
+#undef ISPRINT
+#define ISPRINT(c) (IN_CTYPE_DOMAIN (c) && isprint (c))
+#undef ISCNTRL
+#define ISCNTRL(c) (IN_CTYPE_DOMAIN (c) && iscntrl (c))
+
+/* Returns the number of columns needed to represent the multibyte
+   character string pointed to by STRING.  If a non-printable character
+   occurs, and MBSW_REJECT_UNPRINTABLE is specified, -1 is returned.
+   With flags = MBSW_REJECT_INVALID | MBSW_REJECT_UNPRINTABLE, this is
+   the multibyte analogon of the wcswidth function.  */
+int
+mbswidth (string, flags)
+     const char *string;
+     int flags;
+{
+  return mbsnwidth (string, strlen (string), flags);
+}
+
+/* Returns the number of columns needed to represent the multibyte
+   character string pointed to by STRING of length NBYTES.  If a
+   non-printable character occurs, and MBSW_REJECT_UNPRINTABLE is
+   specified, -1 is returned.  */
+int
+mbsnwidth (string, nbytes, flags)
+     const char *string;
+     size_t nbytes;
+     int flags;
+{
+  const char *p = string;
+  const char *plimit = p + nbytes;
+  int width;
+
+  width = 0;
+#if HAVE_MBRTOWC
+  if (MB_CUR_MAX > 1)
+    {
+      while (p < plimit)
+       switch (*p)
+         {
+           case ' ': case '!': case '"': case '#': case '%':
+           case '&': case '\'': case '(': case ')': case '*':
+           case '+': case ',': case '-': case '.': case '/':
+           case '0': case '1': case '2': case '3': case '4':
+           case '5': case '6': case '7': case '8': case '9':
+           case ':': case ';': case '<': case '=': case '>':
+           case '?':
+           case 'A': case 'B': case 'C': case 'D': case 'E':
+           case 'F': case 'G': case 'H': case 'I': case 'J':
+           case 'K': case 'L': case 'M': case 'N': case 'O':
+           case 'P': case 'Q': case 'R': case 'S': case 'T':
+           case 'U': case 'V': case 'W': case 'X': case 'Y':
+           case 'Z':
+           case '[': case '\\': case ']': case '^': case '_':
+           case 'a': case 'b': case 'c': case 'd': case 'e':
+           case 'f': case 'g': case 'h': case 'i': case 'j':
+           case 'k': case 'l': case 'm': case 'n': case 'o':
+           case 'p': case 'q': case 'r': case 's': case 't':
+           case 'u': case 'v': case 'w': case 'x': case 'y':
+           case 'z': case '{': case '|': case '}': case '~':
+             /* These characters are printable ASCII characters.  */
+             p++;
+             width++;
+             break;
+           default:
+             /* If we have a multibyte sequence, scan it up to its end.  */
+             {
+               mbstate_t mbstate;
+               memset (&mbstate, 0, sizeof mbstate);
+               do
+                 {
+                   wchar_t wc;
+                   size_t bytes;
+                   int w;
+
+                   bytes = mbrtowc (&wc, p, plimit - p, &mbstate);
+
+                   if (bytes == (size_t) -1)
+                     /* An invalid multibyte sequence was encountered.  */
+                     {
+                       if (!(flags & MBSW_REJECT_INVALID))
+                         {
+                           p++;
+                           width++;
+                           break;
+                         }
+                       else
+                         return -1;
+                     }
+
+                   if (bytes == (size_t) -2)
+                     /* An incomplete multibyte character at the end.  */
+                     {
+                       if (!(flags & MBSW_REJECT_INVALID))
+                         {
+                           p = plimit;
+                           width++;
+                           break;
+                         }
+                       else
+                         return -1;
+                     }
+
+                   if (bytes == 0)
+                     /* A null wide character was encountered.  */
+                     bytes = 1;
+
+                   w = wcwidth (wc);
+                   if (w >= 0)
+                     /* A printable multibyte character.  */
+                     width += w;
+                   else
+                     /* An unprintable multibyte character.  */
+                     if (!(flags & MBSW_REJECT_UNPRINTABLE))
+                       width += (iswcntrl (wc) ? 0 : 1);
+                     else
+                       return -1;
+
+                   p += bytes;
+                 }
+               while (! mbsinit (&mbstate));
+             }
+             break;
+         }
+      return width;
+    }
+#endif
+
+  while (p < plimit)
+    {
+      unsigned char c = (unsigned char) *p++;
+
+      if (ISPRINT (c))
+       width++;
+      else if (!(flags & MBSW_REJECT_UNPRINTABLE))
+       width += (ISCNTRL (c) ? 0 : 1);
+      else
+       return -1;
+    }
+  return width;
+}
--- /dev/null   2002-11-03 08:20:24.000000000 +0000
+++ lib/mbswidth.h      2001-11-10 00:13:19.000000000 +0000
@@ -0,0 +1,45 @@
+/* Determine the number of screen columns needed for a string.
+   Copyright (C) 2000-2001 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, write to the Free Software Foundation,
+   Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
+
+#include <stddef.h>
+
+#ifndef PARAMS
+# if __STDC__ || defined __GNUC__ || defined __SUNPRO_C || defined __cplusplus || __PROTOTYPES
+#  define PARAMS(Args) Args
+# else
+#  define PARAMS(Args) ()
+# endif
+#endif
+
+/* Optional flags to influence mbswidth/mbsnwidth behavior.  */
+
+/* If this bit is set, return -1 upon finding an invalid or incomplete
+   character.  Otherwise, assume invalid characters have width 1.  */
+#define MBSW_REJECT_INVALID 1
+
+/* If this bit is set, return -1 upon finding a non-printable character.
+   Otherwise, assume unprintable characters have width 0 if they are
+   control characters and 1 otherwise.  */
+#define MBSW_REJECT_UNPRINTABLE        2
+
+/* Returns the number of screen columns needed for STRING.  */
+#define mbswidth gnu_mbswidth  /* avoid clash with UnixWare 7.1.1 function */
+extern int mbswidth PARAMS ((const char *string, int flags));
+
+/* Returns the number of screen columns needed for the NBYTES bytes
+   starting at BUF.  */
+extern int mbsnwidth PARAMS ((const char *buf, size_t nbytes, int flags));
--- /dev/null   2002-11-03 08:20:24.000000000 +0000
+++ m4/mbswidth.m4      2002-06-21 17:41:02.000000000 +0000
@@ -0,0 +1,36 @@
+#serial 7
+
+dnl autoconf tests required for use of mbswidth.c
+dnl From Bruno Haible.
+
+AC_DEFUN([jm_PREREQ_MBSWIDTH],
+[
+  AC_REQUIRE([AC_HEADER_STDC])
+  AC_CHECK_HEADERS(limits.h stdlib.h string.h wchar.h wctype.h)
+  AC_CHECK_FUNCS(isascii iswcntrl iswprint mbsinit wcwidth)
+  jm_FUNC_MBRTOWC
+
+  AC_CACHE_CHECK([whether wcwidth is declared], ac_cv_have_decl_wcwidth,
+    [AC_TRY_COMPILE([
+/* AIX 3.2.5 declares wcwidth in <string.h>. */
+#if HAVE_STRING_H
+# include <string.h>
+#endif
+#if HAVE_WCHAR_H
+# include <wchar.h>
+#endif
+], [
+#ifndef wcwidth
+  char *p = (char *) wcwidth;
+#endif
+], ac_cv_have_decl_wcwidth=yes, ac_cv_have_decl_wcwidth=no)])
+  if test $ac_cv_have_decl_wcwidth = yes; then
+    ac_val=1
+  else
+    ac_val=0
+  fi
+  AC_DEFINE_UNQUOTED(HAVE_DECL_WCWIDTH, $ac_val,
+    [Define to 1 if you have the declaration of wcwidth(), and to 0 otherwise.])
+
+  AC_TYPE_MBSTATE_T
+])
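
For readers skimming the patch, here is a minimal sketch of the single-byte
fallback loop at the end of mbsnwidth.  It is not part of the patch.  The names
sketch_width and SKETCH_REJECT_UNPRINTABLE are hypothetical stand-ins for
mbsnwidth and MBSW_REJECT_UNPRINTABLE, and the sketch assumes the "C" locale
and skips the IN_CTYPE_DOMAIN guard.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for MBSW_REJECT_UNPRINTABLE.  */
#define SKETCH_REJECT_UNPRINTABLE 2

/* Count the columns needed for the NBYTES bytes starting at BUF,
   treating every byte as one character (no multibyte decoding).
   Printable bytes count 1.  An unprintable byte yields -1 if
   SKETCH_REJECT_UNPRINTABLE is set in FLAGS; otherwise it counts 0
   if it is a control character and 1 if it is not.  */
static int
sketch_width (const char *buf, size_t nbytes, int flags)
{
  int width = 0;
  size_t i;

  for (i = 0; i < nbytes; i++)
    {
      /* Convert to unsigned char before calling the <ctype.h> functions.  */
      unsigned char c = (unsigned char) buf[i];

      if (isprint (c))
        width++;
      else if (flags & SKETCH_REJECT_UNPRINTABLE)
        return -1;
      else if (!iscntrl (c))
        width++;
    }
  return width;
}
```

With flags 0 a tab adds no columns in this sketch, which seems consistent
with the ChangeLog's note that the scanner accounts for tabs separately.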